Phylogenomic analyses data of the avian phylogenomics project

Background
Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses.

Findings
Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence.

Conclusions
The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Electronic supplementary material
The online version of this article (doi:10.1186/s13742-014-0038-1) contains supplementary material, which is available to authorized users.


DATA NOTE Open Access

Data description
Here we present FASTA files of loci, sequence alignments, indels, transposable elements, and Newick files of gene trees and species trees used in the Avian Phylogenomics Project [1][2][3][4]. We also include scripts used to process the data. The 48 species from which we collected these data span the phylogeny of modern birds, including representatives of all Neognathae (Neoaves and Galloanseres) and two of the five Palaeognathae orders (Table 1) [5][6][7].

Explanation of various data sets used to infer gene and species trees
Here we describe each locus data set in brief. Additional details are provided in Jarvis et al. [1].

Protein-coding exon gene set
This is an exon-coding sequence data set of 8295 genes based on synteny-defined orthologs we identified and selected from the assembled genomes of chicken and zebra finch [8,9]. We required these loci to be present in at least 42 of the 48 avian species and outgroups, which allowed for missing data due to incomplete assemblies. To be included in the dataset, the exons in each genome assembly had to cover 30% or more of the full-length sequence of the chicken or zebra finch ortholog. Annotated untranslated regions (UTRs) were trimmed off to remove non-coding sequence, in order to infer a coding-only sequence phylogeny. We note that 44 genes were identified with various problems, such as gene annotation issues, and we removed them from the phylogenetic analyses. However, we provide them here in the unfiltered alignments.
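The two inclusion criteria for exon loci can be illustrated with a minimal Python sketch. It assumes that both criteria are applied together, so that a species counts towards the 42-species minimum only if its sequence reaches 30% of the reference ortholog length; that combination, and the function and parameter names, are assumptions made for illustration and do not describe the project's own scripts.

```python
from typing import Dict, Optional

MIN_TAXA = 42                # minimum number of species that must carry the locus
MIN_LENGTH_FRACTION = 0.30   # minimum fraction of the reference ortholog length


def passes_exon_filter(
    seq_lengths: Dict[str, Optional[int]],
    reference_length: int,
    min_taxa: int = MIN_TAXA,
    min_fraction: float = MIN_LENGTH_FRACTION,
) -> bool:
    """Return True if enough species carry a sufficiently complete copy.

    seq_lengths maps a species name to the length of its exon sequence,
    or to None when the locus was not found in that assembly.
    """
    usable = [
        species
        for species, length in seq_lengths.items()
        if length is not None and length >= min_fraction * reference_length
    ]
    return len(usable) >= min_taxa
```

A locus with 42 near-complete copies and six short fragments would pass; the same locus would fail if one of the 42 copies dropped below the length threshold.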

Protein amino acid alignment set
These are alignments of the translated peptide sequences for the 8295 protein-coding gene data set.

Intron gene set
This is an orthologous subset of introns from the 8295 protein-coding genes among 52 species (includes outgroups). Introns with conserved annotated exon-intron boundaries between chicken and another species (±1 codon) were chosen. We filtered out introns with length < 50 bp or intron length ratio > 1.5 between chicken and another species or another species and chicken. This filtering resulted in a conservative subset of introns that could be reliably identified and aligned.
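The length-based intron filter can be sketched in Python as below. The sketch assumes that the 50 bp minimum applies to both the chicken intron and the intron of the other species, and it computes the ratio as longer over shorter so that both directions of the comparison are covered; the function name is hypothetical.

```python
def keep_intron(
    chicken_len: int,
    other_len: int,
    min_len: int = 50,
    max_ratio: float = 1.5,
) -> bool:
    """Return True if an intron pair survives the length filters."""
    # Discard introns shorter than the minimum length (assumed to apply to both species).
    if chicken_len < min_len or other_len < min_len:
        return False
    # Longer / shorter covers "chicken vs. other" and "other vs. chicken" at once.
    ratio = max(chicken_len, other_len) / min(chicken_len, other_len)
    return ratio <= max_ratio
```

Because the ratio is symmetric, an intron pair of 100 bp and 150 bp is kept regardless of which species carries the longer copy.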

UCE locus set
This is the ultraconserved element (UCE) data set with 1000 bp flanking sequence at the 3′ and 5′ ends. The UCE dataset was filtered to remove overlap with the above exon and intron data sets, other exons and introns in the chicken genome assembly version 3, and overlapping sequences among the UCEs. The source UCE sequences used to search the genomes were determined from sequence capture probes [10][11][12] aligned to each avian genome assembly. Unlike the exon and intron data sets, we required that all 42 avian species and the alligator outgroup contain the UCEs. We found this requirement to be sufficient, because the central portions of UCEs are highly conserved across all species.

High and low variance introns and exons
These four data sets represent the 10% subsets of the 8295 exons and their associated introns when available (i.e. from the same genes) that had the highest and lowest variance in GC3 (third codon position) content across species. To calculate GC3 variance, we first calculated GC3 for each ortholog in each species, and then we used the correlation coefficient R to calculate variance in GC3 for each species. Orthologs were ranked by their GC3 variance and we selected the top and bottom 10% for analyses.
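The ranking of orthologs by GC3 variance can be illustrated with a minimal Python sketch. It assumes in-frame coding sequences for each species, and it uses the plain population variance of GC3 across species as a stand-in for the variance calculation described above; the function names are hypothetical and the sketch is not the script used in the project.

```python
import math
from statistics import pvariance
from typing import Dict, List, Tuple


def gc3(cds: str) -> float:
    """Fraction of G or C among unambiguous third codon positions."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    third = [b for b in cds[2:usable:3].upper() if b in "ACGT"]
    if not third:
        return float("nan")
    return sum(b in "GC" for b in third) / len(third)


def gc3_variance(cds_by_species: Dict[str, str]) -> float:
    """Population variance of GC3 across the species of one ortholog."""
    values = [gc3(seq) for seq in cds_by_species.values()]
    values = [v for v in values if not math.isnan(v)]
    return pvariance(values)


def lowest_and_highest(
    variances: Dict[str, float], fraction: float = 0.10
) -> Tuple[List[str], List[str]]:
    """Return the orthologs in the bottom and top `fraction` by variance."""
    ranked = sorted(variances, key=variances.get)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]
```

Gaps and ambiguity codes at third positions are skipped rather than counted, which is one of several reasonable choices for aligned input.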

Supergenes
These are the concatenated sets of loci from various partitions of the TENT dataset (exons, introns, and UCEs described above), brought together using the statistical binning approach. The statistical binning approach put together sets of loci that were deemed "combinable". Two genes were considered combinable if their respective gene trees had no pairs of incompatible branches that had bootstrap support above a 50% threshold. Alignments of genes in the same bin were concatenated to form supergenes, but boundaries of genes were kept so that a gene-partitioned phylogenetic analysis could be performed on each supergene.
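The combinability criterion can be sketched in Python as follows. The sketch assumes that both gene trees share the same taxon set and that each tree is represented as a mapping from a bipartition (the set of taxa on one side of a branch) to its bootstrap support; this representation and the function names are illustrative assumptions, not the statistical binning implementation itself.

```python
from typing import Dict, FrozenSet

Split = FrozenSet[str]


def compatible(a: Split, b: Split, taxa: FrozenSet[str]) -> bool:
    """Two bipartitions are compatible if one of the four side intersections is empty."""
    a_other = taxa - a
    b_other = taxa - b
    return (
        not (a & b)
        or not (a & b_other)
        or not (a_other & b)
        or not (a_other & b_other)
    )


def combinable(
    tree1: Dict[Split, float],
    tree2: Dict[Split, float],
    taxa: FrozenSet[str],
    threshold: float = 50.0,
) -> bool:
    """True if no pair of branches with support above the threshold conflicts."""
    strong1 = [split for split, support in tree1.items() if support > threshold]
    strong2 = [split for split, support in tree2.items() if support > threshold]
    return all(compatible(a, b, taxa) for a in strong1 for b in strong2)
```

Under this rule, a conflict only blocks binning when both conflicting branches exceed the threshold; a well-supported branch that conflicts with a weakly supported one does not.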

Whole genome alignment
Whole genome alignments were first created by a LASTZ + MULTIZ alignment [13,14] (http://www.bx.psu.edu/miller_lab/) across all 48 bird species and outgroups, using individual chromosomes of the chicken genome as the reference (initial alignment 392,719,329 Mb). They were then filtered to remove segments with fewer than 42 avian species (>5 missing bird species) and aberrant sequence alignments. The individual remaining segments of the MULTIZ alignment were realigned with MAFFT. We did not use SATé + MAFFT due to computational challenges (too much input/output was required).

Indel dataset
5.7 million insertions and deletions (indels) were scored as binary characters locus by locus from the same intron, exon, and UCE alignments as used in the TENT data set on the principle of simple indel coding using 2Xread [15,16] and then concatenated. Coding was verified using GapCoder [17] and by visual inspection of alignments for a small subset of data. Intron indels were scored on alignments that excluded non-avian outgroups (48 taxa), UCE indels were scored on alignments that included Alligator (49 taxa), and exons were scored on alignments that included all non-avian outgroups (52 taxa). Individual introns of the same gene were scored independently to avoid creating artifactual indels between concatenated intron or whole genome segments, whereas exons were concatenated as complete unigenes before scoring. For exons, indels >30 bp were excluded to avoid scoring missing exons as indels.
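The principle of simple indel coding can be illustrated with a minimal Python sketch. Each gap with a distinct pair of start and end positions becomes one binary character; a sequence is scored 1 if it has exactly that gap, as missing (?) if it has a longer gap that spans it, and 0 otherwise. The sketch is a simplified illustration, not the behaviour of 2Xread or GapCoder: it does not treat leading or trailing gaps specially and it applies no length cut-off, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Gap = Tuple[int, int]  # half-open column interval [start, end)


def gap_runs(seq: str) -> List[Gap]:
    """Return all maximal runs of '-' in an aligned sequence."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", seq)]


def simple_indel_coding(alignment: Dict[str, str]) -> Tuple[List[Gap], Dict[str, str]]:
    """Score every distinct gap as a presence/absence character."""
    runs = {name: gap_runs(seq) for name, seq in alignment.items()}
    characters = sorted({run for seq_runs in runs.values() for run in seq_runs})
    matrix = {}
    for name, seq_runs in runs.items():
        row = []
        for start, end in characters:
            if (start, end) in seq_runs:
                row.append("1")      # this exact indel is present
            elif any(gs <= start and end <= ge for gs, ge in seq_runs):
                row.append("?")      # spanned by a longer gap: inapplicable
            else:
                row.append("0")      # indel absent
        matrix[name] = "".join(row)
    return characters, matrix
```

Scoring each locus separately, as described above, corresponds to calling the function once per alignment and concatenating the resulting rows per taxon.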

Transposable element markers
These are 61 manually curated presence/absence loci of transposable elements (TEs) present in the Barn Owl genome that exhibit presence at orthologous positions in one or more of the other avian species. The TE markers were identified by eye after a computational screening of 3,671 TguLTR5d retroposon insertions from the Barn Owl. For each TguLTR5d locus, we conducted BLASTn searches of TE-flanking sequences (1 kb per flank) against the remaining avian species and generated multispecies sequence alignments using MAFFT [18]. Redundant or potentially paralogous loci were excluded from analysis and the remaining marker candidates were carefully inspected using strict standard criteria for assigning presence/absence character states [19][20][21].

FASTA files of loci datasets in alignments
We provide the above loci data sets as FASTA files of both unfiltered and filtered sequence alignments. The alignments were filtered for aberrant over- and under-aligned sequences, and for the presence of the loci in 42 of the 48 avian species. All multiple sequence alignments were performed in two rounds. The first round was used to find contiguous portions of sequences that we identified as aberrant, and the second round was used to realign the filtered sequences. We used SATé [22,23] combined with either the MAFFT [18] or PRANK [24] alignment algorithm, depending on the limitations of working with large datasets. Alignments without and with outgroups are made available.

Filtered loci sequence alignments

Exon loci alignments
These are filtered alignments of exons from 8295 genes. Of these 8295 genes, 42 were identified as having annotation issues and were removed from the phylogenetic analyses (the list is provided in the file FASTA_files_of_loci_datasets/Filtered_sequence_alignments/8295_Exons/42-exon-genes-removed.txt). Two more genes were removed because a gene tree could not be estimated for them. The first round of alignment was performed using SATé + PRANK, and the second round was performed using SATé + MAFFT. Before alignment, the nucleotide sequences were translated to amino acid sequences, and the aligned amino acid sequences were converted back to nucleotide sequences afterwards.

Table 1. Listed are the scientific species name, English name, BioProject ID in the NCBI database for each genome (http://www.ncbi.nlm.nih.gov/bioproject), and GigaScience deposited genome sequences and raw reads. Full details are in [1,2].
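The conversion of an aligned amino acid sequence back to nucleotides, as used for the exon alignments, can be sketched in Python as below. The sketch assumes an in-frame coding sequence whose codons correspond one-to-one with the residues of the aligned protein; the function name is hypothetical and this is not the project's own conversion script.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread the codons of an unaligned CDS onto an aligned protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("number of residues and codons differ")
    remaining = iter(codons)
    # Every gap in the protein alignment becomes a three-column gap.
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)
```

Aligning at the amino acid level and threading codons back in this way keeps gaps in multiples of three, so the reading frame is preserved in the nucleotide alignment.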

Supergenes generated from statistical binning
These are concatenated alignments for each of our 2022 supergene alignments. We note that although supergenes are concatenated loci, we estimated supergene trees using partitioned analyses where each gene was put in a different partition. Thus, we also provide the boundaries between genes in text files (these can be directly used as partition input files to RAxML).
supergene-alignments.tar.bz2: supergene alignments with partition files showing genes put in each bin and their boundaries in the concatenated alignment
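As an illustration of how gene boundaries can be recovered from a supergene, the following Python sketch parses a RAxML-style partition file (lines of the form "DNA, name = start-end", with 1-based inclusive coordinates) and cuts a concatenated sequence back into its genes. It handles only simple contiguous ranges, and the gene names used in any example input are hypothetical; the exact contents of the deposited partition files should be checked before relying on it.

```python
import re
from typing import Dict, List, Tuple

# One partition per line: "<model>, <name> = <start>-<end>"
PARTITION_RE = re.compile(r"^\s*(\w+)\s*,\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_partitions(text: str) -> List[Tuple[str, int, int]]:
    """Return (name, start, end) for each partition line."""
    partitions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = PARTITION_RE.match(line)
        if match is None:
            raise ValueError(f"unsupported partition line: {line!r}")
        _model, name, start, end = match.groups()
        partitions.append((name, int(start), int(end)))
    return partitions


def split_supergene(sequence: str, partitions: List[Tuple[str, int, int]]) -> Dict[str, str]:
    """Cut one concatenated sequence into per-gene pieces."""
    # Partition coordinates are 1-based and inclusive; Python slices are 0-based, half-open.
    return {name: sequence[start - 1:end] for name, start, end in partitions}
```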

Unfiltered loci sequence alignments
These are individual loci alignments of the above data sets, before filtering.

Amino.Acid.unfiltered

WGT.unfiltered
These are uploaded as part of the comparative genomics paper [2] data note [25], and a link is provided here https://github.com/gigascience/paper-zhang2014.

FASTA files of concatenated datasets in alignments
We provide FASTA files of concatenated sequence alignments of the above filtered loci datasets. These are concatenated alignments that were used in the ExaML and RAxML analyses [3].
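Concatenation of per-locus alignments can be sketched in Python as follows. The sketch assumes each locus is held as a mapping from taxon name to aligned sequence, and it pads taxa that are absent from a locus with gap characters; the padding character and the function name are assumptions for illustration, not a description of how the deposited files were built.

```python
from typing import Dict, List


def concatenate(loci: List[Dict[str, str]], missing: str = "-") -> Dict[str, str]:
    """Join per-locus alignments into one alignment per taxon."""
    taxa = sorted({taxon for locus in loci for taxon in locus})
    pieces: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for locus in loci:
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this locus are padded to the locus width.
            pieces[taxon].append(locus.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in pieces.items()}
```

Recording the cumulative locus widths during this loop would give the gene boundaries needed for a partitioned analysis.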

List of scripts used in avian phylogenomics project
We also deposit the key scripts used in this project in GigaDB, which include:
Script for filtering amino acid alignments
Script for filtering nucleotide sequence alignments
Script for mapping names from 5-letter codes to full names
Scripts related to indel analyses
We provide readme files in the script directories describing the usage of the scripts.

Availability and requirements
Project name: Avian Phylogenomic Project scripts
Project home page: https://github.com/gigascience/paper-jarvis2014; also see companion paper home page for related data https://github.com/gigascience/paper-zhang2014
Operating system: Unix
Programming language: R, Perl, Python
License: GNU GPL v3
Any restrictions to use by non-academics: none

Availability of supporting data
Other data files presented in this data note for the majority of genomes are available in the GigaScience repository, GigaDB [26] (Table 1), as well as NCBI (Table 1).