Abstract

Background

Penguins (Sphenisciformes) are a remarkable order of flightless wing-propelled diving seabirds distributed widely across the southern hemisphere. They share a volant common ancestor with Procellariiformes close to the Cretaceous-Paleogene boundary (66 million years ago) and subsequently lost the ability to fly but enhanced their diving capabilities. With ∼20 species among 6 genera, penguins range from the tropical Galápagos Islands to the oceanic temperate forests of New Zealand, the rocky coastlines of the sub-Antarctic islands, and the sea ice around Antarctica. To inhabit such diverse and extreme environments, penguins evolved many physiological and morphological adaptations. However, they are also highly sensitive to climate change. Therefore, penguins provide an exciting target system for understanding the evolutionary processes of speciation, adaptation, and demography. Genomic data are an emerging resource for addressing questions about such processes.

Results

Here we present a novel dataset of 19 high-coverage genomes that, together with 2 previously published genomes, encompass all extant penguin species. We also present a well-supported phylogeny to clarify the relationships among penguins. In contrast to recent studies, our results demonstrate that the genus Aptenodytes is basal and sister to all other extant penguin genera, providing intriguing new insights into the adaptation of penguins to Antarctica. As such, our dataset provides a novel resource for understanding the evolutionary history of penguins as a clade, as well as the fine-scale relationships of individual penguin lineages. Against this background, we introduce a major consortium of international scientists dedicated to studying these genomes. Moreover, we highlight emerging issues regarding ensuring legal and respectful indigenous consultation, particularly for genomic data originating from New Zealand Taonga species.

Conclusions

We believe that our dataset and project will be important for understanding evolution, increasing cultural heritage and guiding the conservation of this iconic southern hemisphere species assemblage.

Data Description

Context

Penguins (Sphenisciformes) are a unique order of seabirds distributed widely across the southern hemisphere (Fig. 1). Approximately 20 extant penguin species are recognized across 6 well-defined genera (Aptenodytes, Pygoscelis, Eudyptula, Spheniscus, Eudyptes, and Megadyptes [1–3]). Debate has surrounded species/lineage boundaries in a few key areas:

  1. Divisions between New Zealand little blue (Eudyptula minor minor), New Zealand white-flippered (Eudyptula minor albosignata), and Australian fairy penguins (Eudyptula novaehollandiae) [4–6].

  2. Divisions between northern rockhopper (Eudyptes moseleyi), western rockhopper (Eudytes chrysocome), and eastern rockhopper penguins (Eudyptes filholi) [3, 7, 8].

  3. Divisions between Fiordland crested (Eudyptes pachyrhynchus) and Snares crested penguins (Eudyptes robustus) [9, 10].

  4. Divisions between macaroni (Eudyptes chrysolophus chrysolophus) and royal penguins (Eudyptes chrysolophus schlegeli) [3, 8, 11].

Penguins have an extensive fossil record, with >50 extinct species documented to date [3, 12, 13], extending back >60 million years [12]. Extant penguins span a modest range of sizes [14, 15], with the emperor penguin (Aptenodytes forsteri) the largest (30 kg) and Eudyptula penguins the smallest (1 kg). In contrast, the fossil record reveals that many extinct penguin species were giants (surpassing 100 kg in body mass [13]).

Figure 1:

Locations of breeding colonies of penguins and sampling sites for the final genomes, adapted from Ksepka et al. [1]. Sampling locations are shown with a small white ellipse. Note that the sampling location of the humboldt penguin (Spheniscus humboldti) is unclear because this individual was bred in the Copenhagen zoo, with ancestors imported from Peru and Chile in 1972. AMS: Amsterdam Island; ANT: Antipodes Islands; AUC: Auckland Islands; BOU: Bouvet; CAM: Campbell Island; CHA: Chatham Islands; CRZ: Crozet; FAL: Falkland Islands/Malvinas; GAL: Galapagos Islands; GOU: Gough Island; HEA: Heard Island; KER: Kerguelen; MAC: Macquarie Island; NZ: New Zealand; PEI: Prince Edward/Marion Island; SG: South Georgia; SNA: The Snares; SO: South Orkney Islands; SS: South Sandwich Islands.

The radiation of penguins provides an excellent case study for researching biogeographic impacts on speciation processes. Penguins inhabit every major coastline in the southern hemisphere, and almost every island archipelago in the Southern Ocean [16]. Their range extends to unique ecological niches, from the tropical Galápagos Islands (Galápagos penguin, Spheniscus mendiculus) to the oceanic temperate forests of New Zealand (Eudyptes pachyrhynchus), rocky coastlines of the sub-Antarctic islands (E. filholi), and the sea ice around Antarctica (Aptenodytes forsteri) [17]. For this reason, penguins have evolved many unique adaptations, specific to the variety of ecological environments. Previous studies have suggested that global climate change during the Eocene [18, 19], substantial oceanographic currents [7], and geological island uplift [3] were key drivers of penguin diversification. Although the phylogenetic relationships within penguins are relatively well understood [1, 3, 18, 20], it remains uncertain which lineage first diverged from other penguins. Molecular analyses have differed on whether Aptenodytes, Pygoscelis, or both together represent the sister taxa to all other extant penguins [3]. Both of these genera are endemic to coastal Antarctica and Antarctic and subantarctic islands, and thus a sequential branching pattern would suggest a polar ancestral area for extant penguins. In contrast, morphological data and the fossil record suggest that the more temperate-adapted genus Spheniscus was the first to diverge [3, 20]. Understanding the evolutionary diversification of penguins in respect to geological and climatic changes remains a substantial gap in understanding the biogeographic history of these iconic birds.

Although penguins are tied to landmasses for breeding and nesting [21], all species spend most of their lives at sea [22] and are therefore important components of terrestrial, coastal, and marine ecosystems [23]. While some taxa inhabit environments with strong winds and extreme cold temperatures, experiencing seasonal fluctuations in the length of daylight across the breeding and chick-rearing seasons [24], others inhabit relatively temperate or even tropical climates, with little variation in day length. The unique morphological and physiological adaptations that have evolved within penguins include the complete loss of aerial flight, where penguins instead use their flipper-like wings in wing-propelled diving [25], densely packed waterproof and insulating feathers [26, 27], visual sensitivity of the eye lens for underwater predation [28–30], dense bones, stiff wing joints and reduced distal wing musculature to overcome buoyancy in water [31–33], enhanced thermoregulation for extreme low temperatures, long-term fasting, ability to digest secreted food, delayed digestion [34–40], different plumage [41] and crest ornaments [42], and catastrophic moult [43]. As such, penguins are an excellent system to study comparative evolution of adaptive traits.

Penguins are also sentinels of the Southern Ocean [16], being particularly sensitive to human and environmental change [44, 45]. Extensive demographic monitoring programs have indicated that many penguin species are declining in response to global warming [44–46], pollution, environmental degradation, and competition with fisheries, which are considered key drivers of these population declines [47–50]. Demographic coalescent models have demonstrated dramatic population declines during the Pleistocene ice ages, followed by rapid population expansions in response to global warming [51–54]. Future global warming is predicted to cause significant population declines [44, 55–57]. Understanding past demographic histories and inferring future demographic trajectories therefore remain important steps for predicting ecosystem-wide changes in this rapidly warming part of the planet.

Although penguins are a relatively well-studied group, previous evolutionary studies have been limited by the genetic markers used, such as short mitochondrial [2, 10, 58–60] or nuclear sequences [1, 8, 61, 62], microsatellites [63, 64], partial mitochondrial genomes [3, 65], or single-nucleotide polymorphisms [11, 53, 54, 66–68]. Several studies have hinted at associations between biological patterns and climate change [51–54, 60, 69]. Only a few studies have explored genome-wide evolutionary processes among penguins [51, 70] or between penguins and other birds [71–73], and these studies have focussed on just 2 Antarctic taxa: the Adélie penguin (Pygoscelis adeliae) and Aptenodytes forsteri. These previous studies have created a basic framework to understand the timing of penguin diversification, identify population fluctuations during past climate cycles, and have hinted at the molecular basis for a range of physiological and morphological adaptations [51]. The molecular genomic basis for the unique morphological and physiological adaptations of penguins, compared to other aquatic and terrestrial birds, remains largely unknown. No previous study has attempted to explore the evolution of all penguins under a comparative genomic or evolutionary framework. In this Data Note, we present 19 new high-quality genomes that, together with the 2 previously reported genomes [51], encompass all extant penguin species. We demonstrate the quality and application of this new dataset by constructing a well-supported phylogenomic tree of penguins. These data provide a critical resource for understanding the drivers of penguin evolution, the molecular basis of morphological and physiological adaptations, and demographic characteristics. For species naming, we follow standard nomenclature; however, for Eudyptula we follow Grosser et al. [5, 74] and for Eudyptes and Megadyptes we follow Cole et al. [3].

Methods

Sample collection, library construction, and sequencing

While it is possible to recover genome sequences from historical museum samples [75], such genomes are often of low quality and/or fragmented [76], limiting the ability of downstream analyses. Our project design (see below) relies on high-coverage genomes with little missing data (see Li et al. [51]). Therefore, we designed our sample collection to include only high-quality blood samples. We collected 94 blood samples spanning 19 different penguin species (1–28 samples per species; Supplementary Table 1). Samples were derived from the wild, zoological parks, or wildlife hospitals and were obtained according to strict permitting procedures, animal ethics, and consultation with indigenous representatives (Supplementary Table 1).

DNA was extracted from each sample at 1 of 3 laboratories as follows: we used the HiPire Blood DNA Midi Kit II at BGI (Hong Kong), the Qiagen DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA, USA) at the University of Oxford (United Kingdom), and the KingFisher Cell and Tissue Kit in combination with the KingFisher Duo Prime Purification System at the University of Copenhagen (Denmark). All downstream methods were conducted at BGI. We diluted each DNA extraction to 20 μL using Tris-EDTA buffer. The quality and quantity of each DNA extraction was assessed by first estimating the concentration of 1 μL DNA extraction on a Microplate Reader, and DNA fragment size was evaluated by pulse gel electrophoresis or 1% agarose gel electrophoresis. Following quality control, a single sample per species was chosen for genomic library construction (Table 1).

Table 1:

Sample collection information for the 21 penguin genomes (including 2 obtained in Li et al. (51)

Latin nameCommon nameSample typeSampling locationSample labelDate extracted
Eudyptes chrysolophus schlegeliRoyalWildGreen Gorge, Macquarie Island4458October 2017
Eudyptes chrysolophus chrysolophusMacaroniWildMarion Island, Prince Edward IslandsMP PEI 1October 2017
Eudyptes pachyrhynchusFiordland-crestedWildHarrison Cove, Milford Sound, New Zealand South IslandMS 9May 2017
Eudyptes robustusSnares-crestedDunedin Wildlife HospitalThe Snares, New Zealand sub-Antarctic68M 28/09/13September 2018
Eudyptes sclateriErect-crestedWildAntipodes Island, New Zealand sub-AntarcticAnt 5September 2018
Eudyptes filholiEastern rockhopperWildCrozet IslandGS 12May 2016
Eudyptes chrysocomeWestern rockhopperWildFalkland Islands/MalvinasRH 110–1May 2016
Eudyptes moseleyiNorthern rockhopperWildAmsterdam IslandNRP 118–1May 2016
Megadyptes antipodes antipodesYellow-eyedWildOtago Peninsula, New Zealand South IslandOT 2 9/2/18August 2018
Spheniscus magellanicusMagellanicWildChiloe Island, ChileAH 6May 2016
Spheniscus demersusAfricanWildLuderitz, NamibiaAP 173July 2018
Spheniscus mendiculusGalápagosWildGalápagos IslandsGAPE 212October 2017
Spheniscus humboldtiHumboldtCopenhagen ZooPeru and Chile lineageZ-67–15October 2016
Eudyptula minor albosignataWhite-flipperedChristchurch Antarctic CentreBanks Peninsula, Canterbury, New Zealand South IslandFredJuly 2018
Eudyptula minor minorLittle blueNational Aquarium of New ZealandNew Zealand North IslandGonzoAugust 2018
Eudyptula novaehollandiaeFairyWildPhillip Island, Victoria, Australia10/9/18–1October 2018
Pygoscelis adeliaeAdélieWildInexpressible Island, Antarctica[51]NA
Pygoscelis papuaGentooWildWest Antarctic Peninsula, AntarcticaGentoo penguin DNA -4January 2018
Pygoscelis antarcticaChinstrapWildThule Island, South Sandwich IslandsCP TH 060November 2017
Aptenodytes patagonicusKingWildFortuna Bay, South GeorgiaKP FORT 001November 2017
Aptenodytes forsteriEmperorWildEmperor Island, Antarctica[51]NA
Latin nameCommon nameSample typeSampling locationSample labelDate extracted
Eudyptes chrysolophus schlegeliRoyalWildGreen Gorge, Macquarie Island4458October 2017
Eudyptes chrysolophus chrysolophusMacaroniWildMarion Island, Prince Edward IslandsMP PEI 1October 2017
Eudyptes pachyrhynchusFiordland-crestedWildHarrison Cove, Milford Sound, New Zealand South IslandMS 9May 2017
Eudyptes robustusSnares-crestedDunedin Wildlife HospitalThe Snares, New Zealand sub-Antarctic68M 28/09/13September 2018
Eudyptes sclateriErect-crestedWildAntipodes Island, New Zealand sub-AntarcticAnt 5September 2018
Eudyptes filholiEastern rockhopperWildCrozet IslandGS 12May 2016
Eudyptes chrysocomeWestern rockhopperWildFalkland Islands/MalvinasRH 110–1May 2016
Eudyptes moseleyiNorthern rockhopperWildAmsterdam IslandNRP 118–1May 2016
Megadyptes antipodes antipodesYellow-eyedWildOtago Peninsula, New Zealand South IslandOT 2 9/2/18August 2018
Spheniscus magellanicusMagellanicWildChiloe Island, ChileAH 6May 2016
Spheniscus demersusAfricanWildLuderitz, NamibiaAP 173July 2018
Spheniscus mendiculusGalápagosWildGalápagos IslandsGAPE 212October 2017
Spheniscus humboldtiHumboldtCopenhagen ZooPeru and Chile lineageZ-67–15October 2016
Eudyptula minor albosignataWhite-flipperedChristchurch Antarctic CentreBanks Peninsula, Canterbury, New Zealand South IslandFredJuly 2018
Eudyptula minor minorLittle blueNational Aquarium of New ZealandNew Zealand North IslandGonzoAugust 2018
Eudyptula novaehollandiaeFairyWildPhillip Island, Victoria, Australia10/9/18–1October 2018
Pygoscelis adeliaeAdélieWildInexpressible Island, Antarctica[51]NA
Pygoscelis papuaGentooWildWest Antarctic Peninsula, AntarcticaGentoo penguin DNA -4January 2018
Pygoscelis antarcticaChinstrapWildThule Island, South Sandwich IslandsCP TH 060November 2017
Aptenodytes patagonicusKingWildFortuna Bay, South GeorgiaKP FORT 001November 2017
Aptenodytes forsteriEmperorWildEmperor Island, Antarctica[51]NA
Table 1:

Sample collection information for the 21 penguin genomes (including 2 obtained in Li et al. (51)

Latin nameCommon nameSample typeSampling locationSample labelDate extracted
Eudyptes chrysolophus schlegeliRoyalWildGreen Gorge, Macquarie Island4458October 2017
Eudyptes chrysolophus chrysolophusMacaroniWildMarion Island, Prince Edward IslandsMP PEI 1October 2017
Eudyptes pachyrhynchusFiordland-crestedWildHarrison Cove, Milford Sound, New Zealand South IslandMS 9May 2017
Eudyptes robustusSnares-crestedDunedin Wildlife HospitalThe Snares, New Zealand sub-Antarctic68M 28/09/13September 2018
Eudyptes sclateriErect-crestedWildAntipodes Island, New Zealand sub-AntarcticAnt 5September 2018
Eudyptes filholiEastern rockhopperWildCrozet IslandGS 12May 2016
Eudyptes chrysocomeWestern rockhopperWildFalkland Islands/MalvinasRH 110–1May 2016
Eudyptes moseleyiNorthern rockhopperWildAmsterdam IslandNRP 118–1May 2016
Megadyptes antipodes antipodesYellow-eyedWildOtago Peninsula, New Zealand South IslandOT 2 9/2/18August 2018
Spheniscus magellanicusMagellanicWildChiloe Island, ChileAH 6May 2016
Spheniscus demersusAfricanWildLuderitz, NamibiaAP 173July 2018
Spheniscus mendiculusGalápagosWildGalápagos IslandsGAPE 212October 2017
Spheniscus humboldtiHumboldtCopenhagen ZooPeru and Chile lineageZ-67–15October 2016
Eudyptula minor albosignataWhite-flipperedChristchurch Antarctic CentreBanks Peninsula, Canterbury, New Zealand South IslandFredJuly 2018
Eudyptula minor minorLittle blueNational Aquarium of New ZealandNew Zealand North IslandGonzoAugust 2018
Eudyptula novaehollandiaeFairyWildPhillip Island, Victoria, Australia10/9/18–1October 2018
Pygoscelis adeliaeAdélieWildInexpressible Island, Antarctica[51]NA
Pygoscelis papuaGentooWildWest Antarctic Peninsula, AntarcticaGentoo penguin DNA -4January 2018
Pygoscelis antarcticaChinstrapWildThule Island, South Sandwich IslandsCP TH 060November 2017
Aptenodytes patagonicusKingWildFortuna Bay, South GeorgiaKP FORT 001November 2017
Aptenodytes forsteriEmperorWildEmperor Island, Antarctica[51]NA
Latin nameCommon nameSample typeSampling locationSample labelDate extracted
Eudyptes chrysolophus schlegeliRoyalWildGreen Gorge, Macquarie Island4458October 2017
Eudyptes chrysolophus chrysolophusMacaroniWildMarion Island, Prince Edward IslandsMP PEI 1October 2017
Eudyptes pachyrhynchusFiordland-crestedWildHarrison Cove, Milford Sound, New Zealand South IslandMS 9May 2017
Eudyptes robustusSnares-crestedDunedin Wildlife HospitalThe Snares, New Zealand sub-Antarctic68M 28/09/13September 2018
Eudyptes sclateriErect-crestedWildAntipodes Island, New Zealand sub-AntarcticAnt 5September 2018
Eudyptes filholiEastern rockhopperWildCrozet IslandGS 12May 2016
Eudyptes chrysocomeWestern rockhopperWildFalkland Islands/MalvinasRH 110–1May 2016
Eudyptes moseleyiNorthern rockhopperWildAmsterdam IslandNRP 118–1May 2016
Megadyptes antipodes antipodesYellow-eyedWildOtago Peninsula, New Zealand South IslandOT 2 9/2/18August 2018
Spheniscus magellanicusMagellanicWildChiloe Island, ChileAH 6May 2016
Spheniscus demersusAfricanWildLuderitz, NamibiaAP 173July 2018
Spheniscus mendiculusGalápagosWildGalápagos IslandsGAPE 212October 2017
Spheniscus humboldtiHumboldtCopenhagen ZooPeru and Chile lineageZ-67–15October 2016
Eudyptula minor albosignataWhite-flipperedChristchurch Antarctic CentreBanks Peninsula, Canterbury, New Zealand South IslandFredJuly 2018
Eudyptula minor minorLittle blueNational Aquarium of New ZealandNew Zealand North IslandGonzoAugust 2018
Eudyptula novaehollandiaeFairyWildPhillip Island, Victoria, Australia10/9/18–1October 2018
Pygoscelis adeliaeAdélieWildInexpressible Island, Antarctica[51]NA
Pygoscelis papuaGentooWildWest Antarctic Peninsula, AntarcticaGentoo penguin DNA -4January 2018
Pygoscelis antarcticaChinstrapWildThule Island, South Sandwich IslandsCP TH 060November 2017
Aptenodytes patagonicusKingWildFortuna Bay, South GeorgiaKP FORT 001November 2017
Aptenodytes forsteriEmperorWildEmperor Island, Antarctica[51]NA

We constructed 1 or more genomic libraries for each of the 19 penguin species depending on the DNA quality. For species that we could obtain high molecular weight DNA with the main band longer than 40 kb, we constructed 10X Genomics genomic libraries to produce 100× coverage sequencing data (Table 2). To do this, we attached a specific unique barcode to 1 end of short DNA fragments that are broken from 1 long DNA fragment, using standard protocols provided by Chromium™ Genome Solution. Because this protocol encompasses >1 million specific barcodes in a single solution, it decreases the chance of short DNA fragments with the same barcode being derived from unrelated long DNA fragments. For those species with shorter DNA fragments (<40 kb), we constructed genomic libraries following Illumina (San Diego, CA, [77]) or BGIseq 500 [78] protocols. Those protocols resulted in several paired-end libraries with insert sizes of either 250 or 500 bp, in addition to several mate-pair libraries with insert sizes ranging from 2 to 10 kb (Table 2). We further generated 100–320× coverage sequencing data for these species. Furthermore, we did not find any significant difference in the assembly quality between Illumina and BGIseq, while the 10x strategy normally produced better assembly than the other strategy with multiple insert-sized libraries (Table 3). Following sequencing, we generated 3.24 Tb sequencing reads encompassing all 19 penguin species, obtaining >111 Gb data per species (Table 2).

Table 2:

Details of the sequencing platform used and the data statistics for 21 penguin genomes

SpeciesLibrary construction strategySequencing platformRaw data (Gb)Clean data (Gb)
Eudyptes chrysolophus chrysolophus10XBGIseq500145.9126.9
Megadyptes antipodes antipodes10XBGIseq500111.9104.1
Spheniscus demersus10XBGIseq500141.1131.3
Spheniscus mendiculus10XBGIseq500112.2104.4
Eudyptula minor albosignata10XBGIseq500132.5124.8
Eudyptula minor minor10XBGIseq500121.4112.7
Eudyptula novaehollandiae10XBGIseq500180.4168.5
Pygoscelis papua10XBGIseq500134.5124.0
Pygoscelis antarctica10XBGIseq500154.5139.7
Aptenodytes patagonicus10XBGIseq500147.6134.0
Eudyptes chrysolophus schlegeli250 bp, 2 kb, 5 kb, 10 kbBGIseq500402.6296.6
Eudyptes pachyrhynchus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000146.4104.7
Eudyptes robustus250 bp, 2 kbHiSeq X ten and HiSeq 4000171.2107.6
Eudyptes sclateri250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000156.2103.2
Eudyptes filholi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000195.0146.8
Eudyptes chrysocome250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000195.1111.6
Eudyptes moseleyi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000173.6133.1
Spheniscus magellanicus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000212.6150.7
Spheniscus humboldti250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000208.8137.2
SpeciesLibrary construction strategySequencing platformRaw data (Gb)Clean data (Gb)
Eudyptes chrysolophus chrysolophus10XBGIseq500145.9126.9
Megadyptes antipodes antipodes10XBGIseq500111.9104.1
Spheniscus demersus10XBGIseq500141.1131.3
Spheniscus mendiculus10XBGIseq500112.2104.4
Eudyptula minor albosignata10XBGIseq500132.5124.8
Eudyptula minor minor10XBGIseq500121.4112.7
Eudyptula novaehollandiae10XBGIseq500180.4168.5
Pygoscelis papua10XBGIseq500134.5124.0
Pygoscelis antarctica10XBGIseq500154.5139.7
Aptenodytes patagonicus10XBGIseq500147.6134.0
Eudyptes chrysolophus schlegeli250 bp, 2 kb, 5 kb, 10 kbBGIseq500402.6296.6
Eudyptes pachyrhynchus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000146.4104.7
Eudyptes robustus250 bp, 2 kbHiSeq X ten and HiSeq 4000171.2107.6
Eudyptes sclateri250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000156.2103.2
Eudyptes filholi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000195.0146.8
Eudyptes chrysocome250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000195.1111.6
Eudyptes moseleyi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000173.6133.1
Spheniscus magellanicus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000212.6150.7
Spheniscus humboldti250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000208.8137.2

HiSeq X ten was used for sequencing small insert size libraries; HiSeq 4000 was used for sequencing mate-pair libraries.

Table 2:

Details of the sequencing platform used and the data statistics for 21 penguin genomes

SpeciesLibrary construction strategySequencing platformRaw data (Gb)Clean data (Gb)
Eudyptes chrysolophus chrysolophus10XBGIseq500145.9126.9
Megadyptes antipodes antipodes10XBGIseq500111.9104.1
Spheniscus demersus10XBGIseq500141.1131.3
Spheniscus mendiculus10XBGIseq500112.2104.4
Eudyptula minor albosignata10XBGIseq500132.5124.8
Eudyptula minor minor10XBGIseq500121.4112.7
Eudyptula novaehollandiae10XBGIseq500180.4168.5
Pygoscelis papua10XBGIseq500134.5124.0
Pygoscelis antarctica10XBGIseq500154.5139.7
Aptenodytes patagonicus10XBGIseq500147.6134.0
Eudyptes chrysolophus schlegeli250 bp, 2 kb, 5 kb, 10 kbBGIseq500402.6296.6
Eudyptes pachyrhynchus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000146.4104.7
Eudyptes robustus250 bp, 2 kbHiSeq X ten and HiSeq 4000171.2107.6
Eudyptes sclateri250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000156.2103.2
Eudyptes filholi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000195.0146.8
Eudyptes chrysocome250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000195.1111.6
Eudyptes moseleyi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000173.6133.1
Spheniscus magellanicus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000212.6150.7
Spheniscus humboldti250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000208.8137.2
SpeciesLibrary construction strategySequencing platformRaw data (Gb)Clean data (Gb)
Eudyptes chrysolophus chrysolophus10XBGIseq500145.9126.9
Megadyptes antipodes antipodes10XBGIseq500111.9104.1
Spheniscus demersus10XBGIseq500141.1131.3
Spheniscus mendiculus10XBGIseq500112.2104.4
Eudyptula minor albosignata10XBGIseq500132.5124.8
Eudyptula minor minor10XBGIseq500121.4112.7
Eudyptula novaehollandiae10XBGIseq500180.4168.5
Pygoscelis papua10XBGIseq500134.5124.0
Pygoscelis antarctica10XBGIseq500154.5139.7
Aptenodytes patagonicus10XBGIseq500147.6134.0
Eudyptes chrysolophus schlegeli250 bp, 2 kb, 5 kb, 10 kbBGIseq500402.6296.6
Eudyptes pachyrhynchus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000146.4104.7
Eudyptes robustus250 bp, 2 kbHiSeq X ten and HiSeq 4000171.2107.6
Eudyptes sclateri250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000156.2103.2
Eudyptes filholi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000195.0146.8
Eudyptes chrysocome250 bp, 2 kb, 5 kbHiSeq X ten and HiSeq 4000195.1111.6
Eudyptes moseleyi250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000173.6133.1
Spheniscus magellanicus250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000212.6150.7
Spheniscus humboldti250 bp, 2 kb, 5 kb, 10 kbHiSeq X ten and HiSeq 4000208.8137.2

HiSeq X ten was used for sequencing small insert size libraries; HiSeq 4000 was used for sequencing mate-pair libraries.

Table 3:

Assembly statistics and BUSCO results for 21 penguin genomes within a total of 4,915 conserved avian orthologs

Library construction strategySpeciesContig N50 (bp)Scaffold N50 (bp)Genome size (bp)CompleteDuplicationFragmentedMissing
10xEudyptes chrysolophus chrysolophus163,84813,794,8371,368,663,69585.40%7.70%4.40%2.50%
Megadyptes antipodes antipodes83,95423,315,1171,317,732,92391.80%1.20%4.20%2.80%
Spheniscus demersus101,40815,386,3641,278,371,92491.30%0.90%4.70%3.10%
Spheniscus mendiculus72,552380,9501,300,348,60988.90%1.60%5.70%3.80%
Eudyptula minor albosignata95,77321,866,5431,374,338,38185.60%7.40%4.20%2.80%
Eudyptula minor minor88,19021,127,6461,466,686,83184.00%8.60%4.60%2.80%
Eudyptula novaehollandiae122,46129,280,2091,357,427,56089.00%4.70%3.80%2.50%
Pygoscelis papua93,7852,780,8371,309,329,55390.70%1.50%5.00%2.80%
Pygoscelis antarctica118,3366,180,2601,265,661,67691.30%1.20%4.60%2.90%
Aptenodytes patagonicus116,7692,903,8101,256,739,11891.50%1.10%4.20%3.20%
Multi-librariesEudyptes chrysolophus schlegeli24,1911,877,5481,310,605,48893.20%1.50%3.30%2.00%
Eudyptes pachyrhynchus33,3198,795,0331,310,923,78880.20%7.70%4.30%7.80%
Eudyptes robustus29,712363,3101,248,618,55387.30%1.10%5.10%6.50%
Eudyptes sclateri69,5621,921,2441,211,737,89993.60%1.10%3.20%2.10%
Eudyptes filholi74,2806,429,2211,223,976,46893.20%1.00%3.60%2.20%
Eudyptes chrysocome66,0051,949,3231,231,067,97093.80%1.00%3.00%2.20%
Eudyptes moseleyi21,3622,248,0881,306,699,57593.60%1.20%3.00%2.20%
Spheniscus magellanicus41,45512,679,4691,262,636,73893.10%1.30%3.50%2.10%
Spheniscus humboldti19,8496,229,8191,243,403,14293.30%1.10%3.50%2.10%
Pygoscelis adeliae22,1955,118,8961,216,600,03392.80%0.60%4.00%2.60%
Aptenodytes forsteri31,7305,071,5981,254,347,44093.20%0.80%3.60%2.40%
Library construction strategySpeciesContig N50 (bp)Scaffold N50 (bp)Genome size (bp)CompleteDuplicationFragmentedMissing
10xEudyptes chrysolophus chrysolophus163,84813,794,8371,368,663,69585.40%7.70%4.40%2.50%
Megadyptes antipodes antipodes83,95423,315,1171,317,732,92391.80%1.20%4.20%2.80%
Spheniscus demersus101,40815,386,3641,278,371,92491.30%0.90%4.70%3.10%
Spheniscus mendiculus72,552380,9501,300,348,60988.90%1.60%5.70%3.80%
Eudyptula minor albosignata95,77321,866,5431,374,338,38185.60%7.40%4.20%2.80%
Eudyptula minor minor88,19021,127,6461,466,686,83184.00%8.60%4.60%2.80%
Eudyptula novaehollandiae122,46129,280,2091,357,427,56089.00%4.70%3.80%2.50%
Pygoscelis papua93,7852,780,8371,309,329,55390.70%1.50%5.00%2.80%
Pygoscelis antarctica118,3366,180,2601,265,661,67691.30%1.20%4.60%2.90%
Aptenodytes patagonicus116,7692,903,8101,256,739,11891.50%1.10%4.20%3.20%
Multi-librariesEudyptes chrysolophus schlegeli24,1911,877,5481,310,605,48893.20%1.50%3.30%2.00%
Eudyptes pachyrhynchus33,3198,795,0331,310,923,78880.20%7.70%4.30%7.80%
Eudyptes robustus29,712363,3101,248,618,55387.30%1.10%5.10%6.50%
Eudyptes sclateri69,5621,921,2441,211,737,89993.60%1.10%3.20%2.10%
Eudyptes filholi74,2806,429,2211,223,976,46893.20%1.00%3.60%2.20%
Eudyptes chrysocome66,0051,949,3231,231,067,97093.80%1.00%3.00%2.20%
Eudyptes moseleyi21,3622,248,0881,306,699,57593.60%1.20%3.00%2.20%
Spheniscus magellanicus41,45512,679,4691,262,636,73893.10%1.30%3.50%2.10%
Spheniscus humboldti19,8496,229,8191,243,403,14293.30%1.10%3.50%2.10%
Pygoscelis adeliae22,1955,118,8961,216,600,03392.80%0.60%4.00%2.60%
Aptenodytes forsteri31,7305,071,5981,254,347,44093.20%0.80%3.60%2.40%
Table 3:

Assembly statistics and BUSCO results for 21 penguin genomes within a total of 4,915 conserved avian orthologs

Library construction strategySpeciesContig N50 (bp)Scaffold N50 (bp)Genome size (bp)CompleteDuplicationFragmentedMissing
10xEudyptes chrysolophus chrysolophus163,84813,794,8371,368,663,69585.40%7.70%4.40%2.50%
Megadyptes antipodes antipodes83,95423,315,1171,317,732,92391.80%1.20%4.20%2.80%
Spheniscus demersus101,40815,386,3641,278,371,92491.30%0.90%4.70%3.10%
Spheniscus mendiculus72,552380,9501,300,348,60988.90%1.60%5.70%3.80%
Eudyptula minor albosignata95,77321,866,5431,374,338,38185.60%7.40%4.20%2.80%
Eudyptula minor minor88,19021,127,6461,466,686,83184.00%8.60%4.60%2.80%
Eudyptula novaehollandiae122,46129,280,2091,357,427,56089.00%4.70%3.80%2.50%
Pygoscelis papua93,7852,780,8371,309,329,55390.70%1.50%5.00%2.80%
Pygoscelis antarctica118,3366,180,2601,265,661,67691.30%1.20%4.60%2.90%
Aptenodytes patagonicus116,7692,903,8101,256,739,11891.50%1.10%4.20%3.20%
Multi-librariesEudyptes chrysolophus schlegeli24,1911,877,5481,310,605,48893.20%1.50%3.30%2.00%
Eudyptes pachyrhynchus33,3198,795,0331,310,923,78880.20%7.70%4.30%7.80%
Eudyptes robustus29,712363,3101,248,618,55387.30%1.10%5.10%6.50%
Eudyptes sclateri69,5621,921,2441,211,737,89993.60%1.10%3.20%2.10%
Eudyptes filholi74,2806,429,2211,223,976,46893.20%1.00%3.60%2.20%
Eudyptes chrysocome66,0051,949,3231,231,067,97093.80%1.00%3.00%2.20%
Eudyptes moseleyi21,3622,248,0881,306,699,57593.60%1.20%3.00%2.20%
Spheniscus magellanicus41,45512,679,4691,262,636,73893.10%1.30%3.50%2.10%
Spheniscus humboldti19,8496,229,8191,243,403,14293.30%1.10%3.50%2.10%
Pygoscelis adeliae22,1955,118,8961,216,600,03392.80%0.60%4.00%2.60%
Aptenodytes forsteri31,7305,071,5981,254,347,44093.20%0.80%3.60%2.40%
Library construction strategySpeciesContig N50 (bp)Scaffold N50 (bp)Genome size (bp)CompleteDuplicationFragmentedMissing
10xEudyptes chrysolophus chrysolophus163,84813,794,8371,368,663,69585.40%7.70%4.40%2.50%
Megadyptes antipodes antipodes83,95423,315,1171,317,732,92391.80%1.20%4.20%2.80%
Spheniscus demersus101,40815,386,3641,278,371,92491.30%0.90%4.70%3.10%
Spheniscus mendiculus72,552380,9501,300,348,60988.90%1.60%5.70%3.80%
Eudyptula minor albosignata95,77321,866,5431,374,338,38185.60%7.40%4.20%2.80%
Eudyptula minor minor88,19021,127,6461,466,686,83184.00%8.60%4.60%2.80%
Eudyptula novaehollandiae122,46129,280,2091,357,427,56089.00%4.70%3.80%2.50%
Pygoscelis papua93,7852,780,8371,309,329,55390.70%1.50%5.00%2.80%
Pygoscelis antarctica118,3366,180,2601,265,661,67691.30%1.20%4.60%2.90%
Aptenodytes patagonicus116,7692,903,8101,256,739,11891.50%1.10%4.20%3.20%
Multi-librariesEudyptes chrysolophus schlegeli24,1911,877,5481,310,605,48893.20%1.50%3.30%2.00%
Eudyptes pachyrhynchus33,3198,795,0331,310,923,78880.20%7.70%4.30%7.80%
Eudyptes robustus29,712363,3101,248,618,55387.30%1.10%5.10%6.50%
Eudyptes sclateri69,5621,921,2441,211,737,89993.60%1.10%3.20%2.10%
Eudyptes filholi74,2806,429,2211,223,976,46893.20%1.00%3.60%2.20%
Eudyptes chrysocome66,0051,949,3231,231,067,97093.80%1.00%3.00%2.20%
Eudyptes moseleyi21,3622,248,0881,306,699,57593.60%1.20%3.00%2.20%
Spheniscus magellanicus41,45512,679,4691,262,636,73893.10%1.30%3.50%2.10%
Spheniscus humboldti19,8496,229,8191,243,403,14293.30%1.10%3.50%2.10%
Pygoscelis adeliae22,1955,118,8961,216,600,03392.80%0.60%4.00%2.60%
Aptenodytes forsteri31,7305,071,5981,254,347,44093.20%0.80%3.60%2.40%

Genome assembly and quality evaluation

Sequences obtained from the 250-bp insert size libraries and the 10x libraries were used to evaluate the genome size for each penguin using a k-mer approach [79]. Reads were scanned using a 17-bp window with 1 bp sliding and the frequency of each 17 k-mer was recorded. After all the reads were scanned, the k-mer frequency distributions were plotted and the depth with the highest frequency (K_dep) was defined. The genome size was estimated as the read number * (read length – 17 + 1)/K_dep. The filtered reads for the 10x libraries were only used for estimating the genome size with 17 k-mer, while all reads were used for Supernova assembly.

Sequencing errors have a major effect on subsequent genome assembly because they both introduce mistakes in the assembly and also decrease the assembly continuities. Several features can be linked to sequencing noise, including low-quality bases, adaptor contamination, and duplication [80]. To remove the potential biases introduced by sequencing noise, we filtered our raw sequencing reads prior to genome assembly, following strict standards including (i) discarding paired-end reads containing overlaps, (ii) removing reads with >20% low-quality bases as the quality score was <10, (iii) removing reads with >5% ambiguous N bases, (iv) removing paired-end reads containing identical sequences likely to be PCR duplicates, and (v) removing reads with adaptor sequences. Following filtering, each genome contained >104 Gb data. Overall, we obtained a total of 2.56 Tb high-quality data for all 19 penguin genomes (Table 2).

Both SOAPdenovo v. 2–2.04 (SOAPdenovo2, RRID:SCR_014986) [81] and Allpaths-LG (ALLPATHS-LG, RRID:SCR_010742) [82] were used to assemble the genomic libraries from the various insert sizes. For SOAPdenovo, paired-end reads from small insert size libraries were used to construct de Bruijn graphs, with various k-mer ranging from 23 to 47. Contigs were subsequently constructed using contig modular with the “-D 1 -g” parameter to remove edges containing coverages no larger than 1. Following this, “map -k 35 -g” was used to map mate-pair reads into contigs, with k-mer size 35. Finally, we conducted scaffolding with parameters “scaff -g -F” to assemble the contigs into longer linkages. The best version, in terms of various k-mer in the graph construction step, was chosen as the SOAPdenovo representative for each species. In addition, we also assembled genomic libraries from various insert sizes using Allpaths-LG following the default parameters. By comparing the assemblies from both SOAPdenovo and Allpaths-LG, according to both the scaffold N50 and the total length, we chose the best assembler as a representative for each of the 19 penguin species. Supernova v. 2.0 [83], recommended for 10x genomic data [83], was used to assemble those species with 10x genomic libraries, following the default parameters. The optimal assembly strategy chosen for each penguin species is listed in Supplementary Table 2. For each assembly, we used GapCloser v. 1.12 (GapCloser, RRID:SCR_015026) [81] to locally assemble and close gaps within each scaffold following the default parameters.

All penguins (including those obtained in Li et al. [51]) were estimated to have a ∼1.3-Gb genome (Fig.   2), containing little variances. Most assemblies have both a longer scaffold N50 and contig N50 than the Aptenodytes forsteri and Pygoscelis adeliae assemblies obtained by Li et al. [51] (Fig. 2). In total, the 21 genomes contained a scaffold N50 >1 Mb, and of those, 13 genomes contained a scaffold N50 >3 Mb. All penguin genomes contain a contig N50 >19 kb and 15 of the genomes are >30 kb. The maximum contig N50 extends to 163 kb for the macaroni penguin (Eudyptes chrysolophus chrysolophus) (Fig. 2). The highest-quality genome is Eudyptula novaehollandiae, encompassing a 29.3-Mb scaffold N50. Therefore, our results demonstrate consistency and high quality among all 21 penguin genomes (Fig. 2).

Figure 2:

Genome assembly statistics of all penguin species. A, Dot plot of the quality of each index showing contig N50 (maximum is Eudyptes chrysolophus chrysolophus with 163,848 bp; minimum is Spheniscus humboldti with 19,849 bp) and scaffold N50 (maximum is Eudyptula novaehollandiae with 29,280,209 bp; minimum is Eudyptes robustus with 363,310 bp). Each symbol indicates a penguin species, the x-axis indicates the scaffold N50, and the y-axis indicates the contig N50 for each species. B, Genome size for each penguin species (maximum is Eudyptula minor with 1,466,686,831 bp; minimum is Eudyptes sclateri with 1,211,737,899 bp). C, BUSCO assessments of all penguin genomes, showing the percentage of complete, duplicated, fragmented, or missing data. See Table   3 for more details. The symbols for each penguin species correspond to the symbols used in Fig. 1. and Fig. 3.

The genome assembly completeness provides an evaluation of the assembly quality. We used BUSCO v. 3.0.2 (BUSCO, RRID:SCR_015008) [84] to evaluate our newly assembled penguin genomes with the avian database aves_odb9, which encompasses 4,915 conserved avian orthologs (Table 3). Only ∼3% of the core genes in aves_odb9 could not be annotated on the 21 penguin genomes (ranging between 2% and 7.8%). This demonstrates that all 21 penguin genomes are near-complete, containing only a few gaps. We identified an average of 90% complete core genes on each of the 21 penguin genomes, with the richest being 93.8% on Eudyptes chrysocome. Furthermore, when several genes were annotated in >1 copy, we considered them to be duplications. Duplication rates among the 21 penguin genomes varied only between 0.6% and 8.6%. In addition, only ∼4% of the core genes were partly annotated on each of the 21 penguin genomes (Fig. 2). Overall, we obtained almost-complete, high-quality genomes. Our genomic dataset (including those obtained in Li et al. [51]) encompasses all extant penguin species, representing a comprehensive dataset.

Repeat annotation

We used RepeatMasker v. 4.0.7 (RepeatMasker, RRID:SCR_012954) [85, 86], TRF v. 4.09 [87], and RepeatModeler v. 1.0.8 (RepeatModeler, RRID:SCR_015027) [88, 86] to identify repetitive sequences in each of the penguin genomes. We compared our genomes to 5 avian outgroups: wedge-rumped storm petrel (Hydrobates tethys), Wilson's storm petrel (Oceanites oceanicus), Atlantic yellow-nosed albatross (Thalassarche chlororhynchos), zebra finch (Taeniopygia guttata), and chicken (Gallus gallus). Genome sequences were aligned to RepBase23.04 [89] through RepeatMasker, and each hit was further classified into detailed categories. Tandem repeats, which are a series of DNA sequences containing >2 adjacent copies, were identified with TRF using the default parameters. In addition, we used RepeatModeler in a de novo repeat family identifying approach. All identified repeat elements were classified into 7 categories (DNA, long interspersed nuclear element [LINE], short interspersed nuclear element [SINE], long terminal repeat [LTR], other, unknown, tandem repeat) according to classification in repeat databases. Repeat annotations using the 3 methods were combined into a non-redundant repeat annotation for each penguin genome and the 5 outgroups.

Approximately 10% of the genome sequences were identified as repeat elements on each penguin genome, which is similar to the 5 outgroups (Table 2). Although all penguin genomes had similar repeat content, they varied in content for each category. In all penguins and outgroups, the most abundant repeat category was LINE. E. moseleyi has the richest tandem repeats of 3.52%, which is substantially greater than Aptenodytes forsteri, which has a richest tandem repeats of 2.24% and contains the second richest tandem repeats repeat in all penguins. Eudyptula minor minor had the most genome sequences identified as LTR (4.26%). See Table 4 for specific details on repeat annotations for each species.

Table 4:

Repeat annotation results for 21 penguins and 5 outgroups

 SpeciesDNALINESINELTROtherUnknownTRFTotal
Length (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genome
Eudyptes chrysolophus schlegeli10,967,9930.8456,600,2584.321,886,0420.1423,772,8201.811,7090.000137,181,8430.5527,041,0732.06122,778,3149.37
Eudyptes chrysolophus chrysolophus9,840,5770.7281,007,8975.922,325,6300.1742,950,4883.142,1090.000156,349,6690.467,624,7520.56147,221,28310.80
Eudyptes pachyrhynchus9,700,5490.7457,537,4114.391,761,6710.1326,951,8712.067,1630.000558,778,9950.6715,315,1091.17115,154,4998.78
Eudyptes robustus10,035,1610.8054,876,9084.401,694,8960.1421,900,2401.751,1970.0000966,793,7840.5413,082,3501.05105,161,0388.42
Eudyptes sclateri9,603,1060.7957,388,3364.741,648,5340.1422,555,2831.862,1550.000185,455,8960.457,045,8580.58101,615,9428.39
Eudyptes filholi9,447,8240.7758,471,1854.781,894,9150.1623,146,9531.892,6620.000228,146,7130.677,812,6340.64104,766,9148.56
Eudyptes chrysocome9,067,9620.7458,040,2644.711,608,6440.1322,515,8091.832,0950.000177,321,7220.607,332,6110.60103,276,4478.39
Eudyptes moseleyi9,367,9540.7258,805,4254.501,990,4690.1523,593,7671.812,6640.000209,786,6330.7545,959,2933.52141,103,33010.80
Megadyptes antipodes antipodes9,608,3490.7378,978,6185.991,728,5240.1346,464,4183.531,0590.0000808,168,7850.627,802,0480.59148,977,69311.30
Spheniscus magellanicus10,393,3490.8265,351,0675.181,812,3550.1426,759,5432.121,5460.000129,851,2370.7810,398,9340.82118,099,1799.35
Spheniscus demersus9,811,4670.7772,969,2935.711,610,1710.1334,709,6832.721,5090.0001220,385,5571.596,712,6980.53130,219,70910.2
Spheniscus mendiculus10,792,0370.8380,340,7736.181,694,4280.1343,906,0263.382,2650.0001713,023,3351.007,421,9790.57147,721,43111.4
Spheniscus humboldti9,850,5230.8063,427,9715.102,095,4390.1726,032,1872.092,6100.000217,051,3640.5710,846,5630.87115,794,6799.31
Eudyptula minor albosignata10,287,2540.7586,732,4466.312,230,4420.1649,548,7593.612,2850.0001710,370,6410.768,661,2850.63160,541,23911.70
Eudyptula minor minor10,691,1410.7395,293,4826.501,790,4480.1262,515,5344.262,2450.000158,460,2990.589,083,7820.62183,740,28412.5
Eudyptula novaehollandiae10,542,9980.7887,757,4666.461,654,9000.1253,144,6573.921,5220.0001112,914,7200.958,531,8300.63164,989,80112.20
Pygoscelis adeliae8,905,9650.7352,089,8164.281,643,6840.1417,580,6861.451,6850.000146,938,9500.578,565,4830.7093,839,1287.71
Pygoscelis papua10,878,0360.8379,578,5036.081,683,5740.1347,004,7883.592,1630.000178,393,8770.647,857,9580.60151,240,87711.60
Pygoscelis antarctica10,021,1090.7975,467,7825.961,660,0230.1336,515,9882.891,6450.000135,649,5210.456,850,7330.54133,620,72810.60
Aptenodytes patagonicus9,883,8300.7972,143,8445.741,669,2480.1333,210,7182.642,2730.000185,987,8570.486,868,1650.55126,913,55410.10
Aptenodytes forsteri9,648,9880.7747,421,2283.781,755,2520.1414,998,9791.201,0550.0000845,984,1140.4828,075,5182.24103,411,4678.24
Hydrobates tethys10,174,8350.8543,642,7503.651,593,2480.1313,363,1321.121,7800.000156,044,0780.5110,375,0340.8782,871,3656.93
Oceanites oceanicus8,172,7570.6953,982,1744.581,518,2130.1319,561,6011.662,2020.000196,101,2430.5210,501,1410.8997,111,6238.24
Thalassarche chlororhynchos10,390,4490.9341,856,1393.741,766,0940.1614,374,6961.292,0350.000185,822,9590.526,943,8030.6279,491,4037.11
Taeniopygia guttata5,985,0510.4951,144,9024.15883,3240.07250,817,6044.124,7130.0003813,099,8291.0625,800,7762.09137,289,21711.10
Gallus gallus13,929,7891.3378,779,2797.52571,0670.05521,043,1142.011,6380.0001620,514,5321.9610,603,8611.01129,394,28812.40
 SpeciesDNALINESINELTROtherUnknownTRFTotal
Length (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genome
Eudyptes chrysolophus schlegeli10,967,9930.8456,600,2584.321,886,0420.1423,772,8201.811,7090.000137,181,8430.5527,041,0732.06122,778,3149.37
Eudyptes chrysolophus chrysolophus9,840,5770.7281,007,8975.922,325,6300.1742,950,4883.142,1090.000156,349,6690.467,624,7520.56147,221,28310.80
Eudyptes pachyrhynchus9,700,5490.7457,537,4114.391,761,6710.1326,951,8712.067,1630.000558,778,9950.6715,315,1091.17115,154,4998.78
Eudyptes robustus10,035,1610.8054,876,9084.401,694,8960.1421,900,2401.751,1970.0000966,793,7840.5413,082,3501.05105,161,0388.42
Eudyptes sclateri9,603,1060.7957,388,3364.741,648,5340.1422,555,2831.862,1550.000185,455,8960.457,045,8580.58101,615,9428.39
Eudyptes filholi9,447,8240.7758,471,1854.781,894,9150.1623,146,9531.892,6620.000228,146,7130.677,812,6340.64104,766,9148.56
Eudyptes chrysocome9,067,9620.7458,040,2644.711,608,6440.1322,515,8091.832,0950.000177,321,7220.607,332,6110.60103,276,4478.39
Eudyptes moseleyi9,367,9540.7258,805,4254.501,990,4690.1523,593,7671.812,6640.000209,786,6330.7545,959,2933.52141,103,33010.80
Megadyptes antipodes antipodes9,608,3490.7378,978,6185.991,728,5240.1346,464,4183.531,0590.0000808,168,7850.627,802,0480.59148,977,69311.30
Spheniscus magellanicus10,393,3490.8265,351,0675.181,812,3550.1426,759,5432.121,5460.000129,851,2370.7810,398,9340.82118,099,1799.35
Spheniscus demersus9,811,4670.7772,969,2935.711,610,1710.1334,709,6832.721,5090.0001220,385,5571.596,712,6980.53130,219,70910.2
Spheniscus mendiculus10,792,0370.8380,340,7736.181,694,4280.1343,906,0263.382,2650.0001713,023,3351.007,421,9790.57147,721,43111.4
Spheniscus humboldti9,850,5230.8063,427,9715.102,095,4390.1726,032,1872.092,6100.000217,051,3640.5710,846,5630.87115,794,6799.31
Eudyptula minor albosignata10,287,2540.7586,732,4466.312,230,4420.1649,548,7593.612,2850.0001710,370,6410.768,661,2850.63160,541,23911.70
Eudyptula minor minor10,691,1410.7395,293,4826.501,790,4480.1262,515,5344.262,2450.000158,460,2990.589,083,7820.62183,740,28412.5
Eudyptula novaehollandiae10,542,9980.7887,757,4666.461,654,9000.1253,144,6573.921,5220.0001112,914,7200.958,531,8300.63164,989,80112.20
Pygoscelis adeliae8,905,9650.7352,089,8164.281,643,6840.1417,580,6861.451,6850.000146,938,9500.578,565,4830.7093,839,1287.71
Pygoscelis papua10,878,0360.8379,578,5036.081,683,5740.1347,004,7883.592,1630.000178,393,8770.647,857,9580.60151,240,87711.60
Pygoscelis antarctica10,021,1090.7975,467,7825.961,660,0230.1336,515,9882.891,6450.000135,649,5210.456,850,7330.54133,620,72810.60
Aptenodytes patagonicus9,883,8300.7972,143,8445.741,669,2480.1333,210,7182.642,2730.000185,987,8570.486,868,1650.55126,913,55410.10
Aptenodytes forsteri9,648,9880.7747,421,2283.781,755,2520.1414,998,9791.201,0550.0000845,984,1140.4828,075,5182.24103,411,4678.24
Hydrobates tethys10,174,8350.8543,642,7503.651,593,2480.1313,363,1321.121,7800.000156,044,0780.5110,375,0340.8782,871,3656.93
Oceanites oceanicus8,172,7570.6953,982,1744.581,518,2130.1319,561,6011.662,2020.000196,101,2430.5210,501,1410.8997,111,6238.24
Thalassarche chlororhynchos10,390,4490.9341,856,1393.741,766,0940.1614,374,6961.292,0350.000185,822,9590.526,943,8030.6279,491,4037.11
Taeniopygia guttata5,985,0510.4951,144,9024.15883,3240.07250,817,6044.124,7130.0003813,099,8291.0625,800,7762.09137,289,21711.10
Gallus gallus13,929,7891.3378,779,2797.52571,0670.05521,043,1142.011,6380.0001620,514,5321.9610,603,8611.01129,394,28812.40
Table 4:

Repeat annotation results for 21 penguins and 5 outgroups

 SpeciesDNALINESINELTROtherUnknownTRFTotal
Length (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genome
Eudyptes chrysolophus schlegeli10,967,9930.8456,600,2584.321,886,0420.1423,772,8201.811,7090.000137,181,8430.5527,041,0732.06122,778,3149.37
Eudyptes chrysolophus chrysolophus9,840,5770.7281,007,8975.922,325,6300.1742,950,4883.142,1090.000156,349,6690.467,624,7520.56147,221,28310.80
Eudyptes pachyrhynchus9,700,5490.7457,537,4114.391,761,6710.1326,951,8712.067,1630.000558,778,9950.6715,315,1091.17115,154,4998.78
Eudyptes robustus10,035,1610.8054,876,9084.401,694,8960.1421,900,2401.751,1970.0000966,793,7840.5413,082,3501.05105,161,0388.42
Eudyptes sclateri9,603,1060.7957,388,3364.741,648,5340.1422,555,2831.862,1550.000185,455,8960.457,045,8580.58101,615,9428.39
Eudyptes filholi9,447,8240.7758,471,1854.781,894,9150.1623,146,9531.892,6620.000228,146,7130.677,812,6340.64104,766,9148.56
Eudyptes chrysocome9,067,9620.7458,040,2644.711,608,6440.1322,515,8091.832,0950.000177,321,7220.607,332,6110.60103,276,4478.39
Eudyptes moseleyi9,367,9540.7258,805,4254.501,990,4690.1523,593,7671.812,6640.000209,786,6330.7545,959,2933.52141,103,33010.80
Megadyptes antipodes antipodes9,608,3490.7378,978,6185.991,728,5240.1346,464,4183.531,0590.0000808,168,7850.627,802,0480.59148,977,69311.30
Spheniscus magellanicus10,393,3490.8265,351,0675.181,812,3550.1426,759,5432.121,5460.000129,851,2370.7810,398,9340.82118,099,1799.35
Spheniscus demersus9,811,4670.7772,969,2935.711,610,1710.1334,709,6832.721,5090.0001220,385,5571.596,712,6980.53130,219,70910.2
Spheniscus mendiculus10,792,0370.8380,340,7736.181,694,4280.1343,906,0263.382,2650.0001713,023,3351.007,421,9790.57147,721,43111.4
Spheniscus humboldti9,850,5230.8063,427,9715.102,095,4390.1726,032,1872.092,6100.000217,051,3640.5710,846,5630.87115,794,6799.31
Eudyptula minor albosignata10,287,2540.7586,732,4466.312,230,4420.1649,548,7593.612,2850.0001710,370,6410.768,661,2850.63160,541,23911.70
Eudyptula minor minor10,691,1410.7395,293,4826.501,790,4480.1262,515,5344.262,2450.000158,460,2990.589,083,7820.62183,740,28412.5
Eudyptula novaehollandiae10,542,9980.7887,757,4666.461,654,9000.1253,144,6573.921,5220.0001112,914,7200.958,531,8300.63164,989,80112.20
Pygoscelis adeliae8,905,9650.7352,089,8164.281,643,6840.1417,580,6861.451,6850.000146,938,9500.578,565,4830.7093,839,1287.71
Pygoscelis papua10,878,0360.8379,578,5036.081,683,5740.1347,004,7883.592,1630.000178,393,8770.647,857,9580.60151,240,87711.60
Pygoscelis antarctica10,021,1090.7975,467,7825.961,660,0230.1336,515,9882.891,6450.000135,649,5210.456,850,7330.54133,620,72810.60
Aptenodytes patagonicus9,883,8300.7972,143,8445.741,669,2480.1333,210,7182.642,2730.000185,987,8570.486,868,1650.55126,913,55410.10
Aptenodytes forsteri9,648,9880.7747,421,2283.781,755,2520.1414,998,9791.201,0550.0000845,984,1140.4828,075,5182.24103,411,4678.24
Hydrobates tethys10,174,8350.8543,642,7503.651,593,2480.1313,363,1321.121,7800.000156,044,0780.5110,375,0340.8782,871,3656.93
Oceanites oceanicus8,172,7570.6953,982,1744.581,518,2130.1319,561,6011.662,2020.000196,101,2430.5210,501,1410.8997,111,6238.24
Thalassarche chlororhynchos10,390,4490.9341,856,1393.741,766,0940.1614,374,6961.292,0350.000185,822,9590.526,943,8030.6279,491,4037.11
Taeniopygia guttata5,985,0510.4951,144,9024.15883,3240.07250,817,6044.124,7130.0003813,099,8291.0625,800,7762.09137,289,21711.10
Gallus gallus13,929,7891.3378,779,2797.52571,0670.05521,043,1142.011,6380.0001620,514,5321.9610,603,8611.01129,394,28812.40
 SpeciesDNALINESINELTROtherUnknownTRFTotal
Length (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genomeLength (bp)% in genome
Eudyptes chrysolophus schlegeli10,967,9930.8456,600,2584.321,886,0420.1423,772,8201.811,7090.000137,181,8430.5527,041,0732.06122,778,3149.37
Eudyptes chrysolophus chrysolophus9,840,5770.7281,007,8975.922,325,6300.1742,950,4883.142,1090.000156,349,6690.467,624,7520.56147,221,28310.80
Eudyptes pachyrhynchus9,700,5490.7457,537,4114.391,761,6710.1326,951,8712.067,1630.000558,778,9950.6715,315,1091.17115,154,4998.78
Eudyptes robustus10,035,1610.8054,876,9084.401,694,8960.1421,900,2401.751,1970.0000966,793,7840.5413,082,3501.05105,161,0388.42
Eudyptes sclateri9,603,1060.7957,388,3364.741,648,5340.1422,555,2831.862,1550.000185,455,8960.457,045,8580.58101,615,9428.39
Eudyptes filholi9,447,8240.7758,471,1854.781,894,9150.1623,146,9531.892,6620.000228,146,7130.677,812,6340.64104,766,9148.56
Eudyptes chrysocome9,067,9620.7458,040,2644.711,608,6440.1322,515,8091.832,0950.000177,321,7220.607,332,6110.60103,276,4478.39
Eudyptes moseleyi9,367,9540.7258,805,4254.501,990,4690.1523,593,7671.812,6640.000209,786,6330.7545,959,2933.52141,103,33010.80
Megadyptes antipodes antipodes9,608,3490.7378,978,6185.991,728,5240.1346,464,4183.531,0590.0000808,168,7850.627,802,0480.59148,977,69311.30
Spheniscus magellanicus10,393,3490.8265,351,0675.181,812,3550.1426,759,5432.121,5460.000129,851,2370.7810,398,9340.82118,099,1799.35
Spheniscus demersus9,811,4670.7772,969,2935.711,610,1710.1334,709,6832.721,5090.0001220,385,5571.596,712,6980.53130,219,70910.2
Spheniscus mendiculus10,792,0370.8380,340,7736.181,694,4280.1343,906,0263.382,2650.0001713,023,3351.007,421,9790.57147,721,43111.4
Spheniscus humboldti9,850,5230.8063,427,9715.102,095,4390.1726,032,1872.092,6100.000217,051,3640.5710,846,5630.87115,794,6799.31
Eudyptula minor albosignata10,287,2540.7586,732,4466.312,230,4420.1649,548,7593.612,2850.0001710,370,6410.768,661,2850.63160,541,23911.70
Eudyptula minor minor10,691,1410.7395,293,4826.501,790,4480.1262,515,5344.262,2450.000158,460,2990.589,083,7820.62183,740,28412.5
Eudyptula novaehollandiae10,542,9980.7887,757,4666.461,654,9000.1253,144,6573.921,5220.0001112,914,7200.958,531,8300.63164,989,80112.20
Pygoscelis adeliae8,905,9650.7352,089,8164.281,643,6840.1417,580,6861.451,6850.000146,938,9500.578,565,4830.7093,839,1287.71
Pygoscelis papua10,878,0360.8379,578,5036.081,683,5740.1347,004,7883.592,1630.000178,393,8770.647,857,9580.60151,240,87711.60
Pygoscelis antarctica10,021,1090.7975,467,7825.961,660,0230.1336,515,9882.891,6450.000135,649,5210.456,850,7330.54133,620,72810.60
Aptenodytes patagonicus9,883,8300.7972,143,8445.741,669,2480.1333,210,7182.642,2730.000185,987,8570.486,868,1650.55126,913,55410.10
Aptenodytes forsteri9,648,9880.7747,421,2283.781,755,2520.1414,998,9791.201,0550.0000845,984,1140.4828,075,5182.24103,411,4678.24
Hydrobates tethys10,174,8350.8543,642,7503.651,593,2480.1313,363,1321.121,7800.000156,044,0780.5110,375,0340.8782,871,3656.93
Oceanites oceanicus8,172,7570.6953,982,1744.581,518,2130.1319,561,6011.662,2020.000196,101,2430.5210,501,1410.8997,111,6238.24
Thalassarche chlororhynchos10,390,4490.9341,856,1393.741,766,0940.1614,374,6961.292,0350.000185,822,9590.526,943,8030.6279,491,4037.11
Taeniopygia guttata5,985,0510.4951,144,9024.15883,3240.07250,817,6044.124,7130.0003813,099,8291.0625,800,7762.09137,289,21711.10
Gallus gallus13,929,7891.3378,779,2797.52571,0670.05521,043,1142.011,6380.0001620,514,5321.9610,603,8611.01129,394,28812.40

Protein-coding gene annotation

We used the annotation methods developed by The Bird 10,000 Genomes (B10K) consortium [90] to annotate the 21 penguin genomes. Prior to annotating the protein-coding genes, a non-redundant avian reference gene set, consisting of protein sequences from Taeniopygia guttata and Gallus gallus, was generated [71]. Whole-genome protein sequences of Ensembl gene sets (release-85) of Taeniopygia guttataand Gallus gallus were then used to identify 12,337 orthologs based on whole-genome synteny relationships that were downloaded from the UCSC Genome Browser [91]. For both Taeniopygia guttata and Gallus gallus, we compared the 2 proteins in each ortholog and chose the longer homologous sequence with the human ortholog protein sequence in the reference gene set. Within 12,337 orthologs, 6,888 from Taeniopygia guttata and 5,449 from Gallus gallus were selected as the reference gene set. Following this, specific genes of Taeniopygia guttata or Gallus gallus were added to the reference gene set. This reference gene set comprised 5,084 Taeniopygia guttata genes without Gallus gallus orthologs and 3,158 G. gallus genes that had not been identified as ortholog genes to Taeniopygia guttata. Finally, protein sequences were filtered if they contained <50 amino acids, consisted of function as transposons/retrotransposons, or contained only a single non-functional exon. The final avian reference gene set therefore contained 20,181 protein-coding genes.

To annotate the protein-coding genes from the penguin genomes, protein sequences from the avian reference gene set were then mapped to each of the 21 penguin genomes. First, protein sequences were aligned to each penguin genome using TBLASTN v. 2.2.2 (TBLASTN, RRID:SCR_011822) [92] with a 1e−5 e-value cut-off. Multiple adjacent hits from the same protein were then linked together using genBlastA v. 1.0.4 [93] to obtain the candidate gene boundary. A candidate hit was removed if a protein had <30% amino acids aligned to the penguin genome. For each candidate hit for each protein, we extracted genomic sequences covering this hit with 2 kb upstream and downstream of the extension. Extracted genome sequences and corresponding homologous protein sequences were then prepared as input for GeneWise v. 2.4.1 (GeneWise, RRID:SCR_015054) [94] to the annotated protein-coding gene models, which included exon and intron boundaries. Coding sequences for each annotated gene model were extracted from each genome according to the annotated gene model, and then each coding sequence was translated into the protein sequence. This annotated protein sequence was then aligned with the corresponding homolog protein sequence using MUSCLE v. 3.8.31 (MUSCLE, RRID:SCR_011812) [95], while removing annotated proteins with <40% identity with the corresponding homolog protein sequence. Annotated proteins with <30 amino acids and annotated proteins containing >2 frame shifts or 1 premature stop codon were then removed. If a genome locus had been annotated using several gene models, the gene model with the highest identity with the corresponding homolog protein was selected. Therefore, the annotated gene set for our penguin genomes contained no overlapping genes.

Protein sequences from human (hg38) and avian transcripts were also mapped to each penguin genome and the annotated gene models (as above). For the avian transcripts dataset, we obtained 71 avian transcriptomic samples from NCBI [96] (Supplementary Table 3) and assembled those into transcripts using either Newbler v2.9 [97] for 454 sequencing assemblies or Trinity v20140717 [98] for Illumina sequencing assemblies. We used ORFfinder [96] to identify open reading frames (ORFs) for transcripts, and the protein sequences were then translated from the ORF. The protein sequences translated from the transcripts were then mapped to the avian reference gene set and the human protein sequences, while removing those with similarity to the avian reference gene set or the human protein sequences. Transcripts with ORF length <150 bp were also removed. Protein sequences from 5,257 transcripts were then used for annotation. Three gene model sets annotated from the avian reference gene set, the human protein sequences, and transcriptome were then combined into a final non-redundant gene set. We prioritized 3 gene model sets in the following order: avian reference gene set > human protein > transcriptome.

After applying the above methods, we annotated the 19 newly assembled penguin genomes, as well as the 2 previously published penguin genomes [51]. We identified ∼16,000 genes on each penguin genome, which is similar to the genomes of Taeniopygia guttata and Gallus gallus. The average gene length and coding sequence length are ∼19 and 1.3 kb, respectively. Each gene encompasses ∼8 exons, with an average length of 170 bp. Intron lengths are an average length of 2.6 kb (Table 5).

Table 5:

Protein-coding gene statistics of all 21 penguin genomes and 5 outgroups

SpeciesNumber of protein-coding genesMean gene length (bp)Mean coding sequence length (bp)Mean exons per geneMean exon length (bp)Mean intron length (bp)
Eudyptes chrysolophus schlegeli17,19118,8601,3517.91712,540
Eudyptes chrysolophus chrysolophus16,31120,2481,3928.21702,623
Eudyptes pachyrhynchus19,17017,3941,3067.41782,535
Eudyptes robustus17,12616,2541,2957.41742,329
Eudyptes sclateri15,78619,6271,4028.21712,527
Eudyptes filholi15,96319,9591,4078.21712,562
Eudyptes chrysocome16,28019,4361,3828.11712,555
Eudyptes moseleyi16,81219,7671,3708.01712,621
Megadyptes antipodes antipodes16,56318,5091,3347.81712,533
Spheniscus magellanicus16,79519,3111,3818.11712,535
Spheniscus demersus16,13419,0291,3447.81712,584
Spheniscus mendiculus16,39017,0971,3117.61722,382
Spheniscus humboldti16,58719,6421,3878.11702,558
Eudyptula minor albosignata17,42418,8371,3387.81722,574
Eudyptula minor minor17,80219,0781,3497.81722,598
Eudyptula novaehollandiae17,18819,2711,3557.91722,609
Pygoscelis adeliae14,46320,5951,3858.31682,648
Pygoscelis papua16,69818,2761,3337.81722,503
Pygoscelis antarctica15,48819,5201,3818.11712,558
Aptenodytes patagonicus15,19519,5961,3848.11702,552
Aptenodytes forsteri15,59319,8441,3818.11702,584
Hydrobates tethys15,91517,8981,3448.11652,323
Oceanites oceanicus16,05517,9361,3568.01702,377
Thalassarche chlororhynchos13,34710,0291,1106.41751,667
Taeniopygia guttata19,17414,7871,1967.21672,198
Gallus gallus17,88316,9651,4148.31712,135
SpeciesNumber of protein-coding genesMean gene length (bp)Mean coding sequence length (bp)Mean exons per geneMean exon length (bp)Mean intron length (bp)
Eudyptes chrysolophus schlegeli17,19118,8601,3517.91712,540
Eudyptes chrysolophus chrysolophus16,31120,2481,3928.21702,623
Eudyptes pachyrhynchus19,17017,3941,3067.41782,535
Eudyptes robustus17,12616,2541,2957.41742,329
Eudyptes sclateri15,78619,6271,4028.21712,527
Eudyptes filholi15,96319,9591,4078.21712,562
Eudyptes chrysocome16,28019,4361,3828.11712,555
Eudyptes moseleyi16,81219,7671,3708.01712,621
Megadyptes antipodes antipodes16,56318,5091,3347.81712,533
Spheniscus magellanicus16,79519,3111,3818.11712,535
Spheniscus demersus16,13419,0291,3447.81712,584
Spheniscus mendiculus16,39017,0971,3117.61722,382
Spheniscus humboldti16,58719,6421,3878.11702,558
Eudyptula minor albosignata17,42418,8371,3387.81722,574
Eudyptula minor minor17,80219,0781,3497.81722,598
Eudyptula novaehollandiae17,18819,2711,3557.91722,609
Pygoscelis adeliae14,46320,5951,3858.31682,648
Pygoscelis papua16,69818,2761,3337.81722,503
Pygoscelis antarctica15,48819,5201,3818.11712,558
Aptenodytes patagonicus15,19519,5961,3848.11702,552
Aptenodytes forsteri15,59319,8441,3818.11702,584
Hydrobates tethys15,91517,8981,3448.11652,323
Oceanites oceanicus16,05517,9361,3568.01702,377
Thalassarche chlororhynchos13,34710,0291,1106.41751,667
Taeniopygia guttata19,17414,7871,1967.21672,198
Gallus gallus17,88316,9651,4148.31712,135
Table 5:

Protein-coding gene statistics of all 21 penguin genomes and 5 outgroups

SpeciesNumber of protein-coding genesMean gene length (bp)Mean coding sequence length (bp)Mean exons per geneMean exon length (bp)Mean intron length (bp)
Eudyptes chrysolophus schlegeli17,19118,8601,3517.91712,540
Eudyptes chrysolophus chrysolophus16,31120,2481,3928.21702,623
Eudyptes pachyrhynchus19,17017,3941,3067.41782,535
Eudyptes robustus17,12616,2541,2957.41742,329
Eudyptes sclateri15,78619,6271,4028.21712,527
Eudyptes filholi15,96319,9591,4078.21712,562
Eudyptes chrysocome16,28019,4361,3828.11712,555
Eudyptes moseleyi16,81219,7671,3708.01712,621
Megadyptes antipodes antipodes16,56318,5091,3347.81712,533
Spheniscus magellanicus16,79519,3111,3818.11712,535
Spheniscus demersus16,13419,0291,3447.81712,584
Spheniscus mendiculus16,39017,0971,3117.61722,382
Spheniscus humboldti16,58719,6421,3878.11702,558
Eudyptula minor albosignata17,42418,8371,3387.81722,574
Eudyptula minor minor17,80219,0781,3497.81722,598
Eudyptula novaehollandiae17,18819,2711,3557.91722,609
Pygoscelis adeliae14,46320,5951,3858.31682,648
Pygoscelis papua16,69818,2761,3337.81722,503
Pygoscelis antarctica15,48819,5201,3818.11712,558
Aptenodytes patagonicus15,19519,5961,3848.11702,552
Aptenodytes forsteri15,59319,8441,3818.11702,584
Hydrobates tethys15,91517,8981,3448.11652,323
Oceanites oceanicus16,05517,9361,3568.01702,377
Thalassarche chlororhynchos13,34710,0291,1106.41751,667
Taeniopygia guttata19,17414,7871,1967.21672,198
Gallus gallus17,88316,9651,4148.31712,135
SpeciesNumber of protein-coding genesMean gene length (bp)Mean coding sequence length (bp)Mean exons per geneMean exon length (bp)Mean intron length (bp)
Eudyptes chrysolophus schlegeli17,19118,8601,3517.91712,540
Eudyptes chrysolophus chrysolophus16,31120,2481,3928.21702,623
Eudyptes pachyrhynchus19,17017,3941,3067.41782,535
Eudyptes robustus17,12616,2541,2957.41742,329
Eudyptes sclateri15,78619,6271,4028.21712,527
Eudyptes filholi15,96319,9591,4078.21712,562
Eudyptes chrysocome16,28019,4361,3828.11712,555
Eudyptes moseleyi16,81219,7671,3708.01712,621
Megadyptes antipodes antipodes16,56318,5091,3347.81712,533
Spheniscus magellanicus16,79519,3111,3818.11712,535
Spheniscus demersus16,13419,0291,3447.81712,584
Spheniscus mendiculus16,39017,0971,3117.61722,382
Spheniscus humboldti16,58719,6421,3878.11702,558
Eudyptula minor albosignata17,42418,8371,3387.81722,574
Eudyptula minor minor17,80219,0781,3497.81722,598
Eudyptula novaehollandiae17,18819,2711,3557.91722,609
Pygoscelis adeliae14,46320,5951,3858.31682,648
Pygoscelis papua16,69818,2761,3337.81722,503
Pygoscelis antarctica15,48819,5201,3818.11712,558
Aptenodytes patagonicus15,19519,5961,3848.11702,552
Aptenodytes forsteri15,59319,8441,3818.11702,584
Hydrobates tethys15,91517,8981,3448.11652,323
Oceanites oceanicus16,05517,9361,3568.01702,377
Thalassarche chlororhynchos13,34710,0291,1106.41751,667
Taeniopygia guttata19,17414,7871,1967.21672,198
Gallus gallus17,88316,9651,4148.31712,135

Gene function annotation

To assign functions to each gene, we aligned each gene to 3 functional databases: Swiss-Prot release-2019_03 [99], InterPro v. 68.0 (InterPro, RRID:SCR_006695) [100], and KEGG v89.1 (KEGG, RRID:SCR_012773) [101]. Protein sequences of each gene were aligned to Swiss-Prot database using BLASTP [92], and the function of the best hit was selected as the function annotation for this gene. We then searched InterPro databases that encompass ProDom, PRINTS, Pfam, SMART, PANTHER, ProSiteProfiles, and ProSitePatterns to obtain the motifs and domains for each gene. Gene Ontology [102] terms for each gene were obtained from the corresponding InterPro entry. To identify the pathways in which the gene might be involved, protein sequences for each gene were then aligned against the KEGG database using BLASTP. For each penguin genome, a total of >99% of the protein-coding genes were assigned ≥1 function annotation in each penguin, which is similar to the 5 outgroups (Table 6). Overall, >95% of the protein genes were assigned a Swiss-Prot function, demonstrating high-quality gene sets.

Table 6:

Function annotation results for protein-coding genes for 21 penguins and 5 outgroups

SpeciesSwissprotKEGGInterproOverall
Number%Number%Number%Number%
Eudyptes chrysolophus schlegeli16,73997.3715,34789.2716,91698.4017,06499.26
Eudyptes chrysolophus chrysolophus15,86397.2514,64689.7916,05198.4116,19199.26
Eudyptes pachyrhynchus18,68097.4417,25089.9818,87398.4519,02899.26
Eudyptes robustus16,58096.8115,50090.5116,81698.1916,98899.19
Eudyptes sclateri15,38397.4514,17289.7815,54098.4415,66499.23
Eudyptes filholi15,55597.4414,36289.9715,69698.3315,84099.23
Eudyptes chrysocome15,69296.3914,73290.4915,97798.1416,14899.19
Eudyptes moseleyi16,37797.4115,15390.1316,54098.3816,68899.26
Megadyptes antipodes antipodes15,75595.1214,99390.5216,26498.1916,44599.29
Spheniscus magellanicus16,37197.4815,13690.1216,53298.4316,67099.26
Spheniscus demersus15,38895.3814,57990.3615,83998.1716,00199.18
Spheniscus mendiculus15,71495.8814,80190.3116,09098.1716,25499.17
Spheniscus humboldti16,17297.5014,95490.1516,31998.3816,46099.23
Eudyptula minor albosignata16,61595.3615,77890.5517,09898.1317,29799.27
Eudyptula minor minor16,99495.4616,07390.2917,47698.1717,66399.22
Eudyptula novaehollandiae16,42395.5515,56190.5316,89298.2817,06099.26
Pygoscelis adeliae13,96496.5513,05490.2614,22098.3214,34899.20
Pygoscelis papua15,93195.4115,09790.4116,37898.0816,55399.13
Pygoscelis antarctica15,05097.1713,85389.4415,22498.3015,36099.17
Aptenodytes patagonicus14,80897.4513,49388.8014,95498.4115,06399.13
Aptenodytes forsteri15,05396.5414,11290.5015,30898.1715,47899.26
Hydrobates tethys15,49397.3514,27389.6815,62898.2015,77599.12
Oceanites oceanicus15,62297.3014,41289.7715,77598.2615,91999.15
Thalassarche chlororhynchos12,95897.0911,88189.0213,07297.9413,21999.04
Taeniopygia guttata18,36795.7917,11589.2618,53796.6818,91898.66
Gallus gallus16,76093.7215,58587.1517,07995.5017,26396.53
SpeciesSwissprotKEGGInterproOverall
Number%Number%Number%Number%
Eudyptes chrysolophus schlegeli16,73997.3715,34789.2716,91698.4017,06499.26
Eudyptes chrysolophus chrysolophus15,86397.2514,64689.7916,05198.4116,19199.26
Eudyptes pachyrhynchus18,68097.4417,25089.9818,87398.4519,02899.26
Eudyptes robustus16,58096.8115,50090.5116,81698.1916,98899.19
Eudyptes sclateri15,38397.4514,17289.7815,54098.4415,66499.23
Eudyptes filholi15,55597.4414,36289.9715,69698.3315,84099.23
Eudyptes chrysocome15,69296.3914,73290.4915,97798.1416,14899.19
Eudyptes moseleyi16,37797.4115,15390.1316,54098.3816,68899.26
Megadyptes antipodes antipodes15,75595.1214,99390.5216,26498.1916,44599.29
Spheniscus magellanicus16,37197.4815,13690.1216,53298.4316,67099.26
Spheniscus demersus15,38895.3814,57990.3615,83998.1716,00199.18
Spheniscus mendiculus15,71495.8814,80190.3116,09098.1716,25499.17
Spheniscus humboldti16,17297.5014,95490.1516,31998.3816,46099.23
Eudyptula minor albosignata16,61595.3615,77890.5517,09898.1317,29799.27
Eudyptula minor minor16,99495.4616,07390.2917,47698.1717,66399.22
Eudyptula novaehollandiae16,42395.5515,56190.5316,89298.2817,06099.26
Pygoscelis adeliae13,96496.5513,05490.2614,22098.3214,34899.20
Pygoscelis papua15,93195.4115,09790.4116,37898.0816,55399.13
Pygoscelis antarctica15,05097.1713,85389.4415,22498.3015,36099.17
Aptenodytes patagonicus14,80897.4513,49388.8014,95498.4115,06399.13
Aptenodytes forsteri15,05396.5414,11290.5015,30898.1715,47899.26
Hydrobates tethys15,49397.3514,27389.6815,62898.2015,77599.12
Oceanites oceanicus15,62297.3014,41289.7715,77598.2615,91999.15
Thalassarche chlororhynchos12,95897.0911,88189.0213,07297.9413,21999.04
Taeniopygia guttata18,36795.7917,11589.2618,53796.6818,91898.66
Gallus gallus16,76093.7215,58587.1517,07995.5017,26396.53
Table 6:

Function annotation results for protein-coding genes for 21 penguins and 5 outgroups

SpeciesSwissprotKEGGInterproOverall
Number%Number%Number%Number%
Eudyptes chrysolophus schlegeli16,73997.3715,34789.2716,91698.4017,06499.26
Eudyptes chrysolophus chrysolophus15,86397.2514,64689.7916,05198.4116,19199.26
Eudyptes pachyrhynchus18,68097.4417,25089.9818,87398.4519,02899.26
Eudyptes robustus16,58096.8115,50090.5116,81698.1916,98899.19
Eudyptes sclateri15,38397.4514,17289.7815,54098.4415,66499.23
Eudyptes filholi15,55597.4414,36289.9715,69698.3315,84099.23
Eudyptes chrysocome15,69296.3914,73290.4915,97798.1416,14899.19
Eudyptes moseleyi16,37797.4115,15390.1316,54098.3816,68899.26
Megadyptes antipodes antipodes15,75595.1214,99390.5216,26498.1916,44599.29
Spheniscus magellanicus16,37197.4815,13690.1216,53298.4316,67099.26
Spheniscus demersus15,38895.3814,57990.3615,83998.1716,00199.18
Spheniscus mendiculus15,71495.8814,80190.3116,09098.1716,25499.17
Spheniscus humboldti16,17297.5014,95490.1516,31998.3816,46099.23
Eudyptula minor albosignata16,61595.3615,77890.5517,09898.1317,29799.27
Eudyptula minor minor16,99495.4616,07390.2917,47698.1717,66399.22
Eudyptula novaehollandiae16,42395.5515,56190.5316,89298.2817,06099.26
Pygoscelis adeliae13,96496.5513,05490.2614,22098.3214,34899.20
Pygoscelis papua15,93195.4115,09790.4116,37898.0816,55399.13
Pygoscelis antarctica15,05097.1713,85389.4415,22498.3015,36099.17
Aptenodytes patagonicus14,80897.4513,49388.8014,95498.4115,06399.13
Aptenodytes forsteri15,05396.5414,11290.5015,30898.1715,47899.26
Hydrobates tethys15,49397.3514,27389.6815,62898.2015,77599.12
Oceanites oceanicus15,62297.3014,41289.7715,77598.2615,91999.15
Thalassarche chlororhynchos12,95897.0911,88189.0213,07297.9413,21999.04
Taeniopygia guttata18,36795.7917,11589.2618,53796.6818,91898.66
Gallus gallus16,76093.7215,58587.1517,07995.5017,26396.53
SpeciesSwissprotKEGGInterproOverall
Number%Number%Number%Number%
Eudyptes chrysolophus schlegeli16,73997.3715,34789.2716,91698.4017,06499.26
Eudyptes chrysolophus chrysolophus15,86397.2514,64689.7916,05198.4116,19199.26
Eudyptes pachyrhynchus18,68097.4417,25089.9818,87398.4519,02899.26
Eudyptes robustus16,58096.8115,50090.5116,81698.1916,98899.19
Eudyptes sclateri15,38397.4514,17289.7815,54098.4415,66499.23
Eudyptes filholi15,55597.4414,36289.9715,69698.3315,84099.23
Eudyptes chrysocome15,69296.3914,73290.4915,97798.1416,14899.19
Eudyptes moseleyi16,37797.4115,15390.1316,54098.3816,68899.26
Megadyptes antipodes antipodes15,75595.1214,99390.5216,26498.1916,44599.29
Spheniscus magellanicus16,37197.4815,13690.1216,53298.4316,67099.26
Spheniscus demersus15,38895.3814,57990.3615,83998.1716,00199.18
Spheniscus mendiculus15,71495.8814,80190.3116,09098.1716,25499.17
Spheniscus humboldti16,17297.5014,95490.1516,31998.3816,46099.23
Eudyptula minor albosignata16,61595.3615,77890.5517,09898.1317,29799.27
Eudyptula minor minor16,99495.4616,07390.2917,47698.1717,66399.22
Eudyptula novaehollandiae16,42395.5515,56190.5316,89298.2817,06099.26
Pygoscelis adeliae13,96496.5513,05490.2614,22098.3214,34899.20
Pygoscelis papua15,93195.4115,09790.4116,37898.0816,55399.13
Pygoscelis antarctica15,05097.1713,85389.4415,22498.3015,36099.17
Aptenodytes patagonicus14,80897.4513,49388.8014,95498.4115,06399.13
Aptenodytes forsteri15,05396.5414,11290.5015,30898.1715,47899.26
Hydrobates tethys15,49397.3514,27389.6815,62898.2015,77599.12
Oceanites oceanicus15,62297.3014,41289.7715,77598.2615,91999.15
Thalassarche chlororhynchos12,95897.0911,88189.0213,07297.9413,21999.04
Taeniopygia guttata18,36795.7917,11589.2618,53796.6818,91898.66
Gallus gallus16,76093.7215,58587.1517,07995.5017,26396.53

Phylogenomic reconstruction

To understand the evolutionary history of all extant penguins, we created a phylogeny of penguins using the genomic-level orthologs with coalescent-based ExaML and concatenation-based methods MP-EST and ASTRAL [103–105]. We first applied rigorous filtering steps to obtain 7,235 high-quality orthologs. This was achieved by filtering ∼13,214 orthologs (BLAST reciprocal best hits [RBHs]) that were present in the Taeniopygia guttata genome and the 21 penguins/5 avian outgroup genomes (described above), retaining orthologs with no missing data, and removing sequences containing internal stop codons. We aligned and filtered our alignment data using several methods: (i) protein sequences were aligned using MAFTT v. 7.313 [106] following “linsi” parameters for local, iterative progressive alignment; (ii) we also applied column-based alignment filtering using trimAl v. 1.4.rev22 [107], using the parameter “automated1” to heuristically choose trimming parameters based on input alignment characters; (iii) nucleic acid alignments were also obtained using trimAl, using the parameter “backtrans” to obtain a back-translation for a given amino acid alignment. Alignment filtering was applied to (i) the column-based alignments, by removing all missing data, and retaining alignment lengths >50 bp (resulting in 7,229 orthologs, the “TrimAl data” set); and (ii) applying a full-matrix occupancy to the no missing dataset (retaining 7,011 orthologs, the “No missing data” set) following the pipeline published previously [108]. Loci containing no missing taxa were then retained, by removing alignment columns containing gaps, undetermined bases (Ns), or ambiguity characters and loci with a post-filtering alignment length <200 bp.

We constructed gene trees for each locus using RAxML v8.2.12 (RAxML, RRID:SCR_006086) [109] and then constructed phylogenomic trees using 2 coalescent-based methods, MP-EST v. 2.0 and ASTRAL-III, based on the gene trees. First, we used RAxML v. 8.2.12 to infer the highest-scoring maximum likelihood tree from unpartitioned alignments for each locus using a GTR+GAMMA substitution model, 20 independent tree searches beginning from random starting tree topologies, and 500 bootstrap replicates for each locus. Resulting gene trees were rooted with Gallus gallus using the “ape” package in R v. 3.5.2 [110]. We then created a coalescent-based phylogenetic tree using MP-EST v. 2.0 [104] by estimating trees from a set of rooted gene trees by maximizing a pseudo-likelihood function. Species tree and bootstrap topology searches were achieved over 3 independent replicates, using a different starting seed and with 10 independent tree searches per run. The highest-scoring tree in 10 tree searches was kept as the result for each replicate. Because the 3 final trees from MP-EST replicates shared the same tree topology, we kept the highest-scoring tree as the final tree for further analysis. Branch lengths were re-estimated in coalescent units of substitutions per site by constraining alignments to the MP-EST tree topology using the “-f E” option in ExaML v.3.0.21 [103]. Bootstrap values were plotted using RAxML based on the bootstrap replicates, and trees were outgroup-rooted with G. gallus. In addition, we used the coalescent-based method ASTRAL-III [105] with default parameters to obtain the tree with the maximum number of shared induced quartet trees in the set of unrooted gene trees, constrained by the set of bipartitions in the tree based on a predefined set of partitions. The inferenced trees also shared the same tree topology with the MP-EST results. Then, the concatenation-based phylogenomic inference was conducted using ExaML v3.0.21. This was achieved using a GTR+GAMMA substitution model on the partitioned (each locus as a separate partition), concatenated alignments, and inferring the topology from 21 full maximum likelihood tree searchers: 20 beginning with random starting trees, and a single search beginning with the random stepwise addition order parsimony tree conducted using RAxML. For each dataset, 100 ExaML bootstrap replicates were conducted and convergence was assessed according to the bootstrapping analysis and applying a majority-rule consensus tree criterion in RAxML with option “-I autoMRE”. We then compared the resulting trees obtained using the “TrimAl data” and the “No missing data” from coalescent-based MP-EST and ASTRAL with concatenation-based ExaML (Supplementary Fig. 1).

While the resulting topologies of the outgroups Hydrobates tethys, Oceanites oceanicus, and Thalassarche chlororhynchos are slightly different between coalescent-based and concatenation-based methods, the topologies of our penguin genomes are identical using both methods (Fig. 3). Our final phylogeny (Fig. 3) encompassing all extant penguin genomes is slightly different to a recent phylogenetic study using mitochondrial genomes [3]. Specifically, while the mitochondrial phylogeny suggested that Aptenodytes+Pygoscelis are sister to all other penguins, our full genome phylogeny suggests that Aptenodytes alone is sister to all other penguins. This result confirms earlier results combining data from a small set of mitochondrial genes and the nuclear RAG-1gene [1, 62] and provides intriguing new evidence on the historical biogeographical and evolutionary patterns of adaptation to Antarctica. We expect this novel genomic dataset to provide further important insights into the evolution of penguins in the southern hemisphere.

Figure 3:

Phylogenomic reconstruction of penguins inferred by the ExaML method with no missing data. The topology of all clades was strongly supported (bootstrap support: 100). The topology and support were identical using the MP-EST and ASTRAL methods (with no missing data) except for the outgroup (bootstrap support for the split between Hydrobates tethys and Oceanites oceanicus: 37) and within the penguin genus Spheniscus (bootstrap support for the split between the African penguin [Spheniscus demersus] and the magellanic penguin [S. magellanicus]: 97).

Re-use Potential

Consortium organization and further research plans

The 19 high-coverage genomes presented here, along with the Aptenodytes forsteri and Pygoscelis adeliae genomes presented by members of our consortium in 2014 [51], provide an exciting resource for understanding evolutionary diversification, the molecular basis for unique functional adaptation, and demographic histories of penguins. The Penguin Genome Consortium is an international team of scientists with backgrounds in marine ornithology, ecology, molecular biology, evolutionary and comparative genomics, phylogenetics, physiology, palaeontology, veterinary science, and bioinformatics. The diverse skills encompassed within our highly collaborative consortium will be essential to study these genomes under comparative genomic and evolutionary frameworks. In doing so, we will expand on [51] by investigating 3 key areas related to penguin evolution and adaptation.

Evolutionary relationships and taxonomic boundaries

With a deep evolutionary history, and diverse radiation, penguins provide an exciting system to understand the evolutionary drivers of diversification [3]. Moreover, robust taxonomic frameworks can be crucial for directing limited conservation resources for maximum gains. Significant uncertainty remains regarding species/lineage boundaries between some closely related penguin taxa. The genomes generated here therefore provide an exciting new dataset to examine taxonomic, phylogenomic, and biogeographical patterns for understanding penguin evolution.

Comparative genomics and adaptation

Penguins provide an excellent system to study comparative evolutionary adaptation [51]. We will use our genomes to explore comparative evolution among penguins, and between penguins and other avian orders. By examining loci under positive selection, we shall reveal the molecular basis for the unique physiological and morphological adaptations to different environments and ecologies that are exhibited by penguins.

Penguins in a changing world

Penguins are sensitive indicators of environmental change [44, 45]. It is predicted that future climate change will lead to significant declines in many penguin populations [47–50]. Conservation management decisions can be guided by demographic assessments. However, there remains a substantial gap in predicting ecosystem-wide changes to future climate change. As such, demographic analyses of these genomes will be critical for conservation management of penguins and other Southern Ocean assemblages.

Cultural significance

The context in which wildlife research in New Zealand is undertaken is evolving rapidly and heading into new legal and novel cultural contexts [111–114]. Recent initiatives such as the bestowing of the rights of an individual on Te Urewera, a former national park, set an international precedent for this change in approach [115]. Therefore, it is critical that research permissions be obtained and appropriate indigenous consultation with Iwi, Rūnanga, Whānau, and Hapū be conducted. The regulatory arm of the New Zealand government in this process, the Department of Conservation, is legally required to give effect to the Principles of the Treaty of Waitangi [116] in its administration of the legislation pursuant to which Authorities are issues.

At another level the Ngāi Tahu Deed of Settlement Act recognizes all native penguin species as Taonga, or treasured possessions [117]. Consequently, not only is it a legal requirement to undertake rigorous Māori consultation when studying Taonga [118, 119], the Department of Conservation has to have particular regard to the views of Iwi, Rūnanga, Whānau, or Hapū when considering whether to authorize any application. Recent discussions have also emphasized that Taonga genomes are sacred (tapu) because they are considered to contain both the living and the future generations (whakapapa, mauri, and wairua of tipuna), engendering Māori concerns surrounding the commercialization, ownership, storage, and modification of Taonga genomes [120]. We generated Taonga genomes encompassing hoiho (yellow-eyed penguin, Megadyptes antipodes antipodes), kororā (little penguin, Eudyptula spp.), pokotiwha (Snares-crested penguin, E. robustus), tawaki (Fiordland-crested penguin, E. pachyrhynchus), and erect-crested penguin (Eudyptes sclateri). These genomes were obtained following rigorous Department of Conservation permitting procedures (including collection, holding, and exporting permits) and following Department of Conservation Iwi, Rūnanga, Whānau, or Hapū consultation (Supplementary Table 1). Several of the Taonga genomes studied here were collected alongside broader research projects, and additional consultation efforts were undertaken for those projects. We emphasize that there will be no commercialization, ownership, or modification of any of the genomes presented here. While these Taonga genomes will be publicly available, it is critical that new researchers studying these genomes take the appropriate steps to seek additional Māori permissions and consultation, which will ensure respect of New Zealand cultural values.

The emerging issues surrounding the generation and use of Taonga genomes also highlight that Māori consultation should also be undertaken when obtaining genomes from Taonga housed in overseas museum collections. We hope that the data and our research questions presented here, and our future research outputs using these genomes will be valuable for both cultural heritage and for conservation management of penguin populations.

Early-release use of the data

The Fort Lauderdale [121] and Toronto [122] agreements state that in exchange for early release of datasets, the data producers retain the right to be the first to describe and analyse the complete datasets in peer-reviewed publications. Comparative and evolutionary genomic analyses are currently being carried out, and the consortium welcomes new members interested in contributing to this work. While this work is still underway we have published these 19 penguin genomes to provide early access, while requesting researchers intending to use these data for similar cross-species comparisons to continue to follow the long-running Fort Lauderdale and Toronto rules.

Conclusions

Genomics is prohibitively costly—it requires high-quality samples and extensive laboratory and bioinformatic skills. The genomics era has been boosted by global research consortiums, which bring together contextual, technical, and analytical skills spanning a network of international collaborations [123–126]. Our consortium and dataset introduced here are no exception, and as such, we expect our future research using these genomes to bring together additional collaborators that encompass a wide range of expertise regarding penguin biology and physiology. At another level, collecting high-quality fresh blood samples from some of the most remote regions in the Southern Ocean remains technically and logistically difficult, requiring the efforts and long-term organization from many collaborations and expedition programs. While this study is an exciting development for understanding the evolution of penguins, the global efforts involved in designing our study, obtaining samples, and developing appropriate sequencing and bioinformatic pipelines have been extensive. The dataset and project design introduced here highlight the need for transparent research projects and global collaborations, which together maximize the use of samples, minimizing sequencing costs, and laboratory and analytical efforts.

In this study we have presented 19 new high-coverage penguin genomes. Together with 2 genomes previously obtained by members of our consortium [51], this combined dataset encompasses the genomes of all extant penguin species. We have also constructed a comprehensive phylogenomic tree encompassing all extant penguins. We will use these datasets to address a range of evolutionary, adaptive, biogeographic, and demographic questions regarding penguins. As such, we hope not only that our ongoing projects that encompass these genomes will provide novel insights for understanding the broad evolution and adaptation of avifauna to different environments but also that this knowledge will increase cultural heritage and aid conservation management decisions for remote Southern Ocean regions.

Availability of supporting data and materials

The genome sequencing data and assemblies of this study have been deposited in the CNSA (https://db.cngb.org/cnsa/) of the CNGBdb database with the accession number CNP0000605, as well as the NCBI database with the Bioproject ID PRJNA556735 (Aptenodytes patagonicus: SAMN12384866; Eudyptes chrysolophus chrysolophus: SAMN12384869; E. c. schlegeli: SAMN12384870; E. chrysocome: SAMN12384872; E. filholi: SAMN12384873; E. moseleyi: SAMN12384871; E. pachyrhynchus: SAMN12384875; Eudyptes robustus: SAMN12384876; E. sclateri: SAMN12384874; Eudyptula minor albosignata: SAMN12384880; E. m. minor: SAMN12384879; E. novaehollandiae: SAMN12384878; Megadyptes antipodes antipodes: SAMN12384877; Pygoscelis antarctica: SAMN12384868; P. papua: SAMN12384867; Spheniscus demersus: SAMN12384881; S. humboldti: SAMN12384883; S. magellanicus: SAMN12384882; S. mendiculus: SAMN12384884. Data from all of the penguin species are also available from the GigaScience GigaDB database [127].

Additional files

Supplementary Figure 1: Phylogenomic trees.

Supplementary Table 1: Sampling and permitting details of all penguin samples tested.

Supplementary Table 2: Assemblers and Kmer sizes used for each penguin.

Supplementary Table 3: Information of 71 avian transcriptomic samples downloaded from NCBI.

Abbreviations

BLAST: Basic Local Alignment Search Tool; bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; CNSA: CNGB Nucleotide Sequence Archive; ExaML: Exascale Maximum Likelihood; Gb: gigabase pairs; kb: kilobase pairs; KEGG: Kyoto Encyclopedia of Genes and Genomes; LINE: long interspersed nuclear element; LTR: long terminal repeat; Mb: megabase pairs; NCBI: National Center for Biotechnology Information; ORF: open reading frame; RAxML: Randomized Axelerated Maximum Likelihood; SINE: short interspersed nuclear element; TRF: Tandem Repeat Finder; UCSC: University of California Santa Cruz.

Ethics approval and consent to participate

All samples were obtained under valid animal ethics permits.

Competing interests

The authors declare that they have no competing interests.

Funding

This project was supported by the National Key R&D Program of China (MOST) grant 2018YFC1406901 and by the Science, Technology and Innovation Commission of Shenzhen Municipality grant No. JCYJ20170817150721687 and JCYJ20170817150239127. T.L.C. was supported by an Otago University postgraduate publishing bursary. G.Z. was supported by the Lundbeckfonden (grant No. R190–2014-2827), Carlsbergfondet (grant No. CF CF16–0663), the Villum Foundation (grant No. 25900), and by the Strategic Priority Research Program of the Chinese Academy of Science (grant No. XDB13000000, XDB31020000). M.T.P.G. was supported by the ERC Consolidator Grant 681396 “Extinction Genomics”.

Authors’ contributions

G.Z. developed the concept; G.Z., D.-X.Z., T.L.C., and H.P. designed the project and wrote the manuscript; L.S.A., J.L.B., M.F.B., P.D.B., T.L.C., Y.C., P.D., U.E., S.R.F., S.G., D.M.H., P.H., T.H., E.K., K.L., G.M., T.M., L.J.N., P.P., P.G.R., D.R.T., H.T., and M.J.Y. collected and/or provided samples; J.L.B., T.L.C., A.H.R., T.H., K.J., B.M., T.S., D.R.T., and G.Z. facilitated sample collection; H.P., S.R.F., M.R.E., M.-H.S.S., and G.P. undertook laboratory work. H.P., X.B., M.F., C.Z., and Z.Y. undertook the bioinformatics work; G.Z., T.L.C., H.P., D.T.K., C.-A.B., M.R.E., P.G.B., M.T.P.G., T.H., J.F.M., R.A.P., A.J.D.T., L.D.S., M.-H.S.S., and P.Q. helped design sampling and project directions. All authors contributed to the final manuscript.

ACKNOWLEDGEMENTS

We thank the following: John Cockrem, Scott Flemming, Helen McConnell, Chris Rickard, Sarah Fraser, Otto Whitehead, Kyle Morrison, and Amy Van Buren for help collecting samples; Jonathan Banks, Kirsten Rodgers, and Jo Hiscock for sample information; Manuel Paredes Oyarzún and Hernán Rivera Meléndez for facilitating permits and sample collection; Lauren Tworkowski, Richard O'Rorke, and Joanna Sumner for facilitating sample collection; Adrian Smith for providing laboratory support to extract 2 DNA samples; Peter Dearden, Neil Fowke, Michael Knapp, Hoani Langsbury, Claire Porima, Nic Rawlence, Paul Scofield, Jonathan Waters, Janet Wilmshurst, and Jamie Wood for informal discussions regarding the New Zealand indigenous consultation; The NewZealand Department of Conservation for facilitating New Zealand indigenous consultation and approving permits, particularly Neil Fowke and Jesse Mason for facilitating permits and/or obtaining past permit details; Brett Gartrell and Pauline Nijman for providing animal ethics details; and the China National Genebank for contributing the sequencing resources for this project. The Penguin Genome Consortium welcomes participation and collaboration for our ongoing work regarding comparative and evolutionary genomics of penguins.

References

1.

Ksepka
DT
,
Bertelli
S
,
Giannini
NP
.
The phylogeny of the living and fossil Sphenisciformes (penguins)
.
Cladistics
.
2006
;
22
(
5
):
412
41
.

2.

Cole
TL
,
Waters
J
,
Shepherd
LD
, et al. .
Ancient DNA reveals that the ‘extinct’ Hunter Island penguin (Tasidyptes hunteri) is not a distinct taxon
.
Zool J Linn Soc
.
2018
;
182
(
2
):
459
64
.

3.

Cole
TL
,
Ksepka
DT
,
Mitchell
KJ
, et al. .
Mitogenomes uncover extinct penguin taxa and reveal island formation as a key driver of speciation
.
Mol Biol Evol
.
2019
;
36
(
4
):
784
97
.

4.

Challies
CW
,
Burleigh
RR
.
Abundance and breeding distribution of the white-flippered penguin (Eudyptula minor albosignata) on Banks Peninsula, New Zealand
.
Notornis
.
2004
;
51
(
1
):
1
6
.

5.

Grosser
S
,
Rawlence
NJ
,
Anderson
CNK
, et al. .
Invader or resident? Ancient-DNA reveals rapid species turnover in New Zealand little penguins
.
Proc Biol Sci
.
2016
;
283
(
1824
):
20152879
.

6.

Mattern
T
,
Wilson
K-J
.
New Zealand penguins – current knowledge and research priorities
.
A report compiled for Birds New Zealand
.
2018
, .

7.

Banks
J
,
Van Buren
A
,
Cherel
Y
, et al. .
Genetic evidence for three species of rockhopper penguins, Eudyptes chrycosome
.
Polar Biol
.
2006
;
30
(
1
):
61
67
.

8.

Frugone
M-J
,
Lowther
A
,
Noll
D
, et al. .
Contrasting phylogeographic pattern among Eudyptes penguins around the Southern Ocean
.
Sci Rep
.
2018
;
8
(
1
):
17481
.

9.

Christidis
L
,
Boles
WE
.
Systematics and Taxonomy of Australian Birds
.
Canberra, Australia
:
CSIRO
;
2008
:
98
.

10.

Cole
TL
,
Rawlence
NJ
,
Dussex
N
, et al. .
Ancient DNA of crested penguins: Testing for temporal genetic shifts in the world's most diverse penguin clade
.
Mol Phylogenet Evol
.
2019
;
131
:
72
79
.

11.

Frugone
M-J
,
López
ME
,
Segovia
NI
, et al. .
More than the eye can see: Genomic insights into the drivers of genetic differentiation in Royal/Macaroni penguins across the Southern Ocean
.
Mol Phylogenet Evol
.
2019
;
139
:
106563
.

12.

Slack
KE
,
Jones
CM
,
Ando
T
, et al. .
Early penguin fossils, plus mitochondrial genomes, calibrate avian evolution
.
Mol Biol Evol
.
2006
;
23
(
6
):
1144
55
.

13.

Mayr
G
,
Scofield
RP
,
De Pietri
VL
, et al. .
A Paleocene penguin from New Zealand substantiates multiple origins of gigantism in fossil Sphenisciformes
.
Nat Commun
.
2017
;
8
(
1
):
1927
.

14.

Stonehouse
B
.
The general biology and thermal balances of penguins
.
Adv Ecol Res
.
1967
;
4
:
131
96
.

15.

Marchant
S
,
Higgins
PJ
.
Handbook of Australian, New Zealand and Antarctic Birds
. Vol.
1
,
Pt. B
.
Melbourne, Australia
:
Oxford University Press
;
1990
.

16.

Boersma
PD
.
Penguins as marine sentinels
.
Bioscience
.
2008
;
58
(
7
):
597
607
.

17.

,

Ropert-Coudert
Y
,
Hindell
MA
,
Phillips
R
,
De Broyer C
,
Koubbi P, Griffiths HJ, Raymond B, Udekem d'Acoz Cd', Van de Putte AP, Danis B, David B, Grant S, Gutt J, Held C, Hosie G, Huettmann F, Post A, Ropert-Coudert Y
et al. ., et al. .
,
Cambridge
,
Scientific Committee on Antarctic Research
et al. .
Biogeographic patterns of birds and mammals
. In:
The Biogeographic Atlas of the Southern Ocean. Scientific Committee on Antarctic Research
.
2014
:
364
87
.

18.

Baker
AJ
,
Pereira
SL
,
Haddrath
OP
, et al. .
Multiple gene evidence for expansion of extant penguins out of Antarctica due to global cooling
.
Proc Biol Sci
.
2006
;
273
(
1582
):
11
17
.

19.

Acosta Hospitaleche
C
,
Reguero
M
,
Scarano
A
.
Main pathways in the evolution of the Paleogene Antarctic Sphenisciformes
.
J South Am Earth Sci
.
2013
;
43
:
101
11
.

20.

Bertelli
S
,
Giannini
NP
.
A phylogeny of extant penguins (Aves: Sphenisciformes) combining morphology and mitochondrial sequences
.
Cladistics
.
2005
;
21
(
3
):
209
39
.

21.

Garcia Borboroglu
P
,
Boersma
PD
.
Penguins: Natural History and Conservation
.
Seattle, WA, USA
:
University of Washington Press
;
2013
:
328
.

22.

Thiébot
JB
,
Cherel
Y
,
Trathan
PN
, et al. .
Coexistence of oceanic predators on wintering areas explained by population-scale foraging segregation in space or time
.
Ecology
.
2012
;
93
(
1
):
12
130
.

23.

Woehler
EJ
,
Cooper
J
,
Croxall
JP
, et al. .
A Statistical Assessment of the Status and Trends of Antarctic and Sub-Antarctic Seabirds
.
Cambridge, UK
:
Scientific Committee on Antarctic Research
;
2011
.

24.

Goldsmith
R
,
Sladen
WJ
.
Temperature regulation of some Antarctic penguins
.
J Physiol
.
1961
;
157
:
251
62
.

25.

Ksepka
DT
,
Ando
T
.
Penguins past, present, and future: trends in the evolution of the Sphenisciformes
. In:
Dyke
G
,
Kaiser
G
, eds.
Living Dinosaurs
.
Oxford, UK
:
Wiley
;
2011
:
155
86
.

26.

Watson
M
.
Report on the Anatomy of the Spheniscidae Collected by HMS Challenger, During the Years 1873–1876
.
Edinburgh, UK
:
Neill and Co
;
1883
.

27.

Taylor
JRE
.
Thermal insulation of the down and feathers of pygoscelid penguin chicks and the unique properties of penguin feathers
.
Auk
.
1986
;
103
:
160
8
.

28.

Sivak
JG
.
The role of a flat cornea in the amphibious behaviour of the blackfoot penguin (Spheniscus demersus)
.
Can J Zool
.
1976
;
54
:
1341
5
.

29.

Sivak
JG
,
Millodot
M
.
Optical performance of the penguin eye in air and water
.
J Comp Physiol
.
1977
;
119
:
241
7
.

30.

Bowmaker
JK
,
Martin
GR
.
Visual pigments and oil droplets in the penguin, Spheniscus humboldti
.
J Comp Physiol A Neuroethol Sens Neural Behav Physiol
.
1985
;
156
:
71
77
.

31.

Meister
W
.
Histological structure of the long bones of penguins
.
Anat Rec
.
1962
;
143
:
377
87
.

32.

Raikow
RJ
,
Bicanovsky
L
,
Bledsoe
AH
.
Forelimb joint mobility and the evolution of wing-propelled diving in birds
.
Auk
.
1988
;
105
:
446
51
.

33.

Schreiweis
DO
.
A comparative study of the appendicular musculature of penguins (Aves: Sphenisciformes)
.
Smithsonian Contrib Zool
.
1982
;
341
:
1
46
.

34.

Frost
PGH
,
Siegfried
WR
,
Greenwood
PJ
.
Arterio-venous heat exchange systems in the Jackass penguin Spheniscus demersus
.
J Zool
.
1975
;
175
:
231
41
.

35.

Groscolas
R
.
Metabolic adaptations to fasting in emperor and king penguins
. In:
Davis
LS
,
Darby
JT
, eds.
Penguin Biology
.
San Diego, CA, USA
:
Academic
;
1990
:
269
96
.

36.

Cherel
Y
,
Gilles
J
,
Handrich
Y
,
Le Maho
Y
.
Nutrient reserve dynamics and energetics during long-term fasting in the king penguin (Aptenodytes patagonicus)
.
J Zool
.
1994
;
234
:
1
12
.

37.

Groscolas
R
,
Robin
JP
.
Long-term fasting and re-feeding in penguins
.
Comp Biochem Physiol A Mol Integr Physiol
.
2001
;
128
:
645
55
.

38.

Gauthier-Clerc
M
,
Le Maho
Y
,
Clerquin
Y
, et al. .
Seabird reproduction in an unpredictable environment: How King penguins provide their young chicks with food
.
Mar Ecol Prog Ser
.
2002
;
237
:
291
300
.

39.

Thouzeau
C
,
Le Maho
Y
,
Froget
G
, et al. .
Spheniscins, avian β-defensins in preserved stomach contents of the king penguin, Aptenodytes patagonicus
.
J Biol Chem
.
2003
;
278
:
51053
8
.

40.

Thomas
DB
,
Fordyce
RE
.
The heterothermic loophole exploited by penguins
.
Aust J Zool
.
2008
;
55
:
317
21
.

41.

Thomas
DB
,
McGoverin
CM
,
McGraw
KJ
, et al. .
Vibrational spectroscopic analyses of unique yellow feather pigments (spheniscins) in penguins
.
J Roy Soc Interface
.
2013
;
10
(
83
):
20121065
.

42.

Cairns
DK
.
Plumage colour in pursuit-diving seabirds: Why do penguins wear tuxedos?
.
Bird Behav
.
1986
;
6
(
2
):
58
65
.

43.

Croxall
JP
.
Energy costs of incubation and moult in petrels and penguins
.
J Anim Ecol
.
1982
;
177
94
.

44.

Barbraud
C
,
Weimerskirch
H
.
Emperor penguins and climate change
.
Nature
.
2001
;
411
(
6834
):
183
6
.

45.

Forcada
J
,
Trathan
PN
,
Reid
K
, et al. .
Contrasting population changes in sympatric penguin species in association with climate warming
.
Glob Change Biol
.
2006
;
12
(
3
):
411
23
.

46.

Fretwell
PT
,
Trathan
PN
.
Emperors on thin ice: Three years of breeding failure at Halley Bay
.
Antarct Sci
.
2019
;
31
(
3
):
133
8
.

47.

Trivelpiece
WZ
,
Hinke
JT
,
Miller
AK
, et al. .
Variability in krill biomass links harvesting and climate warming to penguin population changes in Antarctica
.
Proc Natl Acad Sci U S A
.
2011
;
108
(
18
):
7625
8
.

48.

Lynch
HJ
,
Naveen
R
,
Trathan
PN
, et al. .
Spatially integrated assessment reveals widespread changes in penguin populations on the Antarctic Peninsula
.
Ecology
.
2012
;
93
(
6
):
1367
77
.

49.

Mattern
T
,
Meyer
S
,
Ellenberg
U
, et al. .
Quantifying climate change impacts emphasises the importance of managing regional threats in the endangered yellow-eyed penguin
.
PeerJ
.
2017
;
5
:
e3272
.

50.

Heerah
K
,
Dias
MP
,
Delord
K
, et al. .
Important areas and conservation sites for a community of globally threatened marine predators of the Southern Indian Ocean
.
Biol Conserv
.
2019
;
234
(
1
):
192
201
.

51.

Li
C
,
Zhang
Y
,
Li
J
, et al. .
Two Antarctic penguin genomes reveal insights into their evolutionary history and molecular changes related to the Antarctic environment
.
Gigascience
.
2014
;
3
(
1
):
27
.

52.

Trucchi
E
,
Gratton
P
,
Whittington
JD
, et al. .
King penguin demography since the last glaciation inferred from genome-wide data
.
Proc Biol Sci
.
2016
;
281
(
1787
):
20140528
.

53.

Cristofari
R
,
Bertorelle
G
,
Ancel
A
, et al. .
Full circumpolar migration ensures evolutionary utility in the Emperor penguin
.
Nat Commun
.
2016
;
7
:
11842
.

54.

Cristofari
R
,
Liu
X
,
Bonadonna
F
, et al. .
Climate-driven range shifts of the king penguin in a fragmented ecosystem
.
Nat Clim Change
.
2018
;
8
(
3
):
245
.

55.

Le Bohec
C
,
Durant
JM
,
Gauthier-Clerc
M
, et al. .
King penguin population threatened by Southern Ocean warming
.
Proc Biol Sci
.
2008
;
105
(
7
):
2493
7
.

56.

Jenouvrier
S
,
Caswell
H
,
Barbraud
C
, et al. .
Demographic models and IPCC climate projections predict the decline of an emperor penguin population
.
Proc Natl Acad Sci U S A
.
2009
;
106
(
6
):
1844
7
.

57.

Jenouvrier
S
,
Holland
M
,
Stroeve
J
, et al. .
Projected continent-wide declines of the emperor penguin under climate change
.
Nat Clim Change
.
2014
;
4
(
8
):
715
8
.

58.

Boessenkool
S
,
Austin
JA
,
Worthy
TH
, et al. .
Relict or colonizer? Extinction and range expansion of penguins in southern New Zealand
.
Proc Biol Sci
.
2008
;
276
(
1658
):
815
21
.

59.

Clucas
GV
,
Dunn
MJ
,
Dyke
G
, et al. .
A reversal of fortunes: Climate change ‘winners’ and ‘losers’ in Antarctic Peninsula penguins
.
Sci Rep
.
2014
;
4
:
5024
.

60.

Younger
JL
,
Clucas
GV
,
Kooyman
G
, et al. .
Too much of a good thing; sea ice extent may have forced emperor penguins into refugia during the last glacial maximum
.
Glob Change Biol
.
2015
;
21
(
6
):
2215
26
.

61.

Subramanian
S
,
Beans-Picón
G
,
Swaminathan
SK
, et al. .
Evidence for a recent origin of penguins
.
Biol Lett
.
2013
;
9
(
6
):
20130748
.

62.

Gavryushkina
A
,
Heath
TA
,
Ksepka
DT
, et al. .
Bayesian total evidence dating reveals the recent crown radiation of penguins
.
Syst Biol
.
2017
;
66
(
1
):
57
73
.

63.

Grosser
S
,
Burridge
CP
,
Peucker
AJ
, et al. .
Coalescent modelling suggests recent secondary-contact of cryptic penguin species
.
PLoS One
.
2015
;
10
(
12
):
e0144966
.

64.

Vianna
JA
,
Noll
D
,
Mura-Jornet
I
, et al. .
Comparative genome-wide polymorphic microsatellite markers in Antarctic penguins through next generation sequencing
.
Genet Mol Biol
.
2017
;
40
(
3
):
676
87
.

65.

Ramos
B
,
González-Acuña
D
,
Loyola
DE
, et al. .
Landscape genomics: natural selection drives the evolution of mitogenome in penguins
.
BMC Genomics
.
2018
;
19
:
53
.

66.

Clucas
GV
,
Younger
JL
,
Kao
D
, et al. .
Dispersal in the sub-Antarctic: King penguins show remarkably little population genetic differentiation across their range
.
BMC Evol Biol
.
2016
;
16
(
1
):
211
.

67.

Younger
JL
,
Clucas
GV
,
Kao
D
, et al. .
The challenges of detecting subtle population structure and its importance for the conservation of Emperor penguins
.
Mol Ecol
.
2017
;
26
(
15
):
3883
97
.

68.

Clucas
GV
,
Younger
JL
,
Kao
D
, et al. .
Comparative population genomics reveals key barriers to dispersal in Southern Ocean penguins
.
Mol Ecol
.
2018
;
27
(
23
):
4680
97
.

69.

Younger
J
,
Emmerson
L
,
Southwell
C
, et al. .
Proliferation of East Antarctic Adélie penguins in response to historical deglaciation
.
BMC Evol Biol
.
2015
;
15
(
1
):
236
.

70.

Zhao
H
,
Li
J
,
Zhang
J
.
Molecular evidence for the loss of three basic tastes in penguins
.
Curr Biol
.
2015
;
25
(
4
):
R141
2
.

71.

Zhang
G
,
Li
C
,
Li
Q
, et al. .
Comparative genomics reveals insights into avian genome evolution and adaptation
.
Science
.
2014
;
346
(
6215
):
1311
20
.

72.

Borges
R
,
Khan
I
,
Johnson
WE
, et al. .
Gene loss, adaptive evolution and the co-evolution of plumage coloration genes with opsins in birds
.
BMC Genomics
.
2015
;
16
:
751
.

73.

Jarvis
ED
,
Mirarab
S
,
Aberer
AJ
, et al. .
Whole genome analyses resolve early branches in the tree of life of modern birds
.
Science
.
2014
;
346
(
6215
):
1320
31
.

74.

Grosser
S
,
Scofield
RP
,
Waters
JM
.
Multivariate skeletal analyses support a taxonomic distinction between New Zealand and Australian Eudyptula penguins (Sphenisciformes: Spheniscidae)
.
Emu
.
2017
;
177
:
176
283
.

75.

Bi
K
,
Linderoth
T
,
Vanderpool
D
, et al. .
Unlocking the vault: Next‐generation museum population genomics
.
Mol Ecol
.
2013
;
22
(
24
):
6018
32
.

76.

Stiller
J
,
Zhang
G
.
Comparative phylogenomics, a stepping stone for bird biodiversity studies
.
Diversity
.
2019
;
11
(
7
):
115
.

77.

Edmunds
S
(
2018
):
HiSeq 4000 sequencing protocol. protocols.io
. http://dx.doi.org/10.17504/protocols.io.q58dy9w.

78.

Huang
J
,
Liang
X
,
Xuan
Y
, et al. .
A reference human genome dataset of the BGISEQ-500 sequencer
.
Gigascience
.
2017
;
6
(
5
):
1
9
.

79.

Teh
BT
,
Lim
K
,
Yong
CH
, et al. .
The draft genome of tropical fruit durian (Durio zibethinus)
.
Nat Genet
.
2017
;
49
:
1633
41
.

80.

Heydari
M
,
Miclotte
G
,
Demeester
P
, et al. .
Evaluation of the impact of Illumina error correction tools on de novo genome assembly
.
BMC Bioinformatics
.
2017
;
18
:
374
.

81.

Luo
R
,
Liu
B
,
Xie
Y
, et al. .
SOAPdenovo2: An empirically improved memory-efficient short-read de novo assembler
.
Gigascience
.
2012
;
1
(
1
):
18
.

82.

Gnerre
S
,
Maccallum
I
,
Przybylski
D
, et al. .
High-quality draft assemblies of mammalian genomes from massively parallel sequence data
.
Proc Natl Acad Sci U S A
.
2011
;
108
(
4
):
1513
8
.

83.

Weisenfeld
NI
,
Kumar
V
,
Shah
P
, et al. .
Direct determination of diploid genome sequences
.
Genome Res
.
2017
;
5
:
757
67
.

84.

Simão
FA
,
Waterhouse
RM
,
Ioannidis
P
, et al. .
BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs
.
Bioinformatics
.
2015
;
31
(
19
):
3210
2
.

85.

Smit
AFA
,
Hubley
R
,
Green
P
.
RepeatMasker Open-4.0. 2013–2015.
RepeatMasker Home Page.
http://www.repeatmasker.org. Accessed on 1 June 2019
.

86.

RepeatMasker. RepeatMasker Home Page.

http://www.repeatmasker.org. Accessed on 1 June 2019.

87.

Benson
G
.
Tandem repeats finder: A program to analyze DNA sequences
.
Nucleic Acids Res
.
1999
;
27
(
2
):
573
80
.

88.

Smit
AFA
,
Hubley
RR
,
Green
PR
.
Open-1.0. 2008–2015
.
Seattle, WA, USA
:
Institute for Systems Biology
;
2008
.

89.

Bao
W
,
Kojima
KK
,
Kohany
O
.
Repbase Update, a database of repetitive elements in eukaryotic genomes
.
Mob DNA
.
2015
;
6
(
1
):
11
.

90.

Bird 10,000 Genomes (B10K) Project.

.

92.

Altschul
SF
,
Gish
W
,
Miller
W
, et al. .
Basic local alignment search tool
.
J Mol Biol
.
1990
;
215
(
3
):
403
10
.

93.

She
R
,
Chu
JS
,
Wang
K
, et al. .
GenBlastA: Enabling BLAST to identify homologous gene sequences
.
Genome Res
.
2009
;
19
(
1
):
143
9
.

94.

Birney
E
,
Clamp
M
,
Durbin
R
.
GeneWise and genomewise
.
Genome Res
.
2004
;
14
(
5
):
988
95
.

95.

Edgar
RC
.
MUSCLE: Multiple sequence alignment with high accuracy and high throughput
.
Nucleic Acids Res
.
2004
;
32
(
5
):
1792
7
.

96.

Wheeler
DL
,
Barrett
T
,
Benson
DA
, et al. .
Database resources of the National Center for Biotechnology Information
.
Nucleic Acids Res
.
2006
;
14
(
35, suppl 1
):
D5
12
.

97.

Silva
GG
,
Dutilh
BE
,
Matthews
TD
, et al. .
Combining de novo and reference-guided assembly with scaffold_builder
.
Source Code Biol Med
.
2013
;
8
(
1
):
23
.

98.

Grabherr
MG
,
Haas
BJ
,
Yassour
M
, et al. .
Full-length transcriptome assembly from RNA-Seq data without a reference genome
.
Nat Biotechnol
.
2011
;
29
(
7
):
644
.

99.

Boeckmann
B
,
Bairoch
A
,
Apweiler
R
, et al. .
The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003
.
Nucleic Acids Res
.
2003
;
31
(
1
):
365
70
.

100.

Jones
P
,
Binns
D
,
Chang
HY
, et al. .
InterProScan 5: Genome-scale protein function classification
.
Bioinformatics
.
2014
;
30
(
9
):
1236
40
.

101.

Kanehisa
M
,
Sato
Y
,
Furumichi
M
, et al. .
New approach for understanding genome variations in KEGG
.
Nucleic Acids Res
.
2018
;
47
(
D1
):
D590
5
.

102.

Ashburner
M
,
Ball
CA
,
Blake
JA
, et al. .
Gene ontology: Tool for the unification of biology
.
Nat Genet
.
2000
;
25
(
1
):
25
.

103.

Kozlov
AM
,
Aberer
AJ
,
Stamatakis
A
.
ExaML version 3: A tool for phylogenomic analyses on supercomputers
.
Bioinformatics
.
2015
;
31
(
15
):
2577
9
.

104.

Liu
L
,
Yu
L
,
Edwards
SV
.
A maximum pseudo-likelihood approach for estimating species trees under the coalescent model
.
BMC Evol Biol
.
2010
;
10
(
1
):
302
.

105.

Zhang
C
,
Rabiee
M
,
Sayyari
E
, et al. .
ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees
.
BMC Bioinformatics
.
2018
;
19
(
6
):
153
.

106.

Katoh
K
,
Standley
DM
.
MAFFT multiple sequence alignment software version 7: Improvements in performance and usability
.
Mol Biol Evol
.
2013
;
30
(
4
):
772
80
.

107.

Capella-Gutiérrez
S
,
Silla-Martínez
JM
,
Gabaldón
T
.
trimAl: A tool for automated alignment trimming in large-scale phylogenetic analyses
.
Bioinformatics
.
2009
;
25
(
15
):
1972
3
.

108.

Sackton
TB
,
Grayson
P
,
Cloutier
A
, et al. .
Convergent regulatory evolution and loss of flight in paleognathous birds
.
Science
.
2019
;
364
(
6435
):
74
8
.

109.

Stamatakis
A
.
RAxML version 8: A tool for phylogenetic analysis and post analysis of large phylogenies
.
Bioinformatics
.
2014
;
30
(
9
):
1312
3
.

110.

Paradis
E
,
Claude
J
,
Strimmer
K
.
APE: Analyses of phylogenetics and evolution in R language
.
Bioinformatics
.
2004
;
20
(
2
):
289
90
.

111.

Tipene‐Matua
B
,
Henaghan
M.
Establishing a Māori ethical framework for genetic research with Māori
. In:
Henaghan
M
, ed.
Genes, Society and the Future
.
Dunedin, New Zealand
:
Human Genome Research Project
;
2007
:
1
44
.

112.

Wilcox
PL
,
Charity
JA
,
Roberts
MR
, et al. .
A values‐based process for cross‐cultural dialogue between scientists and Māori
.
J R Soc N Z
.
2008
;
38
:
215
27
.

113.

Hudson
M
,
Milne
M
,
Reynolds
P
, et al. .
Te Ara Tika Guidelines for Māori research ethics: A framework for researchers and ethics committee members
,
New Zealand
,
Health Council of New Zealand
.
2010
.

114.

Galla
SJ
,
Buckley
TR
,
Elshire
R
, et al. .
Building strong relationships between conservation genetics and primary industry leads to mutually beneficial genomic advances
.
Mol Ecol
.
2016
;
25
(
21
):
5267
81
.

115.

New Zealand Biodiversity Action Plan 2016 – 2020
.
Department of Conservation
,
Department of Conservation
,
Wellington
.
2016
:
ISBN: 978-0-478-15095-7
.

116.

Waitangi Tribunal.

. Accessed on 27 July 2019

118.

Wong
PB
,
Wiley
EO
,
Johnson
WE
, et al. .
Tissue sampling methods and standards for vertebrate genomics
.
Gigascience
.
2012
;
1
(
1
):
8
.

119.

New Zealand Department of Conservation
.
Iwi/hapū/whānau consultation. https://www.doc.govt.nz/get-involved/apply-for-permits/iwi-consultation/. Accessed on 27 July 2019
.

120.

Greig
E
.
The Māori right to development and new forms of property
.
Ph.D. Thesis. University of Otago
;
2010
.

121.

National Human Genome Institute
.
Reaffirmation and Extension of NHGRI Rapid Data Release Policies: Large-scale Sequencing and Other Community Resource Projects
. .

122.

Toronto International Data Release Workshop Authors
.
Prepublication data sharing
.
Nature
.
2009
;
461
:
168
70
.

123.

Lindblad-Toh
K
,
Garber
M
,
Zuk
O
, et al. .
A high-resolution map of human evolutionary constraint using 29 mammals
.
Nature
.
2011
;
478
(
7370
):
476
82
.

124.

i5K Consortium
.
The i5K Initiative: Advancing arthropod genomics for knowledge, human health, agriculture, and the environment
.
J Hered
.
2013
;
104
(
5
):
500
600
.

125.

Koepfli
KP
,
Paten
B
Genome 10K Community of Scientists, et al. .,
Genome 10K Community of Scientists
The Genome 10K Project: A way forward
.
Annu Rev Anim Biosci
.
2015
;
3
(
1
):
57
111
.

126.

Wang
Y
,
Zhang
C
,
Wang
N
, et al. .
Genetic basis of ruminant headgear and rapid antler regeneration
.
Science
.
2019
;
364
(
6446
):
eaav6335
.

127.

Pan
H
,
Cole
T
,
Bi
X
, et al. .
High-coverage genomes of all extant penguin taxa
.
GigaScience Database
.
2019
. .

Author notes

Contributed equally to this work.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.