Soup to Tree: The Phylogeny of Beetles Inferred by Mitochondrial Metagenomics of a Bornean Rainforest Sample

In spite of the growth of molecular ecology, systematics and next-generation sequencing, the discovery and analysis of diversity is not currently integrated with building the tree-of-life. Tropical arthropod ecologists are well placed to accelerate this process if all specimens obtained through mass-trapping, many of which will be new species, could be incorporated routinely into phylogeny reconstruction. Here we test a shotgun sequencing approach, whereby mitochondrial genomes are assembled from complex ecological mixtures through mitochondrial metagenomics, and demonstrate how the approach overcomes many of the taxonomic impediments to the study of biodiversity. DNA from approximately 500 beetle specimens, originating from a single rainforest canopy fogging sample from Borneo, was pooled and shotgun sequenced, followed by de novo assembly of complete and partial mitogenomes for 175 species. The phylogenetic tree obtained from this local sample was highly similar to that from existing mitogenomes selected for global coverage of major lineages of Coleoptera. When all sequences were combined only minor topological changes were induced against this reference set, indicating an increasingly stable estimate of coleopteran phylogeny, while the ecological sample expanded the tip-level representation of several lineages. Robust trees generated from ecological samples now enable an evolutionary framework for ecology. Meanwhile, the inclusion of uncharacterized samples in the tree-of-life rapidly expands taxon and biogeographic representation of lineages without morphological identification. Mitogenomes from shotgun sequencing of unsorted environmental samples and their associated metadata, placed robustly into the phylogenetic tree, constitute novel DNA “superbarcodes” for testing hypotheses regarding global patterns of diversity.


Supplementary Data
. Raw contig length distribution for the Celera Assembler and IDBA-UD assemblies, and the nonredundant set generated by merging these in Geneious. The number of contigs longer than 15 kb that could be circularised in each case is shown within the corresponding circular arrow. Supplementary Information S1. Choice of BLAST alignment length cut-off when filtering contigs against MitoDB.
The 1 kb threshold was chosen as a good balance between the correct retention of contigs for the inclusion in later analyses, and the loss of shorter contigs that are truly mitochondrial but do not contribute to the final dataset. Determining what is a "true" mitochondrial contig without a closely related reference sequence can be challenging and we aim to include only the most plausible and informative contigs in the final dataset. To illustrate this we present an analysis based on the blastn results for the Celera Assembler contigs against MitoDB. Here we look at the alignment length as a proportion of contig length, with the assumption that true mitochondrial contigs are likely to have proportionately longer hits to the MitoDB sequences than nonmitochondrial contigs. Three different alignment length cut-offs are tested (100 bp; 500 bp; 1 kb). Figure A indicates that a short alignment length cutoff of 100 bp retains large numbers of contigs that are unlikely to be mitochondrial as the BLAST alignments to the database are short relative to their total length (0-40%). From Figure B we can see that those contigs are also, on average, quite short (<3kb). Including these contigs will likely increase downstream processing complexity with minimal gain in aligned data as most, if not all, will be non-mitochondrial. In contrast, Figure A indicates that alignment length cutoffs of 500bp and 1 kb give very similar alignment proportion profiles, except that with a 1 kb cutoff you retain overall a lower number of contigs, especially in the 90-100% aligned category. This suggests that there are a large number of contigs retained by the 500 bp cutoff that have long alignments relative to their lengths (i.e. are probably mitochondrial) but are <1kb in total length. This is reinforced by the low average length of the contigs in this category in Figure B. We can therefore estimate there are at least 250 mitochondrial contigs of 0.5-1kb from this assembly that we have missed from our analysis (this set has a mean length of 713 bp). Simultaneously, the 500 bp cutoff retains two contigs of 9kb with alignment proportions of <10%. BLASTing these on the NCBI website indicates that they are bacterial but there are no highly similar matches. Using a 1 kb cutoff has correctly filtered these out, but at the cost of losing a large number of short contigs. Given that our aim is to retain contigs that can ultimately be included in the phylogenetic analysis, the loss of these short contigs is not problematic for this study as they would not have met the 2 kb (including nad4l) criterion.

B
A Supplementary Information S2. The effect of quality control on the IDBA-UD assembly.
The IDBA-UD assembly statistics and mitochondrial contig length distributions following two different quality control procedures are outlined in the tables below. In addition, we made comparisons between the circular genomes from the two assemblies. Out of 45 mitogenomes found following QC-2, 39 are >99.9% identical to ones from QC-1. The remaining 6 are circular versions of >15kb linear contigs from QC-1 (at >99% identity i.e. they assemble slightly better under QC-2). 9 of the 10 remaining circular contigs from QC-1 are recovered >10kb at >99% identity under QC-2. The final one was recovered at 100% identity within the PCGs but there are differences in the control region. The contigs from QC-1 were retained for analysis.   Figure S3. Coverage plots derived from mapping quality controlled reads with SMALT at 98% identity.
A. Percentage of bases in non-redundant contigs for the given coverage (maximum of 50X shown). B. Percentage of bases in non-redundant contigs with at least the given coverage (maximum of 50X shown -13% of bases have >50X coverge). C. Mean coverage of mitochondrial contigs from the CA assembly plotted against contig length. D. Mean coverage of mitochondrial contigs from the IDBA-UD assembly plotted against contig length. E. Mean coverage of mitochondrial contigs from the non-redundant set plotted against contig length.  Supplementary Table S2) and 124 novel 10+ protein-coding gene contigs from this study. Reference sequences are labelled with accession number and species name. Novel contigs are labelled with the most precise available taxonomic identificationspecies labels with numbers (and any below familylevel without numbers) derive from morphospecies identifications based on bait sequences, whilst species labels with letters derive from contig position in the full beetle phylogeny (Figure 2), requiring monophyly with named reference sequences. Node labels are posterior probability values.
Supplementary Information S3. Failure to assign two contigs to correctly to family under requirement for monophyly with the Mito240 set.