Abstract
We provide a new automated statistical method for DNA barcoding based on a Bayesian phylogenetic analysis. The method is based on automated database sequence retrieval, alignment, and phylogenetic analysis using a custom-built program for Bayesian phylogenetic analysis. We show on real data that the method outperforms Blast searches as a measure of confidence and can help eliminate 80% of all false assignment based on best Blast hit. However, the most important advance of the method is that it provides statistically meaningful measures of confidence. We apply the method to a re-analysis of previously published ancient DNA data and show that, with high statistical confidence, most of the published sequences are in fact of Neanderthal origin. However, there are several cases of chimeric sequences that are comprised of a combination of both Neanderthal and modern human DNA.
The identification of organic material through comparisons of DNA sequences from a sample to DNA sequences from a database is an important research tool in a number of scientific disciplines. In the zoological and ecological literature, identification of unknown specimens based on cytochrome oxidase I (COI) has become know as DNA barcoding (Floyd et al. 2002; Hebert et al. 2003; Remigio and Hebert 2003; Moritz and Cicero 2004). DNA barcoding has found a wide range of applications, from identification of specimens in conservation biology and molecular ecology to identification of birds that have collided with aircraft. A similar methodology is applied in metagenomics (Tringe and Rubin 2005,Venter et al. 2004; Rusch et al. 2007; Yooseph et al. 2007) where genomic sequences from environmental samples are obtained and compared to database sequences.
The topics of this article are the methodological issues relating to the assignment of DNA sequences to taxa represented in a sequence database. The classical procedure for such identification has been the use of Blast searches (Altschul et al. 1997). There are, however, at least three statistical problems associated with this: (1) Blast searches provide a score based on local alignments and not global alignments, leading to a loss of information; (2) Blast searches ignore the population genetic and phylogenetic issues associated with species identification; and (3) the measures of confidence associated with Blast searches represent significance of local sequence similarity and not significance of taxonomic assignment. Blast thus offers no information to help researchers choose among multiple close matches. Whereas the local alignment problem can be circumvented using global alignments, the remaining two problems cannot be addressed without a statistical evaluation of the phylogenetic associations among species.
Several new methods have been developed that attempt to address the problems associated with the use of Blast to identify sequences (Matz and Nielsen 2005 Steinke et al. 2005; Nielsen and Matz 2006; Abdo and Golding 2007); most of these methods focus on identifying species affiliation. This question is difficult to address as the evolutionary relationship among genetic markers may not truly reflect the evolutionary relationship among species. In cases where reciprocal monophyly cannot safely be assumed, an analysis quantifying within- and between-species genetic variation forms a more correct basis of assignment. Such analyses, however, require a comprehensive database coverage that is generally not available to the biologist. In this article we describe a purely phylogenetic solution to the DNA barcoding problem. We will not address the species problem but instead attempt to devise an automated method for the assignment of sample sequences to taxa based on the position of the sample sequence in the phylogeny of life. This method leads to improved accuracy and, importantly, it provides a measure of statistical confidence associated with the barcoding assignment.
Methods
Sequences can be assigned to taxa using a number of different statistical frameworks. Here we pursue a Bayesian approach that allows us to estimate the probability that the sample sequence is part of a monophyletic group, identified with database sequences. We will thus not address the population genetic questions latent in species assignment but reduce the question to a purely taxonomic, or cladistic, question of assigning the sample sequence to a particular clade in an established phylogeny. The procedure is summarized graphically in Figure 1 and described in detail below.
Flowchart of the assignment procedure. A set of homologues is compiled using information from Blast searches and annotation from NCBI's Taxonomy database. The relevant sequences are retrieved from GenBank and aligned using ClustalW. Based on the resulting multiple alignment a large number of phylogenetic trees are sampled and these are then used to calculate posterior probabilities of assignment.
Flowchart of the assignment procedure. A set of homologues is compiled using information from Blast searches and annotation from NCBI's Taxonomy database. The relevant sequences are retrieved from GenBank and aligned using ClustalW. Based on the resulting multiple alignment a large number of phylogenetic trees are sampled and these are then used to calculate posterior probabilities of assignment.
In the Bayesian framework (e.g., Pawitan 2001), the relevant probability of interest is the posterior probability that the query species belong to a particular taxonomic group:
where X is the sample-sequence, Ti is taxon i, and D is the set of database sequences representing k disjoint groups. Because the denominator contains a sum over sequences represented in a database, the probability calculated using this approach is the probability of assignment to a taxonomic group given that the sequence has to be assigned to one of the groups represented in the database.The posterior probability involves a summation over all possible phylogenetic trees and, for each tree, a multiple integral over all combinations of substitution parameters. Hence, the posterior probability cannot be evaluated analytically. However, Markov chain Monte Carlo (MCMC; e.g., Huelsenbeck and Ronquist, 2001) can be used to sample trees in proportion to their posterior probabilities. The fraction of the time the MCMC sampler visits trees that place the sample sequence within a specific monophyletic group (X∈Ti) is a valid approximation of the posterior probability that the query sequence falls within that group.
Ideally, each sample sequence should be compared to the entire tree of life or as much of it as is represented in the available sequence database. For obvious reasons this is not possible, and a heuristic is required to extract a limited representation of the database. To this end we use sequence homology between the sample sequence and sequences obtained using remote Blast searches against GenBank. A taxonomic annotation for each homologue is retrieved from NCBI's taxonomy browser. Homologues with insufficient taxonomic annotation are disregarded.
The vast majority of taxa represented in the sequence database are not relevant to the analysis because the posterior probabilities of grouping monophyletically with these taxa are not appreciably large. The bulk of sequence homologues representing these taxa can be avoided by including only homologues with a Blast score of at least half that of the best matching homologue.
More often than not, however, this relative similarity cutoff does not reduce the number of sequence homologues to a set that can be handled computationally. To obtain the best possible taxonomic coverage in a limited set, only the best-matching sequence homologue for each species is included. If available, up to 30 different species homologues are included. If, at this point, the relative cutoff described above has not been reached, up to 20 homologues providing further taxonomic diversity are added progressively including up to 10 genera, six families, five orders, three classes, and two phyla in the set. If the relative cutoff is reached before 50 homologues have been included in the set, additional sequences are added for the species already represented in the set by including homologues previously rejected as suboptimal representatives for the species.
The analysis is discontinued if the compiled set does not include at least five Blast hits with an E-value below 0.1. An alignment of the sample sequence and the set of homologues is produced using ClustalW in slow/accurate mode with default parameters.
Like any other comparable method, our approach can only assign sequences to taxonomic groups represented in the database. Hence, if only a single taxon represents the clade in which the sample sequence belongs, the sample sequence will be assigned to this taxon with probability one. We have in our approach made no attempt to model the structure and sampling representation of the databases to evaluate the probability that the sequence truly belongs to some other taxon not represented in the database.
A computer program, written in C++ by J.P.H., performs the MCMC analysis. This program takes as input the sequence alignment and a file describing any constraints on the topology of the tree. The constraints are of the form of a backbone constraint. In other words, the constraint tree may include only a subset of the sequences included in the alignment. Here, all sequences except the sample sequence are included in a constraint tree specified by the taxonomic annotation. The program assumes that nucleotide substitutions occur according to the general time reversible model (2001) and assumes that the rate of substitution at a site is a random variable drawn from a mean-one gamma distribution (Yang 1993; Yang 1994). The Markov chain explores the space of all of the parameters of the model, including the substitution rates, nucleotide frequencies, gamma-shape parameter, and topology/branch lengths of the tree subject to the specified constraints. The proposal mechanisms for all of the non-tree parameters have been described elsewhere (e.g., Huelsenbeck et al. 2004). We propose new topologies using a stochastic variant of the SPR (subtree pruning and regrafting) tree perturbation often used to find optimal trees in a parsimony or maximum likelihood framework. Ten thousand unrooted trees sampled from the MCMC analysis are analyzed to obtain posterior probabilities of assignment to all taxa represented in the compiled set of homologues.
The retrieved taxonomic annotation is mapped onto each sampled tree by associating each clade in the tree with the taxon with lowest taxonomic rank that includes all sequences in the clade (see Fig. 2). By assuming the rooting implicit from the taxonomic annotation the sister clade to the sample sequence is identified. For some trees the position of the root relative to the sample sequence cannot be deduced from the taxonomic annotation. In these cases the taxonomic assignment of all sequences in the tree is recorded. The posterior probability of forming a monophyletic group with a given taxon is then calculated as the fraction of sampled trees where the sister clade to the sample sequence is a member of that taxon.
Assignment of the sample sequence in each sampled tree is done by assuming the root implied by the taxonomic annotation of homologues and then recording the consensus taxonomy for all members of the sister clade from the highest taxonomic level to the most specific level shared by all clade members.
Assignment of the sample sequence in each sampled tree is done by assuming the root implied by the taxonomic annotation of homologues and then recording the consensus taxonomy for all members of the sister clade from the highest taxonomic level to the most specific level shared by all clade members.
The posterior probability serves as a confidence measure associated with each assignment and has a straightforward statistical interpretation as the posterior probability that the assignment is correct given the available sequence information and a uniform prior on tree topology. Posterior probabilities are produced for all levels of taxonomic annotation. This allows the sample sequence to be assigned to a higher ranking taxon, such as genus or family, in cases where homology information is too ambiguous to allow a reliable assignment at the species level. The implementation of our approach, SAP (Statistical Assignment Package), generates scalable vector graphics summarizing assignment results. An example of this is shown in Figure 3.
Graphic representation of assignment. The taxonomic tree shows all taxa obtaining positive probabilities of assignment. For clarity, assignment probabilities below 50% are shaded. In the example shown, sequence evidence is substantial but too ambiguous to allow a reliable assignment at the species and genus level. The evidence at family level, however, is decisive.
Graphic representation of assignment. The taxonomic tree shows all taxa obtaining positive probabilities of assignment. For clarity, assignment probabilities below 50% are shaded. In the example shown, sequence evidence is substantial but too ambiguous to allow a reliable assignment at the species and genus level. The evidence at family level, however, is decisive.
The computational time to compile a homologue set relies heavily on a number of external factors such as the current response time of the online Blast server and bandwidth of the Internet connection for retrieval of sequences and annotation. On a 2-GHz Intel processor, the alignment of fifty 1000-bp sequences in ClustalW takes about 2 minutes. The sampling of trees amounts to about an hour and represents the bulk of the computational time for the full analysis. The post-processing of the MCMC output may take up to 10 minutes.
The software can be accessed at http://fisher.berkeley.edu/cteg/software/munch.
Results
Benchmarking
A benchmark analysis was carried out by assigning a data set of cytochrome oxidase I (COI) and tRNA-Leu (trnL) sequences to taxa. All COI entries for the class Insecta (true insects), and all trnL entries for the class Liliopsida (monocots) are downloaded from GenBank. Taxa represented by only one sequence in GenBank as well as database entries not explicitly targeting the relevant genes are not retrieved. The correct taxonomic annotation associated with each entry was downloaded from NCBI's Taxonomy database. From the 10,804 Insecta and 640 Liliopsida sequences, 500 are randomly chosen from each set to serve as test sample sequences. Taxonomic assignment of each sample sequence was performed as described, with the exception that the sample sequence itself was disregarded when identified as a homologue in GenBank.
The distribution of posterior probabilities associated with correct and wrong assignments are shown in Figure 4. At the levels of species, genus, and family, 90%, 99%, and 99% of assignments of Insecta sequences are correct and 51%, 90%, and 100% of assignments of Liliopsida sequences are correct. The false assignments generally have low probabilities and 86% of correct assignments of Insecta sequences and 60% of correct Liliopsida assignments have posterior probabilities above 0.95. The few false assignments primarily arise when lineage sorting disrupts the true phylogenetic relationship between taxa. False assignments may also arise when the correct taxon and one or more wrong taxa all obtain equally high assignment probabilities. In these cases, the small error in the estimation of assignment probabilities may cause that of a wrong taxon to be marginally greater than that of the correct one, resulting in an incorrect assignment. This problem, however, only affects assignments with probabilities below 0.5. A global alignment may not always constitute an optimal alignment of all homologues to the sample sequence so that the relative distances to the sample sequence are all represented correctly. However, only the part of each homologue corresponding to the sample sequence is submitted to the multiple alignment leaving little room for incorrect alignment. In addition, the clustering algorithm used by ClustalW assures that faulty alignment is least likely to occur between the most similar sequences in the multiple alignment. This minor source of error is therefore expected to mainly affect assignment in cases where the homology evidence is ambiguous and will thus rarely if ever affect unambiguous assignments based on probabilities over 90%. As a safeguard, the alignment is presented to the user together with the assignment results and should be inspected whenever possible.
Distributions of assignment probabilities for correct and wrong assignments. At the levels of species, genus, and family, 90%, 99%, and 99% of assignments of Insecta sequences are correct and 51%, 90%, and 100% of assignments of Liliopsida sequences are correct. Wrong assignments are generally associated with low probabilities, whereas most correct assignments achieve probabilities above 95%.
Distributions of assignment probabilities for correct and wrong assignments. At the levels of species, genus, and family, 90%, 99%, and 99% of assignments of Insecta sequences are correct and 51%, 90%, and 100% of assignments of Liliopsida sequences are correct. Wrong assignments are generally associated with low probabilities, whereas most correct assignments achieve probabilities above 95%.
To compare the performance of our approach to that of simple Blast searches, all sample sequences are assigned using new Blast searches. To our knowledge there is no canonical way to use Blast for taxonomic assignment. Here we use the taxonomic annotation associated with the best Blast hit to GenBank, disregarding matches to the sample sequence itself. Blast results were retrieved using remote Blast. In cases of equally high-scoring hits to multiple species, one of these was chosen at random to form the basis of assignment.
Figure 5 compares the two approaches by plotting the tradeoff between sensitivity and specificity in the range of most to least stringent assignment criteria used. Sensitivity is the fraction of sample sequences that are correctly assigned, whereas specificity is the fraction of accepted assignments that are correct. The posterior probability of assignment provided by SAP allows rejection of assignments that do not exceed a minimum assignment probability cutoff. Increasing the stringency of this assignment criterion imposes a more conservative sensitivity-specificity tradeoff. For Blast, the assignment criterion used was a maximum log(E-value) cutoff. The so called ROC plots in Figure 5 show how specificity of SAP can be raised at the expense of sensitivity by changing the assignment probability cutoff from zero to the maximal probability obtained in the analysis. For the Insecta set, sensitivity of Blast was almost identical to that of SAP when all assignments where accepted. For all other sensitivity-specificity combinations, however, the performance of SAP exceeded that of Blast. At the most permissive assignment criteria, the overlap in correct assignments of Insecta sequences was almost complete, with only 3% specific to SAP and 4% to Blast. For the Liliopsida set, the overlap was smaller, with 20% of correct assignments specific SAP and 14% to Blast. The proportion of wrong Blast assignments avoided as a function of posterior probability cutoff (Fig. 6) shows that a large proportion of wrong Blast assignments would be rejected using a stringent assignment criterion in our approach.
ROC (receiver operating characteristic) curves summarizing the tradeoff between sensitivity and specificity in the range of most to least stringent assignment criteria used. Sensitivity is the fraction of all sequences that are correctly assigned, specificity is the fraction of assignments that are correct. The performance of SAP exceeds that of Blast for any sensitivity-specificity combination except when blindly accepting all assignments.
ROC (receiver operating characteristic) curves summarizing the tradeoff between sensitivity and specificity in the range of most to least stringent assignment criteria used. Sensitivity is the fraction of all sequences that are correctly assigned, specificity is the fraction of assignments that are correct. The performance of SAP exceeds that of Blast for any sensitivity-specificity combination except when blindly accepting all assignments.
The proportion of wrong Blast assignments otherwise made that are avoided using different posterior probability cutoffs.
The proportion of wrong Blast assignments otherwise made that are avoided using different posterior probability cutoffs.
Even though the curves for E-value cutoffs correlate with specificity in the two different sets of sample-sequences, the E-value does not constitute a reliable measure of confidence. The E-value only reports the probability due to chance that there is another database hit with a sequence similarity score greater than the one obtained and offers no information on the relative confidence in taxonomic assignment to one of multiple hits with similar high E-value. Alternatively, a length-normalized bit-score could be used as assignment criterion. This would reflect the sequence identity of Blast hits. However, such a measure would be associated with the same problems.
The main source of wrong assignments using best hit from Blast was that Blast only evaluates local similarity to individual database homologues. Longer imperfect matches may obtain higher scores than shorter perfect matches leaving the homologue representing the correct taxon far down or off the list of Blast hits. This phenomenon also explains the striking difference in SAP's sensitivity between the Insecta and Liliopsida sets. For 31% of all Liliopsida assignments, this problem prevents the correct homologue from being included in the compiled set of homologues. To eliminate this source of error, the immense number sequences to which Blast produce hits would have to be downloaded and these hits would then have to be re-ranked based on individual global alignments to the sample sequence. Unfortunately, information on sequence homology in GenBank can only be accessed through Blast searches greatly hampering any database-driven assignment approach.
Reanalysis of Neanderthal Sequences
The fact that sequence similarity does not generally map well to assignment specificity prompts for a re-analysis of datasets where similarity has formed the basis of taxonomic inference such as in genetic barcoding and metagenomics. A less obvious but equally important application relates to the reconstruction of ancient DNA by tiling shorter sequence segments obtained using PCR or 454 pyro-sequencing. Traditionally, longer ancient DNA sequences are constructed by concatenating many short reads. Authenticity of the ancient DNA is then evaluated based on the concatenated sequence and not separately on the individual contributing fragments. Our objective here is to re-examine how much confidence can be assigned to each of the inferred sequences.
We have evaluated the probability of assignment of each of the PCR segments originally used to infer seven known mitochondrial Neanderthal sequences from hyper-variable region I: ElSidron (Lalueza-Fox et al. 2006), Feldhofer1 (Krings et al. 1997), Feldhofer2 (Schmitz et al. 2002), Mezmaiskaya (Ovchinnikov et al. 2000), MontiLessini (Caramelli et al. 2006), Sclandina (Orlando et al. 2006), and Vindija75 (Krings et al. 2000) as well as the two sequences from hyper-variable region II: Feldhofer1 (Krings et al. 1999) and Vindija75 (Krings et al. 2000). For the two Vindija75 sequences, however, the information on PCR sequences was not accessible. The five published sequences not in GenBank: RochersDeVilleneuve (Beauval et al. 2005), LaChapelleAuxSaints, Engis2, Vindija77, and Vindija80 (Serre et al. 2004), are very short and not produced using tiling of shorter segments.
To ensure a maximal coverage of Neanderthal diversity in the compiled set, it was assured that all unique Neanderthal sequences matching the query were included. The Neanderthal database sequence corresponding to the sample sequence was not included in the compiled alignments. The same number of unique homologues was allowed for all other species/subspecies represented. To accommodate the short length of sample sequences, no low complexity filtering was used in the Blast searches, the number of Blast hits considered each time was raised from 200 to 500, and the gap-open penalty in ClustalW alignments was raised to 25. These settings correspond to standard user options of SAP. Results from the analysis are summarized in Figure 7.
Summary of confidence analysis for published Neanderthal sequences. In each sub-figure, a bold bar represents the Neanderthal sequence analyzed. The overlapping boxes above it each represent the assignment probability of the sequence fragment spanned by the box. The dashed box represents the full inferred sequence, whereas shaded boxes represent individual contributing PCR fragments. For the Vindija75 sequences, no information on PCR fragments is available. The five short sequences not in GenBank obtain the following assignment probabilities: Engis2 (HypI): 0.88; LaChapelleAuxSaints (HypI): 0.88; RochersDeVilleneuve (HypI): 0.63; Vindija77 (HypI): 0.87; Vindija80 (HypI): 0.89.
Summary of confidence analysis for published Neanderthal sequences. In each sub-figure, a bold bar represents the Neanderthal sequence analyzed. The overlapping boxes above it each represent the assignment probability of the sequence fragment spanned by the box. The dashed box represents the full inferred sequence, whereas shaded boxes represent individual contributing PCR fragments. For the Vindija75 sequences, no information on PCR fragments is available. The five short sequences not in GenBank obtain the following assignment probabilities: Engis2 (HypI): 0.88; LaChapelleAuxSaints (HypI): 0.88; RochersDeVilleneuve (HypI): 0.63; Vindija77 (HypI): 0.87; Vindija80 (HypI): 0.89.
Whereas it is clear that most fragments with high probability are of Neanderthal origin, it is also clear that the confidence in some of them is low. More worrying, many of the fragments have close to zero percent posterior probability of being Neanderthal, suggesting that these fragments are in fact modern human contaminants. It is clear that the protocol used in ancient DNA studies of constructing larger sequences by concatenating many smaller sequences may easily lead to the artefactual production of chimeric sequences that are part Neanderthal and part modern human.
Discussion
To reliably assign DNA sequences to taxonomic groups, a measure of confidence in the assignment is required. The traditional use of Blast for identification does not yield any such information. In contrast, the posterior probability of assignment supplied by SAP has a straightforward statistical interpretation as a measure of confidence that allows the researcher to judge whether a reliable assignment can be made. The value of being able to reject unreliable assignments is emphasized by the fact that × 80% of wrong Blast assignments otherwise made for the Insecta and Liliopsida sets at species level are rejected using a 0.95 assignment probability cutoff. A further advantage of our approach is that all taxonomic levels are associated with individual measures of confidence. This makes it possible to make a reliable assignment to higher taxonomic levels where sequence information is not sufficient for a reliable species-level assignment.
The Bayesian approach to tree sampling required to obtain a statistically meaningful confidence in assignment is computationally demanding. The motivation for this work was reliability in assignment rather than speed. Nevertheless, to be able to use this approach on very large datasets, such as environmental samples, alternative tree sampling approaches must be explored. A possible alternative could be a modified neighbor-joining algorithm allowing topological constraints. However, although bootstrap scores for assignment most likely correlates with probabilities of assignment, they do not have the same probabilistic interpretation as posterior probabilities. One advantage of the Bayesian approach is that it allows for the possibility of using decision theory to devise criteria for assignment (Abdo and Golding 2007).
Even for genes used for genetic barcoding such as COI and trnL, each species is typically represented in the databases by only very few sequences. In addition, the sample sequence and the database sequences may stem from subpopulations with little ongoing gene flow. A rigorous population genetic analysis would involve assumptions about population structure and demography that may not be valid in general. With the limited amount of information available, a simpler phylogenetic approach, such as the one suggested here, may be an appropriate alternative. In this framework, within-species variability is not modeled and an assignment to a species-level taxon thus makes no implicit statement about how well the corresponding group of database sequences fits a species concept. However, as for other methods that are not based on explicit modeling the population genetics of the species in question, our method may also be misled by incomplete lineage sorting.
Common to all approaches for taxon assignment is the problem that no available reference database fully represents the tree of life. For many taxa the database coverage is still poor and this will lead to false assignments in cases where the correct taxon is not represented in the database. In the analyses presented here, GenBank was used as reference database but the approach can be used with any database of taxonomically annotated sequences. Access to a local version of GenBank or other exhaustive database would greatly increase the speed of the analysis.
In the example analysis provided here, we demonstrated that several published Neanderthal sequences are likely composed of both Neanderthal and modern human DNA. However, we emphasize that most of the published sequences seem to contain 100% Neanderthal DNA, and none of the sequences have more than a few small segments that are likely to be of modern human origin. Nonetheless, it would be desirable in future ancient DNA studies to provide confidence measures for each position in the sequence. Such measures could be obtained using the method described here.








