BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics

Abstract Genomics promises comprehensive surveying of genomes and metagenomes, but rapidly changing technologies and expanding data volumes make evaluation of completeness a challenging task. Technical sequencing quality metrics can be complemented by quantifying completeness of genomic data sets in terms of the expected gene content of Benchmarking Universal Single-Copy Orthologs (BUSCO, http://busco.ezlab.org). The latest software release implements a complete refactoring of the code to make it more flexible and extendable to facilitate high-throughput assessments. The original six lineage assessment data sets have been updated with improved species sampling, 34 new subsets have been built for vertebrates, arthropods, fungi, and prokaryotes that greatly enhance resolution, and data sets are now also available for nematodes, protists, and plants. Here, we present BUSCO v3 with example analyses that highlight the wide-ranging utility of BUSCO assessments, which extend beyond quality control of genomics data sets to applications in comparative genomics analyses, gene predictor training, metagenomics, and phylogenomics.


Background
The falling costs of DNA and RNA sequencing associated with new technologies mean that genomics-based approaches to addressing biological questions are becoming more widely accessible, a so-called 'democratization' of genomics. These sequencing data are subsequently processed by myriad different computational applications into genome or transcriptome assemblies and their annotations. Both the increase in the volume of data and the variable computational approaches to data processing make it increasingly important to be able to gauge the quality of the genomics data being produced. Most technologies and tools provide various metrics to help assess data quality throughout the process, allowing diligent researchers to iteratively tweak workflows and parameters to achieve the best results. However, most of these measures ignore the important complementary question of genomic data quality in terms of gene content completeness.
To address this, Benchmarking Universal Single-Copy Ortholog (BUSCO) assessments were designed to provide quantitative measures of the completeness of genome assemblies, annotated gene sets, and transcriptomes in terms of expected gene content (Simão et al. 2015), http://busco.ezlab.org. BUSCO assessments have been widely adopted by genomics research communities for data quality control procedures.
However, applications extend well beyond such procedures and include building robust training sets for gene predictors, selecting high-quality reference strains or species for comparative analyses, and identifying reliable markers for large-scale phylogenomics studies.
This flexible utility of BUSCO stems from the definition of an evolutionary expectation of gene content. That is, genes that make up the BUSCO datasets are selected from OrthoDB ) orthologous groups with single-copy orthologs in the majority of species for each major lineage. While allowing for rare gene duplications or losses, this establishes an evolutionarily informed expectation that these genes should be found as single-copy orthologs in any newly-sequenced genome as they are evolving under 'single-copy control' (Waterhouse et al. 2011).
This expectation means that if there are many BUSCOs that cannot be identified in a genome assembly or annotated gene set, it is possible that the sequencing and/or assembly and/or annotation approaches have failed to fully capture the complete expected gene content. Furthermore, the identification of large numbers of duplicated BUSCOs with high sequence identity could indicate that the genome assembly procedure has failed to collapse sequenced haplotypes. In this way, BUSCO provides a biologically meaningful completeness metric for genomics data quality control that complements technical measures like contig or scaffold N50 (a summary measure of assembly contiguity where half the genome is assembled into contigs/scaffolds of length N50 or longer).
BUSCO assessments identify complete, duplicated, fragmented, and missing genes, allowing users to quantitatively compare their data to those from gold-standard model organisms or other closely-related species, or to confirm whether their efforts to improve their assemblies or annotations have been effective. These features, and the ability to assess not just genome assemblies but also transcriptomes and annotated gene sets, are innovations that have been widely welcomed by the genomics community. This means that BUSCO has rapidly become well-established as an essential genomics resource for much more than just genome assembly completeness assessment, using up-to-date data across many different species clades from OrthoDB, and with many more applications than the previously very popular, but now discontinued, Core Eukaryotic Genes Mapping Approach, CEGMA ) (see also: http://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco).
In this study, we present a summary of the latest major BUSCO features along with example analysis scenarios that highlight the wide-ranging utility of BUSCO assessments designed primarily for (i) performing genomics data quality control, but also notably applicable for (ii) building gene predictor training sets, (iii) choosing reference strains or species for comparative genomics analyses, and (iv) selecting reliable phylogenomics markers. Waterhouse

BUSCO updates: enhanced features and extended datasets
BUSCO datasets were first defined using orthologs from OrthoDB v7 (Waterhouse et al. 2013) and were subsequently incorporated into the BUSCO assessment tool representing 3,023 near-universally conserved genes for vertebrates, 2,675 for arthropods, 843 for metazoans, 1,438 for fungi, 429 for eukaryotes, and 40 for prokaryotes (Simão et al. 2015). Addressing the growing demands from our users, we subsequently released BUSCO v2 that implemented improvements to the underlying analysis software as well as updated and extended datasets of BUSCOs covering additional lineages based on orthologs from OrthoDB v9 . The analysis software is distributed through GitLab (http://gitlab.com/ezlab/busco), it is also available as an Ubuntu virtual machine, and it has been integrated as an online service for logged-in users at www.orthodb.org. In response to requests from high-throughput users, we then focused on refactoring the code for the BUSCO v3 release to make it more flexible and extendable. This simplifies the setup and installation procedure and introduces the use of a configuration file that makes BUSCO easier to be integrated in high-throughput pipeline workflows.
The BUSCO assessment tool implements an open-source Python-based software to identify and classify near-universal single-copy ortholog matches from genome assemblies, annotated gene sets, or transcriptomes. It employs BLAST+ (Camacho et al. 2009) for sequence searches, HMMER (Eddy 2011) hidden Markov models for profile searches, and Augustus (Keller et al. 2011) for block-profile-based gene prediction. For genome assembly assessments, candidate genomic regions that could harbour BUSCO matches are first identified with tBLASTn searches using BUSCO consensus sequences. These consensus sequences are derived from hidden Markov model (HMM) profiles built from multiple sequence alignments of orthologs and capture the conserved alignable amino acids across the species set (even if some orthologs are incomplete annotations). This represents a key difference from the CEGMA dataset that contains protein sequences from each of six species, while the BUSCO datasets contain consensus protein sequences and variants, i.e. not the actual protein sequences but representations of the conserved alignable amino acids across the species set. This is designed to reduce any potential species bias as any species being assessed that is closely-related to one of the input species would have an unfair advantage if the input species sequences themselves were used instead of a consensus. From BUSCO v2 this search is enhanced using variant consensus sequences if the initial searches returned no complete matches.
Gene model prediction is then attempted for each of the identified regions using Augustus with BUSCO block profiles. The protein sequences of these predicted genes, Complete BUSCO matches must score within the expected range of scores and within the expected range of length alignments to the BUSCO HMM profile, determined by the scores and alignment lengths of the input orthologs themselves. Score thresholds have been optimised to identify true orthologs but if the ortholog is not present or is only partially present (highly fragmented) and a high-identity homolog is present then this homolog could be mistakenly identified as the complete BUSCO.
Matches labelled as fragmented score within the range of scores but not within the range of length alignments. This indicates incomplete transcripts or gene models from transcriptome or annotated gene set assessments. For assembly assessments this indicates that the gene is only partially present or that the sequence search and gene prediction steps failed to produce a full-length gene model. Such fragmented matches are given a 'second chance' with a second round of sequence searches and gene predictions with parameters trained on round one complete BUSCOs, but this can still fail to recover the whole gene. Thus some of these fragmented BUSCOs could be complete but are just too divergent or have very complex gene structures, making them very hard to locate and predict in full.
If classified as missing there were no significant matches or the matches scored below the range of scores for the BUSCO HMM profile. For transcriptome or annotated gene set assessments this indicates that these BUSCOs are missing or the transcripts or gene models are so incomplete that they could not even meet the criteria to be considered as fragmented matches. For assemblies this indicates either that these BUSCOs are missing, or that the sequence search step failed to identify any significant matches, or that the gene prediction step failed to produce even a partial gene model that might have been recognised as a fragmented BUSCO. BUSCOs missing after the first round are given a 'second chance' with a second round of sequence searches and gene predictions with parameters trained on round one complete BUSCOs, but this can still fail to recover the gene. BUSCOs missing from assemblies could be partially present or even possibly complete but they are just too divergent or have very complex gene structures, making them very hard to locate and predict correctly or even partially. Duplicated BUSCOs have more than one complete match that satisfies the criteria of scores and alignment lengths. Rare gene duplications can and do happen, and the selection of BUSCOs from orthologous groups with single-copy orthologs in the majority (>90%) of species for each major lineage means that some duplications are expected.
A high number of duplicates could suggest the presence of non-collapsed haplotypes (alleles) in a genome assembly, but BUSCO results simply report if multiple complete matches are found and thus further independent analyses by the user would be required in order to determine whether this is the case. This is particularly important if users are relying on BUSCO results to select from a variety of different assemblies produced with different parameters or tools or input data, i.e. duplicates should not be 'selected against' simply to achieve a lower BUSCO duplicate score. BUSCO v2 and v3 implement several improvements to the v1 codebase to make assessments faster as well as to provide more comprehensive information on the progress of the analyses and the reporting of any encountered errors or system warnings. For example, additional steps such as formatting the outputs for each predicted gene now take advantage of multiple threads/cores if this option is specified.
New optional arguments allow users to automatically compress large results folders, or to restart failed assessments from the last successfully completed step, e.g. to avoid having to unnecessarily rerun the often time-consuming initial tBLASTn searches. Users may now also pass additional optional arguments to Augustus, e.g. in order to select a non-canonical translation table for the gene predictions. Additionally, visualization of the assessment results is now facilitated with the BUSCO plotting tool that allows users to easily generate configurable bar charts using the ggplot2 R-package (Wickham 2009).
These new features and other options are described in detail in the updated user guide (http://busco.ezlab.org).
With the increased species sampling available at OrthoDB, and the requests from many users, BUSCO was extended to include many more assessment datasets representing additional lineages and providing higher resolution through larger lineage-specific BUSCO datasets. For example, as well as the bacteria-wide dataset, BUSCO now includes 15 additional lineage-specific datasets from Actinobacteria to Tenericutes, and the fungal datasets additionally comprise 9 lineage-specific datasets from Ascomycota to Sordariomyceta. Metazoa is now made up of 12 subsets including vertebrates and arthropods, and additional datasets have been built for nematodes, plants, and protists.
Lineages were selected based on their taxonomic range and coverage in terms of the numbers of available sequenced and annotated genomes, and future releases will include additional lineages for which species sampling becomes rich enough to build reliable BUSCO assessment datasets.

Genomics data quality control
Using universal single-copy orthologs to benchmark newly-sequenced and assembled genomes against those of gold-standard model organisms provides a comparative estimate of assembly quality in terms of gene content completeness. Similarly, BUSCO assessments can also help to guide iterative genome re-assemblies and/or reannotations towards quantifiable improvements. For example, comprehensive efforts to improve the initial draft genome assembly of the Postman butterfly, Heliconius melpomene, resulted in a 4% increase in the number of complete BUSCOs recovered (Davey et al. 2016). A similar improvement was achieved for the Atlantic cod, Gadus morhua, by combining sequencing data from Illumina, 454, and PacBio technologies and using a novel assembly reconciliation method (Tørresen et al. 2017).
As more initial draft genome assemblies are revised using additional sequencing and/or physical mapping data, BUSCO offers not only a metric to consider when selecting from the results of different assembly strategies, but also a complementary quantification of the improvements achieved in terms of expected gene content.
Comparing BUSCO results from assessing initial versus later versions of different genome assemblies and their annotated gene sets shows, in most cases, improvements in the number of complete BUSCOs recovered. The results of BUSCO genome and gene set assessments presented in Figure 1 (main text) use the updates to the chicken (Warren et al. 2017) and honeybee (Elsik et al. 2014) genomes and their annotations to quantify the improvements made after substantial efforts involving many different approaches and supporting datasets.

Gene predictor training sets
Accurate gene model prediction is a complex task, especially when supporting evidence (e.g. native transcripts or homologs from other species aligned to the assembly) is lacking and predictions are performed ab initio. Gene prediction tools such as Augustus (Keller et al. 2011), GENEID (Blanco et al. 2007), GeneMark (Borodovsky and Lomsadze 2011), and SNAP (Korf 2004) need to optimize their parameter configurations for each genome in order to achieve the best results. Optimization of most predictors is usually performed through training steps that use sets of high-quality gene model annotations. Such sets are usually derived from full-length mature messenger RNAs (sequenced from the same species as that of the genome to be annotated) that have been aligned to the assembly. Annotation pipelines that combine several different types of gene model predictions may also undertake training steps, e.g. MAKER (Campbell et al. 2014) can employ the outputs of preliminary annotation runs to automatically retrain and improve its gene prediction algorithm. BUSCOs, being generally widely-and well-conserved genes, offer ideal predefined sets for such training procedures, without the need to first perform RNA sequencing for building initial sets of high-quality gene models.
Examining the effects of using BUSCO-trained parameters on the quality of the resulting gene model annotations reveals improvements, several of which are rather substantial, over using parameters pre-trained on the closest available species (Figure 2 main text, and Figure S1). Each analysis included running Augustus ab initio gene prediction across the whole genome using (i) BUSCO-trained Augustus parameters, and (ii) parameters provided by Augustus for the tested species or the closest available species. The ab initio predicted gene models for each species were then compared to the latest Official Gene Set (OGS) annotations to quantify how well these ab initio genes matched those of each OGS. Additionally, one Augustus run was performed for each genome using the parameters generated by the BUSCO v3.0.0 runs described above. In total two sets of ab initio gene predictions were generated for each of the genomes, one using metaparameters generated by BUSCO v3.0.0 and another using parameters supplied by Augustus belonging to the same species as the genome (e.g. tomato for tomato) or in their absence the closest available relative (e.g. zebrafish parameters for Tetraodon pufferfish).
These predicted proteins were then aligned against the respective reference OGS annotations using BLASTp. For each predicted protein, a coverage score was computed based on how well each prediction aligned to the reference sequences, a coverage score of 100% meaning that every amino acid of a reference protein is found in the ab initio predicted protein with no insertions, deletions or substitutions. As BUSCO employs Augustus for performing gene predictions, running an assembly assessment automatically provides users with the Augustus parameter configurations trained on the first round 'Complete' BUSCO matches. BUSCO results also include general feature format (GFF) and GenBank-formatted gene models for all 'Complete' BUSCO matches, so users can employ these as inputs for training other gene predictors.

Comparative genomic analyses
The rapid growth in the numbers of available genome assemblies means that during the design stages of many comparative genomics studies researchers may be faced with difficult decisions regarding which representative strains or species to include. Total gene counts are not always a good proxy for completeness as selecting those with the highest numbers of genes does not guarantee the best quality: genomes with a large gene counts are not necessarily the most complete and those with fewer genes are not necessarily less complete (Waterhouse 2015).
The choice of representatives will almost certainly be influenced by considerations of taxonomic sampling, the extent and/or accuracy of functional annotations, the availability of functional genomics data pertinent to the analyses, or simply historical usage (e.g. previously designated reference strains/species). However, all else being equal, quantitative assessments with BUSCO offer logical and intuitive selection criteria to help focus on the most complete genomic resources available. Genome assembly completeness is compared here to assembly N50 (half the genome found in scaffolds of length N50 or longer), assembly L50 (the number of longest scaffolds that span half the genome), and the number of annotated protein-coding genes. These results show: • Of the four Lactobacillus designated reference strains (RefSeq), only one achieves 100% completeness, the other three achieve good scores of above 97% but importantly several non-references (some with shorter N50s and higher L50s) display slightly better scores.
• The three reference Aspergillus strains differ markedly in their N50 values but all score above 97% completeness, nevertheless there are several non-references with good scores and good contiguity metrics.

Reliable phylogenomics markers
Estimating true phylogenetic relationships among organisms is a prerequisite to almost any evolutionary study and usually employs mathematical methods to infer evolutionary histories based on evidence in the form of features from extant species (Delsuc et al. 2005). As the numbers of such features increase through the availability of whole genome and/or transcriptome sequences, or targeted genomic regions of interest e.g.
Powerful as this scaling up may be, care must be taken to account for potentially confounding effects from missing data, variable sequence divergence rates or compositions, incomplete lineage sorting or introgression, etc. to reach a confident conclusion. Recent notable examples include extensive transcriptomics to increase species sampling to examine the evolution of insects (Misof et al. 2014;Peters et al. 2017) and spiders (Fernández et al. 2014), and whole genome sequencing to build a well-supported avian phylogeny (Jarvis et al. 2014) and explore gene flow in mosquitoes (Fontaine et al. 2015).
Being near-universal single-copy genes, BUSCOs represent predefined sets of reliable markers where assessments of genomes, annotated gene sets, and/or transcriptomes can identify shared subsets from different types of genomic data for comprehensive phylogenomics studies. For example, phylogenomic analyses confirming that odonates (dragonflies and damselflies) are a sister lineage to neopterans employed single-copy orthologs of banded demoiselle genes from representatives of seven other insect orders as well as BUSCO-identified orthologs from two additional damselfly species for which only transcriptome data were available ).
The analysis of seven annotated rodent genomes together with five rodent transcriptomes illustrates the use of BUSCO for phylogenomics studies (Figure 3, main text). The BUSCO assessments identify conserved single-copy marker genes/proteins that are then used as input data for building a superalignment from which to estimate the species phylogeny. Analyses with the higher-resolution lineage datasets take longer to run but they identify many more single-copy orthologs from the genomes and transcriptomes to use for inferring the species phylogeny.
In each case, the superalignment of these BUSCOs used for phylogenetic inference produces a maximum likelihood phylogeny that agrees with previously published results (Huchon et al. 2007)  In this analysis we deliberately did not pre-filter transcriptomes to maintain only one transcript per gene, this often leads to large proportions of discarded duplicate BUSCOs as seen on Figure 3 (main text), but it demonstrates how a large clade-specific BUSCO dataset can identify enough genes from genomic data of variable quality, thus avoiding the burden of manual and possibly biased selection of orthologs. This example illustrates the utility of BUSCO assessments to relatively quickly and easily identify reliable single-copy markers from different types of genomic data for use in large-scale phylogenomics studies.
The rodent example above used BUSCO in protein mode on the annotated gene sets and in transcriptome mode on the species for which only transcripts were available.
Using BUSCO in genome mode is also a useful approach to identify universal singlecopy markers, e.g. to reconstruct the yeast phylogeny of Saccharomycotina from available genome data, BUSCO assessments of 96 genomes identified more than 1,000 markers for phylogenomic inference (Shen et al. 2016).