Haplogrep 3 - an interactive haplogroup classification and analysis platform

Abstract Over the last decade, Haplogrep has become a standard tool for haplogroup classification in the field of human mitochondrial DNA and is widely used by medical, forensic, and evolutionary researchers. Haplogrep scales well for thousands of samples, supports many file formats and provides an intuitive graphical web interface. Nevertheless, the currently available version has limitations when applied to large biobank-scale data. In this paper, we present a major upgrade to the software that adds (a) haplogroup summary statistics and variant annotations from various publicly available genome databases, (b) an interface to connect new phylogenetic trees, (c) a new state-of-the-art web framework for managing large-scale data, (d) algorithmic adaptations to improve FASTA classification using BWA-specific alignment rules and (e) a pre-classification quality control step for VCF samples. These improvements allow researchers to classify thousands of samples as before while providing additional ways to investigate the dataset directly in the browser. The web service and its documentation can be accessed freely without any registration at https://haplogrep.i-med.ac.at.


INTRODUCTION
Available phylogenetic representations of sequenced samples (so-called phylogenetic trees) make it possible to classify uniparentally transmitted haplotypes into haplogroups (1). Each phylogenetic haplogroup (or branch) within the tree is identified by a unique set of variants differing from a reference genome. The most recently updated phylogenetic tree for human mitochondrial genomes includes 6401 unique haplogroups (2) and is based on the well-known phylogenetic tree Phylotree (3). Previously, we developed Haplogrep to assign the best-matching haplogroup to mitochondrial input profiles in an automated way. Haplogrep works by traversing a phylogenetic tree, calculating the distance to each haplogroup using the Kulczynski metric and returning a sorted list of haplogroup hits for each sample (4). In 2016, we improved the software by integrating a rule-based system that assigns each sample a quality status, supporting new file formats and providing a command-line version, which has also been integrated into several workflow pipelines and online services (5). The set of included features, frequent updates of supported phylogenetic trees and our community support have made Haplogrep one of the most accurate and most widely used tools for haplogroup classification in mtDNA studies (6).
W264 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
Nevertheless, over the last years we have also identified shortcomings in the currently available version. First, genomic databases like gnomAD (7) provide a rich set of annotations, such as variant or population frequencies, for evaluating sites of interest. Many of these necessary downstream analysis steps are currently not directly supported within the application, making it unnecessarily complicated for researchers to analyse samples initially classified with Haplogrep. Second, modern technologies allow thousands of samples to be sequenced at decreasing cost, making biobank-scale datasets available to researchers. While Haplogrep's classification algorithm works well for many samples, the web application was not developed for screening thousands of samples directly in the browser and does not provide any summary statistics on the uploaded dataset. Third, while Haplogrep already supports numerous phylogenetic trees, the integration of new trees is tightly coupled to the software itself, making it complex for external users to add new trees. Fourth, input data nowadays come from heterogeneous resources (e.g. genotyping arrays (8), whole-genome sequenced samples (9), Sanger-sequenced data (10), long-read data (11)), which requires systematic quality control before haplogroup classification to avoid unexpected pitfalls.
To address these shortcomings, we developed Haplogrep 3. It includes all key features from previous versions, eliminates the shortcomings mentioned above and is available as a hosted web service, but can also be run graphically or on the command line in a private environment. This version will greatly improve how researchers analyse datasets, allowing them to identify input errors or spurious results as early as possible.

Webserver
Haplogrep 3 is a web application developed in Java using the Javalin web framework in combination with a template engine for Java. For visualization, we use the D3.js and Chart.js JavaScript libraries. The main algorithm of Haplogrep has been encapsulated in a library (haplogrep-core) that is integrated into the application as a dependency and has already been described in detail elsewhere (4). Haplogrep 3 provides several sub-commands (server, classify, distance, trees, install-trees) to end users, which allow them to run it locally or integrate it into automated pipelines. The server command starts a new web-server instance, classify runs haplogroup classification on the command line, distance calculates the distance between two input haplogroups, trees lists the currently installed trees and install-trees installs new phylogenetic trees. Unlike in previous versions, the same code base is now used for the web application and the command-line tool. Haplogrep 3 loads all required information from a configuration file in YAML syntax. This file includes basic settings such as the upload limit, the port and the provided test datasets, as well as the currently installed phylogenetic trees and the location of the phylogenetic tree repository.

Integration of phylogenetic trees
Starting with Haplogrep 3, phylogenetic trees are hosted in a separate repository (https://genepi.github.io/haplogrep-trees). Each tree comes with a set of files required for its integration into Haplogrep: (a) the tree in XML syntax, (b) variant weights for haplogroup classification, (c) hotspot locations not considered for phylogenetic interpretation, (d) annotation files from external databases, (e) the reference sequence (e.g. rCRS (12) or RSRS (13)) and the required BWA files and (f) a set of FASTA alignment rules. This architectural change allowed us to decouple Haplogrep from a specific phylogenetic tree and will simplify the integration of new trees in future releases. All installed trees are loaded at startup, and users can select one of the trees before classification starts.

FASTA improvement and alignment rules
To support FASTA as an input format, we integrated the BWA alignment software (14) as a JNI library. BWA left-aligns insertions and deletions (indels), whereas the currently available mitochondrial phylogenetic trees expect right-aligned indels. Indels are therefore not placed by BWA as the trees expect, which results in a lower overall haplogroup quality or even misclassification. To adjust for that, we integrated a set of currently 123 nomenclature rules that are applied by default prior to haplogroup classification. The rules were generated by comparing expected and remaining variants from haplogroup-defining samples of the updated Phylotree (2) in FASTA format. By comparing these variants for each input sample, we were able to create a list of rules that fix issues, especially for indel alignments. The list of rules is included as a file in each phylogenetic tree, making it adaptable in future releases. In addition, Haplogrep 3 has been improved to work with partial FASTA sequences (e.g. control region), a previous shortcoming of the software.
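The difference between left- and right-aligned indels can be illustrated with a short sketch. This is not the rule-based mechanism used by Haplogrep 3 (which applies a curated list of nomenclature rules); it only shows, under simplifying assumptions, why the same deletion inside a repeat can be reported at different positions. The function name `right_align_deletion` and the toy sequence are hypothetical.

```python
def right_align_deletion(ref: str, start: int, length: int) -> int:
    """Shift a deletion of `length` bases at 0-based `start` as far right
    as possible without changing the resulting sequence.

    Illustrative only: real mtDNA nomenclature follows additional
    conventions that a plain shift does not capture.
    """
    pos = start
    # The deletion can move one base to the right whenever the first deleted
    # base equals the base immediately following the deleted span.
    while pos + length < len(ref) and ref[pos] == ref[pos + length]:
        pos += 1
    return pos


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the sequence obtained by removing ref[start:start + length]."""
    return ref[:start] + ref[start + length:]


# Toy reference with an AC repeat; an aligner that left-aligns reports index 1.
toy_ref = "TACACACG"
left = 1
right = right_align_deletion(toy_ref, left, 2)
```

Both placements describe the same sequence, so a classifier that expects the right-aligned position would see an unexpected variant unless the position is translated first.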

Haplogroup clustering and search
Haplogrep 3 introduces a new clustering of haplogroups using the top-level haplogroups (or clusters) as defined by Phylotree and gnomAD. For each input sample or defining haplogroup of an available tree, we calculate the distance to each of the 33 top-level haplogroups and assign the cluster with the minimal distance. Using this new tree structure, users can search for haplogroups and variants directly within the application.
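The minimal-distance assignment can be sketched as follows. The text does not specify how the distance between two haplogroups is computed, so this example assumes, purely for illustration, that it is the number of edges on the path between them in the tree. The tree, the haplogroup names and the function names are hypothetical.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges between a and b via their lowest common ancestor.

    Assumption: edge count is used as the distance; the actual metric in
    Haplogrep may differ.
    """
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a root")


def nearest_cluster(haplogroup, clusters, parent):
    """Pick the cluster with minimal distance; ties are broken by name."""
    return min(clusters, key=lambda c: (tree_distance(haplogroup, c, parent), c))


# Toy tree given as child -> parent; "A" and "B" play the role of clusters.
toy_parent = {"A": "root", "B": "root", "A1": "A", "A1a": "A1", "B2": "B"}
toy_clusters = ["A", "B"]
```

With this toy tree, `nearest_cluster("A1a", toy_clusters, toy_parent)` selects `"A"`, because the path to `"A"` has two edges while the path to `"B"` has four.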

Variant annotation
For variant annotation, we use external data from gnomAD (7), MitImpact (15) and the Helix Mitochondrial database (16), which have been integrated into the application. The files have been downloaded and indexed so that variant details can be accessed in real time. Each provided phylogenetic tree package also carries a version number, which allows us to integrate possible future database updates in a reproducible way.

Analysis workflow
As in previous versions, users can upload their sample data in a text format (hsd), as FASTA or in the variant call format (VCF). The server assigns each sample the top haplogroup hits (currently 20) by traversing the selected phylogenetic tree. The input format of the samples is detected automatically by the software but can also be set manually. This is especially useful for VCF files, where users can specify whether input files originate from genotyping arrays or how heteroplasmic positions from next-generation sequencing data should be handled by the system. If the samples come from genotyping arrays, Haplogrep 3 automatically adapts the search range to the positions included in the VCF file. Before classification, users can select the Haplogrep metric used to calculate distances to haplogroups, the phylogenetic tree and whether additional output formats should be generated. Haplogrep then executes a quality control (QC) step and calculates summary statistics for all samples. After the actual haplogroup classification, an annotation step is executed and users are forwarded to the results, which include a summary dashboard for visual analytics (e.g. data grouped by top-level haplogroups including population information, QC statistics) and the sample details (see Figure 1).
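The ranking of haplogroup hits can be sketched with the general Kulczynski measure, which averages the fraction of a haplogroup's expected variants found in the sample and the fraction of the sample's variants explained by the haplogroup. This is a minimal sketch under stated assumptions, not the haplogrep-core implementation: variants are plain strings, weights default to 1.0, and the names `kulczynski` and `top_hits` as well as the toy tree are hypothetical.

```python
def kulczynski(sample, expected, weights=None):
    """Weighted Kulczynski similarity between two variant sets (0.0 to 1.0).

    Assumption: each variant contributes its weight (default 1.0) to the
    sums; how Haplogrep weights variants in detail is described in (4).
    """
    weights = weights or {}

    def total(variants):
        return sum(weights.get(v, 1.0) for v in variants)

    sample, expected = set(sample), set(expected)
    if not sample or not expected:
        return 0.0
    shared = total(sample & expected)
    return 0.5 * (shared / total(expected) + shared / total(sample))


def top_hits(sample, tree, n=20, weights=None):
    """Return up to n (haplogroup, score) pairs, best score first."""
    scored = [(name, kulczynski(sample, exp, weights)) for name, exp in tree.items()]
    # Sort by descending score; break ties alphabetically for stable output.
    scored.sort(key=lambda hit: (-hit[1], hit[0]))
    return scored[:n]


# Toy tree: haplogroup name -> set of expected variants (placeholders).
toy_tree = {"X": {"a", "b"}, "Y": {"a"}, "Z": {"c"}}
hits = top_hits({"a", "b"}, toy_tree, n=2)
```

In the toy example, haplogroup `X` explains both sample variants and scores 1.0, whereas `Y` explains only one of the two and scores 0.75.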

Summary dashboard
Haplogrep 3 includes a graphical summary dashboard with statistics on the sample classification. It provides a graphical overview of all individual haplogroups and their distribution across the 33 top-level haplogroups (see Materials and Methods). The dashboard also contains a table of all uploaded samples showing the population composition for each top-level haplogroup, with ancestry information provided by gnomAD. In addition, it offers all available export formats, including the option to download the sample QC report as a tab-delimited file. Unlike in previous versions, export formats are generated on the fly during classification and can be downloaded without any delay.

Sample overview
Besides the summary dashboard, the individual samples page includes details for each sample (e.g. number of expected variants, found variants, additional variants), the top 20 hits for each input sample and the analysed range. To investigate samples, Haplogrep 3 allows users to analyse each variant associated with a haplogroup by clicking on it. This gives users access to variant or population frequencies and lets them evaluate variants in more detail (see 'Interactive variant discoveries').

Phylogenetic trees
Haplogrep 3 supports the integration of external phylogenetic trees. We currently provide a set of five mtDNA trees within Haplogrep 3, all managed in separate Git repositories for reproducibility (https://genepi.github.io/haplogrep-trees). Each tree consists of a configuration file and a set of annotation files (see Materials and Methods). This architectural change allowed us to decouple Haplogrep from the phylogenetic trees and simplifies the integration of new trees for external users. While all trees are currently mtDNA specific, Haplogrep 3 is not limited to mtDNA, allowing other phylogenetic representations to benefit from its features. Beginning with this version, we also provide a new feature that makes phylogenies searchable for end users. Users can use Haplogrep 3 to navigate through all top-level haplogroups and access textual and graphical representations for each haplogroup, including expected variants.

VCF pre-classification quality control
VCF files can originate from different projects (e.g. microarray projects, whole-exome/whole-genome or long-read sequencing projects) and have often not undergone an external quality-control step before classification. This can result in a lower haplogroup quality or even in misclassification. The latest version of Haplogrep now integrates an initial validation step for VCF samples, which includes (a) an analysis of the uploaded sample file (e.g. number of samples, number of variants, overlap with the phylogenetic tree, monomorphic variants), (b) a check for possible strand flips and (c) statistics on the variant and sample call rates.
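The kind of statistics computed in such a validation step can be sketched on a small in-memory genotype table. This is an illustrative sketch only: it does not parse VCF, it does not reproduce Haplogrep's actual checks or thresholds, and the strand-flip test is a simple heuristic assumed for this example. All function names and the toy data are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_rates(genotypes):
    """genotypes: {position: [allele or None per sample]} (None = missing).

    Returns (variant call rate per position, sample call rate per sample).
    """
    n_samples = len(next(iter(genotypes.values())))
    variant_rate = {
        pos: sum(c is not None for c in calls) / n_samples
        for pos, calls in genotypes.items()
    }
    sample_rate = [
        sum(calls[i] is not None for calls in genotypes.values()) / len(genotypes)
        for i in range(n_samples)
    ]
    return variant_rate, sample_rate


def monomorphic_positions(genotypes):
    """Positions where all called samples carry the same allele."""
    return sorted(
        pos for pos, calls in genotypes.items()
        if len({c for c in calls if c is not None}) <= 1
    )


def possible_strand_flips(observed_alt, tree_alleles):
    """Heuristic: flag a position if the observed allele is unknown to the
    tree but its complement is known (assumption made for this sketch)."""
    return sorted(
        pos for pos, alt in observed_alt.items()
        if pos in tree_alleles
        and alt not in tree_alleles[pos]
        and COMPLEMENT.get(alt) in tree_alleles[pos]
    )


# Toy data: three positions, three samples.
toy_genotypes = {73: ["G", "G", None], 263: ["G", "A", "G"], 750: ["G", "G", "G"]}
variant_rate, sample_rate = call_rates(toy_genotypes)
flips = possible_strand_flips({73: "G", 150: "A"}, {73: {"G"}, 150: {"T"}})
```

In the toy data, position 73 and the third sample each have one missing call, positions 73 and 750 are monomorphic among the called samples, and position 150 is flagged as a possible strand flip.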

Interactive variant discoveries
Each phylogenetic tree includes a set of annotation files. Haplogrep 3 displays all available information directly in the web application when a specific variant is selected (see Materials and Methods). Access to variant frequencies allows users to evaluate each variant in detail and assess its clinical relevance using functional predictors from MitImpact (15). Haplogrep 3 also provides direct links to external resources for further investigation and allows users to export a set of predefined annotations.

Analysing 1000 genomes data with Haplogrep 3
The new web application allows users to investigate samples in more depth directly in the browser. Here, we analyse the data made available by the 1000 Genomes Project Phase 3 (17), which comprise a set of 2534 samples. All samples were classified in 88 seconds on the publicly available web service using the default parameters (see Supplementary Table S1). The newly created dashboard shows an overview of all samples and indicates that all samples pass QC. Samples are clustered by top-level haplogroup or by individual haplogroup. All 33 clusters as defined by Phylotree are represented. 11% of the uploaded samples include a warning, mainly because of a slightly decreased haplogroup quality (see Figure 2).
The details tab of Haplogrep 3 shows how a sample report is structured. It includes the top haplogroup hit, the expected mutations of the haplogroup and the remaining mutations of each sample, the top 20 additional haplogroup hits and the analysed sample range. It also includes frequencies for each variant (provided by publicly available databases) for further investigation (see Figure 3).

DISCUSSION
Published in 2010 and 2016, Haplogrep was the first fully automated haplogroup classification tool for mtDNA. The number of citations and downloads shows that Haplogrep is a critical tool for numerous research areas and supports studies of any size. With the upgrade presented in this paper, we go a step further and provide users with new tools for evaluating variants and new possibilities to look at their data. We also improved the FASTA alignment for partial samples, added a new QC step for VCF samples and decoupled the phylogenetic trees from the application itself. Since our original publication, many tools have been published that offer a similar set of features or provide functionality that was initially not available in Haplogrep. Nevertheless, none of the currently available tools includes the same set of features as Haplogrep. HaploGrouper (18) works for any kind of phylogenetic tree but supports only VCF and is not available as a graphical web service. MitoSuite (19) requires a local installation and works only for next-generation sequencing data in BAM format. HaploTracker (20) works well for partial FASTA sequences, a feature which has also been integrated in Haplogrep 3. HaploCart1.0 (21) uses a novel and promising pangenome reference structure, eliminating problems when the sample is identical to the rCRS, but is currently too computationally expensive to run on a large set of samples. MitoSuite (22) uses Haplogrep for mtDNA classification but also includes several tools to further investigate variants. Many of these annotations are now also available within Haplogrep 3.
While the integration of new trees is now simplified, the creation of phylogenetic trees can pose numerous challenges to end users. As Haplogrep has been a continuously developed service for >10 years, we are planning to provide automated ways to create the required files, which can then be connected to Haplogrep. Overall, we think that the new set of features will greatly improve and simplify work with Haplogrep and, for the first time, provide users with functionality for downstream analysis directly within the application.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.