Tracking and curating putative SARS-CoV-2 recombinants with RIVET

Abstract
Motivation: Identifying and tracking recombinant strains of SARS-CoV-2 is critical to understanding the evolution of the virus and controlling its spread. However, confidently identifying SARS-CoV-2 recombinants among the thousands of new genome sequences shared online every day is challenging, so many recombinants are missed or are formally identified only after weeks of expert curation.
Results: We present RIVET, a software pipeline and visual platform that takes advantage of recent algorithmic advances in recombination inference to comprehensively and sensitively search for potential SARS-CoV-2 recombinants and to organize the relevant information in a web interface that can greatly accelerate the process of identifying and tracking recombinants.
Availability and implementation: A RIVET-based web interface displaying the most up-to-date analysis of potential SARS-CoV-2 recombinants is available at https://rivet.ucsd.edu/. RIVET's frontend and backend code is freely available under the MIT license at https://github.com/TurakhiaLab/rivet, and the documentation for RIVET is available at https://turakhialab.github.io/rivet/. The inputs necessary for running RIVET's backend workflow for SARS-CoV-2 are available through a public database maintained and updated daily by UCSC (https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/).

By default, RIVET uses sensitive search parameters for RIPPLES (i.e., --branch-length 3 --num-descendant 5 --parsimony-improvement 3). These parameters require that a node have a branch length of at least 3 mutations and a minimum of 5 descendant tips to be considered for recombination. Additionally, the partial placement parsimony score must improve by at least 3 mutations for a node to be flagged as a potential recombinant.
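The thresholds above can be read as two successive filters on candidate nodes. The following is a minimal sketch of that logic, not the RIPPLES implementation: the `CandidateNode` record, its field names, and the two helper functions are hypothetical, and only the three threshold values come from the text.

```python
from dataclasses import dataclass

# Default thresholds quoted in the text above.
MIN_BRANCH_LENGTH = 3
MIN_DESCENDANTS = 5
MIN_PARSIMONY_IMPROVEMENT = 3


@dataclass
class CandidateNode:
    """Hypothetical record for a tree node; not a RIPPLES data structure."""
    node_id: str
    branch_length: int           # mutations on the branch leading to the node
    num_descendants: int         # descendant tips below the node
    original_parsimony: int      # parsimony score of the current placement
    best_partial_parsimony: int  # best score found via partial placement


def is_searchable(node: CandidateNode) -> bool:
    """A node is considered for recombination only if it passes both size filters."""
    return (node.branch_length >= MIN_BRANCH_LENGTH
            and node.num_descendants >= MIN_DESCENDANTS)


def is_flagged(node: CandidateNode) -> bool:
    """A searchable node is flagged if partial placement improves parsimony enough."""
    improvement = node.original_parsimony - node.best_partial_parsimony
    return is_searchable(node) and improvement >= MIN_PARSIMONY_IMPROVEMENT
```

For example, a node with a branch length of 4, 12 descendant tips, and a parsimony improvement of 3 would be flagged, whereas the same node with only 4 descendant tips would never enter the search.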

Estimating date of origin of recombinants and growth scores
Because recombinants discovered through RIPPLES correspond to internal nodes of the mutation-annotated tree (MAT), their origin or sampling date is not directly available through sequence metadata. However, if the user provides the sequence metadata, which contains the sampling date of each sequence in the MAT, RIVET also launches a parallel Chronumental process (Sanderson, 2021) to build a time tree from the MAT, from which dates for internal nodes can be estimated. On RIVET's frontend interface, users can sort the recombinant list by origin date to quickly review recombinants that have been inferred to have emerged recently.
Additionally, to help prioritize emerging recombinants of epidemiological interest for recombinant lineage identification and tracking, RIVET assigns each detected recombinant a growth score and outputs a ranked list of putative recombinants. The growth score, G(R), of a recombinant node R with a set of descendants S is defined in terms of two quantities: the number of months (30-day intervals) elapsed since the recombinant node R was inferred to have originated, and the number of months elapsed since each descendant sequence s in S was sampled. G(R) is computed for each detected recombinant R, and the final recombinant list is ranked by descending growth score.
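As an illustration of how such a score could be computed and used for ranking, the following sketch assumes a particular form for G(R): each descendant s contributes a weight of 2^(-month(s)), and the sum over S is divided by month(R) + 1. This form, the function names, and the input layout are all assumptions made for illustration and should be checked against the RIVET documentation; only the 30-day month definition and the descending ranking come from the text.

```python
from datetime import date


def months_elapsed(event_date: date, today: date) -> int:
    """Number of whole 30-day intervals between event_date and today."""
    return (today - event_date).days // 30


def growth_score(origin_date: date, descendant_dates: list, today: date) -> float:
    """ASSUMED form: sum of 2^(-month(s)) over descendants, divided by month(R) + 1."""
    recency = sum(2.0 ** -months_elapsed(d, today) for d in descendant_dates)
    return recency / (months_elapsed(origin_date, today) + 1)


def rank_recombinants(recombinants: dict, today: date) -> list:
    """Rank node ids by descending growth score.

    recombinants maps a node id to (origin_date, [descendant sampling dates]).
    """
    scores = {
        node_id: growth_score(origin, sampled, today)
        for node_id, (origin, sampled) in recombinants.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this assumed form, a recombinant that originated recently and has recently sampled descendants outranks an older recombinant whose descendants were all sampled months ago, even if the older one has more descendants in total.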

Efficient RIVET workflow parallelization on the Google Cloud Platform (GCP) and output files
The entire RIVET backend pipeline is contained within a public Docker image and can be parallelized across multiple servers on Google Cloud Platform (GCP). In the provided YAML configuration file, the user can specify the number of instances and the machine type used to run the RIVET job. By default, we run the workflow on two n2d-highcpu-32 instances. Upon initiation, RIVET loads the input mutation-annotated tree (MAT) and conducts a parallel search for long branches that will be considered in the recombination search. The long branches are then automatically partitioned uniformly across the specified number of GCP instances, and each instance searches its range of long branches for recombination events in parallel. Immediately upon completion of the search phase, an automated filtration pipeline begins on the instance to check each detected recombinant for potential sequencing and bioinformatic quality issues. Once every GCP instance has completed both the search and filtration steps, RIVET aggregates the results from all instances locally and ranks the recombinant results.
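The uniform partitioning step can be sketched as splitting the indices of the long branches into contiguous, near-equal ranges, one per instance. This is a minimal illustration under that assumption; the function name and the half-open range convention are hypothetical and are not taken from the RIVET code.

```python
def partition_ranges(num_long_branches: int, num_instances: int) -> list:
    """Split indices 0..num_long_branches-1 into contiguous [start, end) ranges.

    Range sizes differ by at most one; earlier instances absorb the remainder.
    """
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    base, extra = divmod(num_long_branches, num_instances)
    ranges = []
    start = 0
    for instance in range(num_instances):
        size = base + (1 if instance < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With the default of two instances, 10 long branches would be split into the ranges (0, 5) and (5, 10), and each instance would search only the branches in its own range.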

RIVET's frontend implementation details
The RIVET frontend is a Flask application (Grinberg, 2018) that loads and pre-processes the output files generated by RIVET's backend, which include a tab-delimited recombinant results file, a VCF file containing all the single-nucleotide variants (SNVs) of the trio sequences (recombinant, donor, acceptor), and a tab-delimited descendants file mapping each trio node id to its set of descendants. RIVET uses cyvcf2 (Pedersen and Quinlan, 2017), a Python wrapper around htslib (Bonfield et al., 2021), to enable fast parsing of the input trio VCF file.
The RIVET web interface displays the recombination results, ranked by growth score, in a table in which each row is a detected recombinant. To see the SNVs for a particular recombinant of interest, the user can select a row to dynamically render an interactive visualization, built using d3.js, that displays the SNVs of the selected recombinant and its two parents with respect to the SARS-CoV-2 reference (GenBank MN908947.3, RefSeq NC_045512.2). The plot shows all positions where at least one of the trio sequences contains a variant; however, the recombinant-informative sites, where the recombinant matches the donor or the acceptor sequence, are highlighted for clear visualization of the inferred breakpoint intervals. By clicking the available buttons, any view of the visualization can be downloaded in SVG format for high-quality, publication-ready figures, or copied and pasted directly into, for example, lineage proposal GitHub Issues. The SNV visualization also contains several built-in interactive features, such as the ability to query and download the descendants of a particular node in the trio by clicking the corresponding track label.
RIVET's web interface also integrates with phylogeny viewer tools, namely Nextstrain's Auspice (Hadfield et al., 2018) and Taxonium's Treenome browser (Sanderson, 2022; Kramer et al., 2023). To generate the Nextstrain Auspice view, RIVET selects a single random subtree (by default, a subtree containing 10 descendants) from the MAT and queries the UShER web API (UShER.bio) to return the corresponding subtree, which can be viewed in Auspice. For the Taxonium view, RIVET queries the Taxonium web API with the selected recombinant and its parental sequences using a custom-built Taxonium JSONL configuration file that is produced as an output of RIVET's backend pipeline using taxoniumtools (https://github.com/theosanderson/taxonium). The configuration file helps highlight the selected trio of sequences in the global phylogeny and colors the tips of the phylogeny by their Pango lineage classification annotated in the MAT.
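The highlighting of recombinant-informative sites in the SNV plot can be illustrated with a small sketch. It assumes that a site is informative when the donor and acceptor alleles differ and the recombinant matches exactly one of them, and that a breakpoint interval lies between two consecutive informative sites that match different parents. Both definitions, the function names, and the dictionary input are assumptions for illustration; the sketch works on plain Python data rather than on a parsed VCF.

```python
def informative_sites(snvs: dict) -> dict:
    """Map each informative position to the parent the recombinant matches.

    snvs maps a genome position to a (recombinant, donor, acceptor) allele tuple.
    """
    sites = {}
    for position in sorted(snvs):
        recombinant, donor, acceptor = snvs[position]
        if donor == acceptor:
            continue  # parents agree, so the site cannot distinguish them
        if recombinant == donor:
            sites[position] = "donor"
        elif recombinant == acceptor:
            sites[position] = "acceptor"
    return sites


def breakpoint_intervals(sites: dict) -> list:
    """Return (left, right) position pairs where the matching parent switches."""
    intervals = []
    previous_position = None
    previous_parent = None
    for position, parent in sites.items():
        if previous_parent is not None and parent != previous_parent:
            intervals.append((previous_position, position))
        previous_position, previous_parent = position, parent
    return intervals
```

In this sketch, a recombinant that matches the donor at positions 100 and 300 and the acceptor at positions 400 and 500 yields a single inferred breakpoint interval between positions 300 and 400.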