Tracking SARS-CoV-2 mutations and variants through the COG-UK-Mutation Explorer

Abstract COG-UK Mutation Explorer (COG-UK-ME, https://sars2.cvr.gla.ac.uk/cog-uk/—last accessed date 16 March 2022) is a web resource that displays knowledge and analyses on SARS-CoV-2 virus genome mutations and variants circulating in the UK, with a focus on the observed amino acid replacements that have an antigenic role in the context of the human humoral and cellular immune response. This analysis is based on more than 2 million genome sequences (as of March 2022) for UK SARS-CoV-2 data held in the CLIMB-COVID centralised data environment. COG-UK-ME curates these data and displays analyses that are cross-referenced to experimental data collated from the primary literature. The aim is to track mutations of immunological importance that are accumulating in current variants of concern and variants of interest that could alter the neutralising activity of monoclonal antibodies (mAbs), convalescent sera, and vaccines. Changes in epitopes recognised by T cells, including those where reduced T cell binding has been demonstrated, are reported. Mutations that have been shown to confer SARS-CoV-2 resistance to antiviral drugs are also included. Using visualisation tools, COG-UK-ME also allows users to identify the emergence of variants carrying mutations that could decrease the neutralising activity of both mAbs present in therapeutic cocktails, e.g. Ronapreve. COG-UK-ME tracks changes in the frequency of combinations of mutations and brings together the curated literature on the impact of those mutations on various functional aspects of the virus and therapeutics. Given the unpredictable nature of SARS-CoV-2 as exemplified by yet another variant of concern, Omicron, continued surveillance of SARS-CoV-2 remains imperative to monitor virus evolution linked to the efficacy of therapeutics.


Introduction
As of March 2022, SARS-CoV-2, the causative agent of COVID-19, has accounted for over 450 million infections and 6 million deaths worldwide (https://covid19.who.int/). SARS-CoV-2 was first identified at the end of 2019 in the city of Wuhan, China, and has since spread with unprecedented efficiency among humans (Hu et al. 2021). In contrast to other RNA viruses, the Coronaviridae family is characterised by relatively high-replication fidelity due to the proofreading activity of their polymerases (Robson et al. 2020). Early analyses of SARS-CoV-2 genomes estimated an evolutionary rate of around 0.001 subs/site/year (two to three mutations per month) (Duchene et al. 2020); however, there is much deviation from this rate across the phylogeny with several outlier lineages, including variants of concern (VOCs), that have rapidly acquired several mutations at a much higher rate than this. The analysis of mutations from virus genome data is important for basic virology (Houldcroft et al. 2017), to identify evolutionary signals associated with mutations prior to experimental and real-world data on clinical outcomes or vaccine effectiveness, and to document and track changes that could alter the effectiveness of therapeutics. At present, almost 9 million genome sequences are now available via the GISAID Initiative, permitting near real-time surveillance of the unfolding pandemic (Shu and McCauley 2017;Meredith et al. 2020).
SARS-CoV-2 showed relatively inconsequential genetic change until late 2020 (MacLean et al. 2021). Subsequently, later months of 2020 were characterised by the emergence, across the globe, of VOCs possessing mutations that altered virus phenotype in terms of transmissibility and antigenicity (Harvey et al. 2021). Concurrently, shifts in the immune profile of the human population likely represented a change in the selective environment evidenced by an increase in dN/dS ratios indicative at positive selection at codons across the genome and notable levels of convergence across the global phylogeny (Martin et al. 2021). The continuing emergence of SARS-CoV-2 variants exhibiting heightened transmissibility or antigenic novelty necessitates tools to detect, describe, and track those antigenic changes and make this information accessible to researchers, public health agencies, and drug and vaccine developers so that the information becomes actionable.
This scientific need led us to create the COG-UK-Mutation Explorer (COG-UK-ME), a web resource that provides tracking of non-synonymous mutations in SARS-CoV-2 genome. COG-UK-ME is based on UK data, and it has been developed by the COVID-19 Genomics UK (COG-UK) consortium-created to deliver largescale and rapid whole-genome virus sequencing to local National Health Service centres and the UK government. COG-UK-ME relies on CLIMB-COVID, a data-centric bioinformatics environment for centralising UK SARS-CoV-2 sequences (Nicholls et al. 2021a). Here, we describe COG-UK-ME and its main functionality. COG-UK-ME currently has around 5,000 users per month, with approximately 30 per cent from the UK, 20 per cent from the USA, and the remainder from other international locations.
COG-UK-ME has three aims: first, to make available amino acid mutations in a user-friendly way, enabling data transparency; second, to report on amino acid variation present in SARS-CoV-2 sequences that have been shown to confer resistance against antibodies or disrupt T cell epitope binding. The third is to report on the emergence of new mutations that have the potential to reduce the effectiveness of some therapeutics that have been granted approval for use. Data accumulating over a time course can be analysed so that trends can be detected and tracked.

Data analysis
COG-UK-ME is a publicly accessible web resource that displays in-depth information and analyses of SARS-COV-2 virus genome mutations and variants. Sequence information is deposited daily on the MRC CLIMB-COVID platform (Nicholls et al. 2021a), which has been generated by the COG-UK Consortium, Wellcome Sanger Institute, public health agencies, and other approved providers. Virus lineages are assigned by using a phylogenetic framework to identify those lineages that contribute most to active spread (Rambaut et al. 2020;O'Toole et al. 2021). Mutations for UK sequences are then analysed on the CLIMB platform and linked with curated data on antigenicity, therapeutics, and drug resistance. The prepared data files are then transferred from CLIMB to a web server and visualised.

Tracking changes in the mutation count
COG-UK-ME shows a browsable dataset of all the amino acid sequence variations in SARS-CoV-2 protein sequences. These are shown for all data, and in the recent past-over the last 28 days-in the UK and in the four UK nations (England, Scotland, Wales, and Northern Ireland) ('Mutation Counts' and 'Mutations by week' tabs). The 'VOCs and VUIs in the UK' tab shows through tables and visualisations the number of sequences of variants under investigation (VUI) and VOCs as designated by the UK Health Security Agency (formerly Public Health England) (https:// www.gov.uk/government/publications/covid-19-variants-genom ically-confirmed-case-numbers/variants-distribution-of-casesdata-last accessed date: 16 March 2022) (Fig. 1A). COG-UK-ME also provides visualisations of the spike protein structure showing the position of the VOC-defining mutations (Fig. 1B). Data are also placed in their geographical context by showing the number of sequences and percentage of variants per region (Nomenclature of Territorial Units for Statistics-NUTS1) ('Geographical distribution' tab).

Spike profile tracking
In addition to tracking the frequency of individual substitutions across the genome and of lineages identified as VOCs or VUIs, changes in the frequency of combinations of spike amino acid substitutions are tracked. Each spike profile is defined as the combination of substitutions compared with the original genotype (Wuhan-Hu-1). Profiles may represent monophyletic lineages or they may have arisen convergently across the phylogeny. Changes in profile frequency over the latest 56-day period are considered. For currently circulating profiles (those sampled within the latest 7 days), a sortable and searchable table includes information on the pango lineage(s) for which the profile has been associated, the number of substitutions comprising the profile, and the count of sequences across the latest 56-day and 28-day periods. The average growth rate (plotted on the y-axis in Fig. 2A) is calculated as the mean percentage change in frequency between each 2-week period within the 56-day period. As growth rates are sensitive to potentially stochastic changes at very low frequencies, we also calculate a statistic that estimates recent expansion or contraction of each profile, calculated over the 56-day period (plotted on the y-axis in Fig. 2B). For each profile, i, the absolute value for this statistic, X i , is calculated using the observed frequency, O i,j , of each profile, i, in each of the most recent 2-week periods, j, according to where E i is the frequency of profile i over the full 8-week period under consideration. Thus, the value calculated is influenced by both the rate of change in profile frequency and the overall frequencies of a given profile and is more robust to stochastic The substitution D614G, which is shared by common descent by all lineage B.1 descendants is italicised. The visualisation is made using a complete spike model (Woo et al. 2020), which is in turn based upon a partial cryo-EM structure (RCSB Protein Data Bank (PDB) ID: 6VSB (Wrapp et al. 2020)).
differences in profile frequency that tend to occur at low frequencies.
This monitoring of spike profiles allows the detection of emerging, potentially advantageous, spikes that might not be detected by surveillance methods conditioned on mutations previously determined to be noteworthy through experimentation or other means. This simple approach is complementary to more sophisticated phylogenetic approaches for the estimation of lineage-specific growth rates. One advantage of this simple non-phylogenetic approach is that the convergent accumulation of a substitution or combination of substitutions on a particular background is identified. Such a scenario could arise when there is strong selective pressure on a genotype (e.g. the introduction of a therapeutic). For example, this approach would quickly alert to the growth of a profile such as Delta + E484K emerging convergently across the Delta phylogeny in response to within-host, immune-mediated selection, even if the instances of this profile are interspersed across the phylogeny.

Antigenic changes
The 'Antigenic changes' tab shows a table listing all mutations in the spike protein present in the UK sequence dataset that have individually been associated with some significant degree of weaker virus neutralisation by convalescent plasma, postvaccination sera, or SARS-CoV-2 spike-specific mAbs (referred to as 'Escape mutations' in Fig. 3). Alongside links to the associated literature for each substitution, a confidence score representing the weight of evidence associated with each substitution is shown: 'high', whenever the antigenic role of mutation is supported by multiple studies, including at least one that reports an effect observed with (post-infection serum) convalescent plasma; 'medium', if the antigenic role of the mutation is supported by multiple studies; and 'low', when the mutation is supported by a single study (Fig. 3). In the 'VOCs + Antigenicity' tab, COG-UK-ME reports the occurrence of additional amino acid substitutions or deletions linked to antigenic change within each VOC (Fig. 4). Relative proportions (expressed as percentages) of sequences carrying specific mutations can give information about the antigenic diversity within a VOC lineage.

T cell epitope mutations
Similar to the 'Antigenic changes' tab, the 'T cell epitope mutations' tab shows amino acid replacements in experimentally proven T cell epitopes both in spike and in other proteins, which have been described in the literature. Data are further filtered based on experimental studies just defining T cell epitopes ('Epitope studies') or those reporting on the impact of specific mutations on T cell recognition ('Reduced T cell recognition'). Also shown are predicted antigen presentation likelihood percentile rank values to the experimentally proposed HLA restriction element based on the NetMHCpan (CD8) and NetMHCIIpan (CD4) 4.1 algorithms (https://services.healthtech.dtu.dk/service.php? NetMHCpan-4.1 and https://services.healthtech.dtu.dk/service. php?NetMHCIIpan-4.0-last accessed date: 16 March 2022) Figure 3. Amino acid substitutions in the spike protein identified in the UK dataset (referred to as 'Escape mutations') that have been associated with weaker neutralisation of the virus by convalescent or post-vaccination plasma/serum or spike-specific monoclonal antibodies (mAbs) or that have been observed to emerge upon exposure to either mAbs or plasma in laboratory experiments. *High confidence (red) refers to an antigenic role supported by multiple studies, including at least one that reports an effect observed with (post-infection serum) convalescent plasma; 'medium' (orange), if the antigenic role is supported by multiple studies; and 'low' (yellow), when the mutation is supported by a single study. The boxes above the table enable filtering by multiple criteria. (Reynisson et al. 2020) for both the wild-type and mutant peptide variants. Here, peptides with predicted percentile rank scores of less than 2.0 for CD8 and less than 5.0 for CD4 are likely HLA binders. Amino acid replacements in any epitope are visualised through logo plots, in which each letter represents an amino acid replacement present in a specific epitope, and its height represents residue frequency. The number below the sequence logo shows the position relative to the start position of the epitope.

Drug resistance
The 'Drug resistance' and 'Ronapreve' tabs show tables and visualisations for those mutations associated with the resistance of SARS-CoV-2 to antiviral treatments (e.g. Remdesivir) and therapeutic mAb cocktails that are currently used in clinical settings (e.g. Ronapreve, cocktail of casirivimab and imdevimab) (Beigel et al. 2020;Sidebottom and Gill 2021). The UpSet plot in the Ronapreve tab allows users to track amino acid substitutions known to affect either casirivimab or imdevimab mAbs and in combination (Fig. 5). Other therapeutics will be added in the future.

Concluding remarks
Bioinformatics resources such as COG-UK-ME play an important role by providing clear and accessible information to those who are tackling the pandemic, including through public health actions and the development of vaccines and therapeutics. COG-UK-ME is unique in presenting data from a densely sequenced population with an emphasis on publicly available data (bioproject accession PRJEB37886 and public alignments https://www.cogconsortium.uk/tools-analysis/publicdata-analysis-2/-last accessed date: 16 March 2022). COG-UK-ME also brings together curated literature on the impact of mutations on various functional aspects of the virus. The COG-UK-ME interface allows users to track mutations that are a potential threat based on a phenotypic impact on virus biology or by conferring resistance to the human immune response, including that boosted by vaccines or antiviral drugs. Rapid analyses of VOCs, e.g. the accumulation of any mutation, can also be obtained from the COG-UK-ME interface. Of particular interest to researchers and for therapeutics are mutations that either have an antigenic role or affect T cell binding. These mutations are intensely monitored by researchers and Public Health Agencies to identify any new variant that could escape the immunity generated by vaccines. Timely identification of VOC/VUI samples can facilitate access to clinical specimens to isolate live virus and serum for further immunological evaluation.
Although amino acid sequence analyses are not sufficient to determine the functional effect of a single mutation on SARS-CoV-2 fitness when taken in isolation, COG-UK-ME strives to collate all the available literature on SARS-CoV-2 mutations and provides data to support experiments that investigate the change in phenotype that these mutations might confer on variants.

Methods
Throughout COG-UK-ME, Wuhan-Hu-1 (NCBI RefSeq NC_045512) is used as the reference sequence for nucleotide coordinates, codon numbering within viral proteins, and wild-type amino acid assignments. Sequences are regularly uploaded onto the MRC-CLIMB platform. Sequences with quality issues are excluded. Amino acid replacements and in-frame indels in each sequence are identified (Nicholls et al. 2021a).

Data preparation
Sequence metadata files are processed on the CLIMB-COVID platform (Nicholls et al. 2021a) using the R statistical programming language (Team 2021) and the Tidyverse collection of R packages (Wickham et al. 2019). Non-UK sequences are filtered out. Amino acid replacements and reference amino acids are counted for all times and for a 28-day period up to and including the latest sequence date for the UK and the four UK nations. Counts are linked with data on antigenic changes, data on therapeutics, epitope data and predicted epitope binding percentile  (Lex et al. 2014). Occurrence is shown in the full UK SARS-CoV-2 genome sequence dataset (A) and in a dataset compiled of sequences in the latest 28-day period (B). Spike amino acid substitutions known to affect either casirivimab or imdevimab mAbs were considered. The upper histogram shows the number of sequences per mutation (dots) or combination of mutations (lines), and the bottom left histogram presents the number of sequences with each specific substitution. Rows are coloured according to the mAb to which the greatest fold decrease in binding was recorded (blue = casirivimab, orange = imdevimab), with a lighter shade indicating a fold decrease of less than 100 and darker shade indicating 100 or greater. rank values. Counts of all amino acids across all positions in the spike protein are prepared for the visualisation of sequence logos.
PANGO lineage name aliases are resolved to the full lineage names using the current designations at https://github.com/covlineages/pango-designation (last accessed date: 16 March 2022). VOC and VUI lineages are counted by day and by week for the UK and the four UK nations, counting AY.x sub-lineages within the Delta VOC hierarchically. VOC lineages are also counted by week and by geographic region according to the 12 Nomenclature of Territorial Units for Statistics first-level regions of the UK (NUTS1). Antigenic amino acid replacements and deletions in the spike protein are counted for VOC lineages, excluding lineage defining replacements, as defined by the UK Health Security Agency at https://github.com/phe-genomics/variant_definitions (last accessed date: 16 March 2022). Following data preparation, the resultant data files are transferred from CLIMB to a web server for visualisation.

Literature search
We searched PubMed, LitCovid, BioRxiv, and MedRxiv using the search term 'SARS-CoV-2' combined with 'mAbs', 'monoclonal', 'convalescent', 'neutralisation/neutralization', 'epitope', and 'antibody' for studies published from January 2020 to July 2021 and manually searched the references of select articles for additional relevant articles ( Figure S2). We also searched BioRxiv, and MedRxiv using combinations of the search terms: 'COVID19', 'remdesivir','favipiravir','molnupiravir','nirmatrelvir','ritonavir','paxlovid','antiviral','binding','efficacy','effective','resistance','resistant','sensitivity','inhibit','evasion','mutation',and 'variant'. Results reporting on SARS-CoV-2 mutations that cause resistance to antiviral drugs were recorded and published on the dashboard. This included many different types of assays and studies: neutralisation assays, receptor binding assays, clinical efficacy studies, transcriptional inhibition assays, and in silico indications of resistance. Antiviral drugs are included in the review if they are clinically approved somewhere in the world or are in Stage 3 clinical trials. This search is repeated each week, allowing the timely updating of the dashboard when new research arises.

Data visualisation
The Shiny framework is used to create the COG-UK-ME web application, hosted in the Shiny Server environment . In order to maximise performance across multiple concurrent users, most values are pre-computed in the data preparation process on CLIMB, with the web application focussing on data visualisation.
The bar charts for VOC lineages and mutations, the geographical maps of VOC lineages, and the scatter plot of spike profiles are generated using ggplot2, with interactive features added using Plotly (2015). The heatmap of antigenic changes in the spike protein is generated using the ComplexHeatmap package (Gu et al. 2016), antigenic replacements, and structural domain classifications. Amino acid replacements in epitopes are visualised as sequence logos using the ggseqlogo package (Wagih 2017). UpSet plots for mutations affecting Ronapreve are generated using the UpsetR package (Lex et al. 2014;Conway et al. 2017). The web application user interface is created using the shinydashboard (Chang and Ribeiro 2021), shinydashboardPlus (Granjon 2021), shinyWidgets (Perrier et al. 2021), and shinyjs packages (Attali 2020).
For the visualisations of the VOC spike mutations on the structure, the file 6vsb_1_1_1.pdb containing a complete model of the full-length glycosylated spike homotrimer in open conformation with one monomer having the receptor-binding domain in the 'up' position was obtained from the CHARMM-GUI Archive (Woo et al. 2020;CHARMM-GUI Archive, 2021). This model is itself generated based upon a partial spike cryo-EM structure (PDB ID: 6VSB (Wrapp et al. 2020)). For visualisation, the model was trimmed to the ectodomain (Residues 14-1164) and the signal peptide (Residues 1-13) and glycans were removed. Figures were prepared using PyMol (Schrödinger-LLC 2010).

Supplementary data
Supplementary data is available at Virus Evolution online.