Spatio-temporal dynamics of intra-host variability in SARS-CoV-2 genomes

Abstract During the course of the COVID-19 pandemic, large-scale genome sequencing of SARS-CoV-2 has been useful in tracking its spread and in identifying variants of concern (VOC). Viral and host factors could contribute to variability within a host that can be captured in next-generation sequencing reads as intra-host single nucleotide variations (iSNVs). Analysing 1347 samples collected till June 2020, we recorded 16 410 iSNV sites throughout the SARS-CoV-2 genome. We found ∼42% of the iSNV sites to be reported as SNVs by 30 September 2020 in consensus sequences submitted to GISAID, which increased to ∼80% by 30th June 2021. Following this, analysis of another set of 1774 samples sequenced in India between November 2020 and May 2021 revealed that majority of the Delta (B.1.617.2) and Kappa (B.1.617.1) lineage-defining variations appeared as iSNVs before getting fixed in the population. Besides, mutations in RdRp as well as RNA-editing by APOBEC and ADAR deaminases seem to contribute to the differential prevalence of iSNVs in hosts. We also observe hyper-variability at functionally critical residues in Spike protein that could alter the antigenicity and may contribute to immune escape. Thus, tracking and functional annotation of iSNVs in ongoing genome surveillance programs could be important for early identification of potential variants of concern and actionable interventions.

3.9 million deaths and >181 million confirmed cases of COVID- 19 have been reported to the WHO from around the globe. As the virus accumulates novel mutations, probing millions of viral genome sequences has helped track the continual evolution of SARS-CoV-2. By rigorous cataloging of mutations, genomic surveillance has assisted in predicting the upcoming trajectories of infection by aiding epidemiological models with metrics of transmissibility and virulence of viral strains in a population.
One of the differentiating features of RNA viruses is their exceptionally high nucleotide mutation rates that results in a set of closely related but non-identical genotypes, termed as quasispecies, in a host (1)(2)(3)(4)(5)(6). Although random mutations have been shown to be largely deleterious (7,8), some strains may develop enhanced survival ability by evolving mechanisms to resist antiviral responses of the host immune system or drug resistance (9,10). The fitness of the intra-host viral population has been shown to be dependent on the cumulative contribution of all strains, including haplotypes harbouring minor variants (11). Various factors confer viral heterogeneity within a host like low fidelity of RNA-dependent RNA polymerase (RdRp) and RNA editing by host deaminases combined with the intrinsic high-rates of replication of RNA. Moreover, recent studies have shown mutations in the RdRp to affect the overall genome variability in SARS-CoV-2 (12,13) whereas RNA editing by host deaminases has been proposed as a major driver of intra-host heterogeneity in the SARS-CoV-2 transcriptomes (14,15). Understanding these intra-host variations is important as they could potentially alter function through changes in RNA secondary structure, regulatory domains, as well as protein function and thereby govern host-pathogen interactions.
Genomic heterogeneity of the virus within a host can be analysed by capturing intra-host Single Nucleotide Variations (iSNVs), observed as heteroplasmic sites (often referred to in the mitochondrial genome context) in sequence reads. In this study, we accessed samples of COVID-19 diagnosed patients sequenced on an Illumina platform from two different time-periods of the pandemic. In Phase 1, we analysed 1347 samples collected latest by June 2020 from China, Germany, Malaysia, United Kingdom, United States and different subpopulations of India to perceive a genome-wide iSNV map in SARS-CoV-2 infected populations. We observed 16 410 iSNV sites spanning the viral genome, including residues that defined the B.1 and B.6 lineages that were prominent in the listed populations before June 2020. Potential APOBEC and ADAR editing signatures were present throughout the SARS-CoV-2 genome with some samples indicating ADAR hyperactivity. Further, mutations in RdRp and host-mediated RNA editing showed influence on the differential prevalence of iSNVs between hosts. Structural analysis of iSNVs in the Spike protein revealed a high density of alterations in functionally critical residues that could alter the antigenicity and may contribute to immune escape. Interestingly, ∼42% of the iSNV sites identified in these samples were found to be reported as an SNV by 30 September 2020 in one or more samples submitted in GISAID, increasing to ∼80% by 30 June 2021. The likelihood that iSNVs can overtime manifest into SNVs in populations was further substanti- The number of processed samples for each population curated from samples submitted in NCBI SRA till 23 May 2020 (China, Germany, Malaysia, United Kingdom, USA) and samples sequenced in laboratories from Bhubaneswar, Delhi, and Hyderabad (India) latest by 11 June 2020. ated in Phase 2 by analyzing 1774 samples sequenced in India between November 2020 and May 2021. In these samples, iSNVs were found at most of the Delta (B.1.617.2) and Kappa (B.1.617.1) lineage-defining genomic positions prior to their fixation as SNVs by February 2021. These results highlight the importance of recording iSNVs as an extension to the genomic surveillance programs to enable more accurate models for viral epidemiology.

Datasets
We processed Illumina next-generation sequencing samples from two different time-periods to understand the prevalence of SARS-CoV-2 intra-host single nucleotide variations (iSNVs) in infected individuals across various populations and to explore the possibility of iSNVs getting fixed in a population as a variant. For Phase 1, we accessed 1347 publicly available samples submitted in the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) (16) till 23 May 2020 from China, USA, Germany, Malaysia and the United Kingdom. Samples from laboratories in Bhubaneswar, Delhi and Hyderabad, that had been sequenced latest by 11 June 2020, were also included to represent SARS-CoV-2 infected East, North and South Indian populations respectively (Table 1). For Phase 2, we analyzed 1774 samples sequenced in India between November 2020 and May 2021 for spatio-temporal review of the iSNV states of Delta (B.1.617.2) and Kappa (B.1.617.1) lineage-defining variations.
The GRCh37 human reference genome and NC 045512.2 SARS-CoV-2 Wuhan-Hu-1 reference genome were used for all analyses. High-coverage FASTA sequences submitted in GISAID (17) were accessed to get a position-wise share of SNVs in the global SARS-CoV-2 population using the 2019nCoVR database (18).

Data processing
The sequence files that were retrieved from NCBI SRA were downloaded in the SRA compressed file format Nucleic Acids Research, 2022, Vol. 50, No. 3 1553 and converted to FASTQ format files using NCBI's SRA-Tools (https://github.com/ncbi/sra-tools). Adapter sequences were trimmed and low-quality reads were filtered using Trimmomatic (v = 0.39) for downstream analysis (19). The trimmed reads were then aligned against the human reference genome (GRCh37/hg19) using HISAT2 (20). All paired-end reads unmapped to the human reference genome were filtered out using SAMtools (21) and mapped to the SARS-CoV-2 reference genome (NC 045512.2) using Burrows-Wheeler Aligner (BWA-MEM) (22). Since iSNVs cater to small fractions of reads, to avoid any bias induced by PCR amplification (23), all duplicate reads were removed from the SARS-CoV-2 aligned files using Picard Tools (https://github.com/ broadinstitute/picard). BCFtools (21) was used to generate the consensus SARS-CoV-2 FASTA sequence. Sample quality scores and alignment statistics were obtained using FastQC (24), QualiMap (25) and SAMtools (21) at each step.
Further, samples with <1000 SARS-CoV-2 aligned unique reads and error rates of ≥0.005 were rejected (Supplementary Dataset S2). In the qualifying samples, we used REDItools2 (26) to call iSNVs in SARS-CoV-2 reads with a quality score ≥30. Additional position-level parameters in REDItools2 were used to discard some positions on the basis of site quality. The sites whose reads did not achieve a mean quality score of ≥33 were rejected and sites that were located in long homopolymeric regions of length ≥4 bp (Supplementary Dataset S1) were excluded due to known sequencing errors in these regions (27). From the REDI-tools2 output, high-quality iSNV events were further filtered out using the following thresholds: number of minor allele reads ≥5, base coverage ≥20 and minor allele frequency ≥0.01. In order to test the pipeline's efficiency in identifying iSNVs in SARS-CoV-2 genome, we additionally downloaded and processed 500 single-end NGS samples sequenced in replicates on an Illumina platform from NCBI SRA with the Project ID PRJNA655577. We observed a high concordance between the replicates (Pearson's r = 0.984, Supplementary Figure S1) in sites qualifying the pipeline thresholds. Thus, we used these cutoffs as a qualifying criterion for defining a heteroplasmic site as an iSNV in our subsequent analyses.

SARS-CoV-2 virus culture in Vero E6 cells
Virus isolation. Virus isolation from the patient's nasooropharyngeal swab samples was done following Harcourt et al., 2020 (28). Vero E6 cells were grown in a six-well plate at 80% confluency in DMEM supplemented with antibiotics (Penicillin-Streptomycin) and 10% Fetal bovine serum. Next day, Vero E6 cells were infected with 11 SARS-CoV-2 isolates at MOI of 1 in serum free media for 1 h at 37 • C with gentle rocking every 10-15 min. One hour post infection, the inoculum was aspirated and cells were washed with PBS and supplemented with fresh complete media. The culture supernatant was aspirated 48 hours post infection and recursively used to infect fresh cells for seven times. The cells were washed with 1× PBS and collected in Trizol reagent for total RNA extraction. SARS-CoV-2 virus isolation and culture was conducted in a biosafety Level-3 facility according to the biosafety guidelines issued by the Department of Biotechnology, Government of India.
qRT-PCR for viral load estimation and mRNA expression of APOBEC3B and ADARB1. RNA isolation from culture cells was performed using TRIzol Reagent (Invitrogen, cat. no. 10296010). The isolated RNA was subjected to qRT-PCR for determining the viral load. We performed one-step multiplex real-time PCR using TaqPathTM 1-Step Multiplex Master Mix (Thermo Fisher Scientific, cat. no. A28526), targeting three different SARS-CoV-2 genes with primer and probe sets specific for Spike (S), Nucleocapsid (N) and Open Reading Frame 1 (ORF1).
Viral RNA library preparation and sequencing. Amplicon libraries for viral genome sequencing were prepared using QIAseq FX DNA Library Kit and QIAseq SARS-CoV-2 Primer Panel (Qiagen, cat. no. 180475, cat. no. 333896) as instructed by the manufacturer's manual, and the library was subsequently sequenced using Illumina NextSeq 550 platform. The adapter sequence used for each sample was compatible with Illumina sequencing instruments with 96sample configurations (Qiaseq unique dual Y-adapter kit). The average insert length was in the 250-500 bp range. To characterize iSNVs, the Chlorocebus sabaeus genome (chlSab2 assembly) was used inplace of the human genome in the pipeline.

Identification of hyper-variable samples
Z-score values based on the distribution of number of iS-NVs per sample were calculated for each population using the Python package SciPy (29). Samples with a Z-score value >3 were classified as hyper-variable in each cohort ( Figure 3A).
In order to categorize the specific amino acid change and the proteins containing the iSNVs, they were annotated using SnpEff version 4.5 (30). Conservation analysis of the full-length sequences of proteins was done on the basis of the six other coronaviruses (HCoV-229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV). The multiple sequence alignment of seven protein sequences was performed by clustal-omega (31). Conservation calculation of Spike amino acids was done using the ConSurf tool (32). The high-risk hyper-variable sites resulting in a single amino acid change in Spike proteins were screened and prioritized on the basis of sequence conservation and frequency, i.e. number of times it is present in the total 22 hypervariable samples. ConSurf conservation scores >6 were considered and a total of 140 variants were filtered for mapping onto the structure to understand the effect on protein function and structure. We utilized PyMol (33) to determine interface residues from the available crystal structures. PDB 6M0J (34) and 6M17 (35) for ACE2 RBD binding and PDB 6W41 (36) and 7BZ5 (37) for antibody binding residues were used from Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) (38).

Data handling and visualizations
Custom python scripts were used to automate the process of downloading and retrieving samples from NCBI SRA as well as to process samples through the pipeline. Python libraries such as NumPy (39), Pandas, Matplotlib (40) and Seaborn were used for data handling and visualizations. SciPy (29) was used to compute pairwise two-sided t-tests and to calculate distribution statistics and Pearson's correlation coefficients.

Intra-host single nucleotide variations (iSNVs) in SARS-CoV-2 genomes
We analysed 1347 transcriptomic samples obtained from patients diagnosed with COVID-19 by June 2020 from different populations (Table 1). 1126 samples could be aligned to the SARS-CoV-2 reference genome. To ensure specificity of iSNV detection, the filtering criteria and cutoffs used in the sequence reads were established by analysing an additional set of 500 samples sequenced in replicates (Supplementary Figure S1). 929 of the 1126 samples (82.5%) harboured one or more iSNVs with frequencies ranging between 0.01 and 0.80 (Supplementary Dataset S3a, b). In these 929 samples, we recorded a total of 47 779 iSNVs with a median of 11 iSNVs per sample ( Supplementary Figure S2A), revealing extensive heteroplasmy in samples. We did not observe a significant correlation between the mean sample coverage depth and the number of iSNVs per sample suggesting that these events are unlikely to be a consequence of sequencing artefacts (Pearson's r = 0.158, Supplementary Figure S2C). Further, in a sample subset, we did not find any correlation between the number of iSNVs in a sample and the Ct values for gene N (Pearson's r = −0.121, P = 0.095), ORF1 (Pearson's r = −0.084, P = 0.251) and S (Pearson's r = −0.060, P = 0.413) (Supplementary Figure  S2G).
Nucleotide changes due to iSNVs presented a skewed pattern, with A-to-G transition accounting for about onefifth of all changes ( Figure 1A) although presenting the lowest median frequency (Supplementary Figure S2D). Moreover, A-to-G changes were observed in disproportionate numbers in few samples while C-to-T and T-to-C changes were more consistently observed with a median of 3 events per sample ( Figure 1A). iSNVs contributed to variability throughout the SARS-CoV-2 genome, both in the coding as well as in the non-coding regions. A total of 16 410 unique iSNVs were recorded at 13 726 genomic positions (2578 polymorphic) (Supplementary Dataset S5) in one or more samples distributed almost evenly in all genes and protein-coding domains of the genome (Supplementary Figure S3A). Though the frequency spectrum of iS-NVs was nearly uniform with respect to the nature of amino acid sequence change, a large fraction of the sites lead to non-synonymous variations ( Supplementary Figure S2E). Gain of stop codons that could alter the amount of functional proteins from the viral genomes were also observed.

Putative role of host and viral factor in iSNV incidence
Various factors have been shown to confer intra-host variability in many families of RNA viruses. Low-fidelity of RNA-dependent RNA polymerase (RdRp) has been established as a major contributor to this heterogeneity. Recent reports showed that mutations in the RdRp of SARS-CoV-2 may impact the number of mutations in the viral genome (12,13). We found significant difference in the number of iSNVs between the wild-type and C14408T (P < 3e-3) and C13730T (P < 2e-6) mutant RdRp samples in India with higher prevalence in the mutant RdRp ( Figure  1B). Differential prevalence of iSNVs between the C14408T RdRp mutation carrying samples in India and USA was also observed (P < 2e-8). Although, we did not find recurrent iSNV sites in mutant RdRp samples (Supplementary Dataset S6).
A bias towards the C-to-T/G-to-A and A-to-G/T-to-C nucleotide changes suggests RNA editing activity in both the strands of SARS-CoV-2 genome by the APOBEC (apolipoprotein B mRNA editing catalytic polypeptidelike) and ADAR (Adenosine Deaminase RNA Specific) family of enzymes respectively ( Figure 1A). APOBECs target single-stranded nucleic acids for deamination of cytosines into uracils (C-to-U) reflecting as a C-to-T transition in the reference genome. ADARs, on the other hand, deaminate adenines into inosines (A-to-I) on doublestranded RNA (dsRNA). Inosine preferentially base pairs with cytidine which leads to incorporation of guanine in subsequent replication and an A-to-G transition. Editing could also occur in the negative strand of SARS-CoV-2 genome that would reflect as G-to-A (APOBEC) and T-to-C (ADAR) changes in the reference genome. Both C-to-T/G-to-A and A-to-G/T-to-C combinations were present throughout the genome ( Supplementary Figure S3B) and contributed significantly to synonymous and missense changes, whereas only the former contributed to stop gains (Supplementary Figure S2F). We also performed qRT-PCR analysis of APOBEC3B and ADARB1 expression in Vero cells where we clearly observed high and constitutive ADARB1 expression in both mock as well as in SARS-CoV-2 infected Vero cells (n = 2), although APOBEC3B expression could not be captured (Supplementary Dataset S7). Further, we infected Vero cells with independent isolates of SARS-CoV-2 from 11 samples in the East India cohort. Comparing iSNVs at the 0th-passage and seveth-passage, we found significant accumulation of iSNVs in just seven passages (Supplementary Dataset S8). Interestingly, the relative pattern of iSNVs with respect to the nucleotide change remained similar (Supplementary Figure S6).

Contribution of iSNVs to SARS-CoV-2 diversity
To understand the contribution of iSNVs to the diversity of SARS-CoV-2 across different populations, we plotted a radial heatmap of the top 1% frequently observed genomic positions exhibiting a wide range of iSNV frequencies in major cohorts ( Figure 1C). Each concentric ring represents an iSNV frequency range of 0.2 and the different colour gradients depict various populations with percentage of samples at a given position in each cell. Thus, the innermost ring represents a cohort-wise share of samples which show variations in a small fraction of reads (≤0.2) at a given position whereas the outermost ring represents the ones where the variant seems to have achieved near fixation (>0.8) and would be reported as an SNV in the population. Most of these positions were observed to be shared between global populations and also within Indian subpopulations (Supplementary Figure S4). Interestingly, many samples carried iSNVs at positions which defined the B.1 lineage (C241T, C3037T, C14408T and A23403G) and the India specific B.6 lineage (C6310A, C6312A, G11083T, C13730T and C23929T) (Figure 2A) (41,42). Noteworthy, all B.1 lineagedefining variants depicted a C-to-T or an A-to-G nucleotide change.

Spatio-temporal dynamics of iSNVs
Though most of the sites seem to convert to a fixed variant in one or all populations in Figure 1C and Supplementary Figure S4, positions like 1707, 15435, 24622, 26554 and 29029 that depict an increasing trend in the iSNV frequency in few cohorts suggest that some iSNVs could get fixed in the population at a later date. We found significant overlap of iSNV sites with SNVs that have been reported in consensus sequences submitted in GISAID. ∼42% of the 16 410 unique iSNVs identified in the samples were recorded as SNVs in one or more samples submitted in GISAID by 30th September 2020 with the exact change in the nucleotide sequence, increasing to ∼80% by 30th June 2021 (Supplementary Dataset S5). As of June 2021, WHO had listed four lineages as variants of concern (VOC) and seven as variants of interest (VOI). ∼70% of all positions defining these lineages were recorded as iSNV sites in one or more samples including 62.  S4a-d). We found that at majority of the Delta and Kappa lineage-defining positions, SNVs appeared to coincide with or after the iSNVs first surfaced in the population ( Figure  2B). This further strengthens the postulate that iSNVs may introduce SNVs into the travelling viral strains in a population which may beget newer lineages with heightened properties of transmissibility and virulence as seen with the Delta variant (43).

Potential functional consequence of hyper-variation on Spike protein
We observed a few samples in each cohort to be hypervariable as reflected by an excessive number of iSNVs. In order to identify these population outliers, we computed the Z-score values based on the distribution of the number of iSNVs per sample ( Figure 3A). In these samples, we observed A-to-G substitutions accounting for more than one-third of all changed genomic positions implying extensive ADAR mediated editing activity ( Figure 3B) (44). Gto-T and C-to-A changes together accounted for ∼37% of iSNV sites suggesting these samples to also be substrates for potential reactive oxygen species (ROS) events (15). Apart from such host-driven factors, noise introduced by sequencing or PCR errors may also contribute to these outliers.
To understand how iSNVs could impact function through amino acids substitutions, we examined the mutational spectra of the Spike protein in more detail, given that the initial host-viral response is triggered by the attachment of the Spike protein of SARS-CoV-2 with the host ACE2 receptor, disrupting the host cell membrane and activating viral entry (45). Of the 6356 annotated variants in hypervariable samples, 909 correspond to protein-coding variants within the Spike protein. Amongst the 909 iSNVs, 628 were found to be non-synonymous, 184 synonymous and 67 stop-coding. Figure 3C depicts the frequency of these sites on different functional protein domains, including the critical receptor-binding domain (RBD). The most frequently altered variants were D614, Y91, D80, I105 and I468. The D614G mutation (one of the B.1 lineage defining variants) is near the S1/S2 cleavage site and has already been shown to be more geographically spreading than the D614 type (46).
We looked at other Spike mutations in more detail with respect to their functional consequences. Most striking, some amino acid sites within Spike seem to evolve into more than one type of variant. In Figure 3D, we depict these sites with residues G1171, K1255, P579, I68, S691, T716, Q779, A783, E918 evolving into either of >3 amino acid substitutions. The conservation of each residue, i.e. how variable that site is amongst closely related coronavirus species was also calculated. For instance, the HR1 region and residues near the transmembrane are observed to be more conserved while NTD and RBD domains have highly variable regions, in that, especially the receptor binding motif (RBM) (residue 437-508) seems to be less conserved ( Figure 3E). Interestingly, RBM was also found to be variable in our study where specific amino acid at 468 conferred variations in ∼8 out of 22 hyper-variable samples. The highly conserved amino acids represent a similar function of the protein across species but the substitution of crit-ical amino acids around functional sites like RBM might lead to evolved or newer biological activities.
In a recent study, accelerated evolution of the virus over time leading to a repeated resurgence in an immunocompromised infected patient has been reported (47). Genomic analysis revealed that the individual had not been infected multiple times. Instead, the virus had lingered and rapidly mutated in the body leading to non-synonymous changes predominantly in the Spike region. The Spike protein and RBD harboured 57% and 38% of the total changes despite occupying only 13% and 2% of the virus genome respectively. The changes also included Q493K and F486I that corroborated with an earlier study that had recognised additional mutations (N74K, F79I, T259K, K417E, K444Q, V445A, N450D, Y453F, L455F, E484K, G485D, F490L, F490S, H655Y, R682Q, R685S, V687G, G769E, Q779K and V1128A) to be implicated in antibody resistance (48). Besides, N439K mutation in Spike has also been shown to be involved in immune escape from a panel of neutralizing monoclonal antibodies in another study (49). 20 out of these 23 positions overlap with sites that have been shown to carry iSNVs in our samples (Supplementary Dataset S10).

DISCUSSION
The course of the current pandemic has seen huge efforts in sequencing SARS-CoV-2 genomes from COVID-19 diag-nosed patients and has helped catalogue the spread of mutations since the initial strain reported from Wuhan back in December 2019. Genomic surveillance has become an essential aid to epidemiological models and has facilitated prior predictions of variation(s) of concern through observations of inter-host differences in the viral genome over a period of time. In this study, we explored intra-host variability in the SARS-CoV-2 genome at spatio-temporal scales as an extended tool for genomic surveillance and examined factors contributing to its prevalence and its downstream effects.
intra-host Single Nucleotide Variations (iSNVs) can alter the fitness of the strains by influencing its virulence, infectivity or transmissibility properties which may be both beneficial or detrimental to the virus and lead to fixation or removal of alleles from the populations. Of the 16 410 unique iSNVs recorded in samples collected latest by June 2020, ∼42% and ∼80% had been recorded as an SNV in samples submitted in GISAID till 30 September 2020 and 30 June 2021 respectively (Supplementary Dataset S5). Given the global spread of the infection, even within the relatively small subset of samples analysed in this study from select population groups which were collected in the first few months of the pandemic, ∼70% of all the variants listed by the WHO as VOCs or VOIs by June 2021 had been recorded as an iSNV site in one or more samples (Supplementary Dataset S9). Recent studies have also suggested that the intra-host heterogeneity of SARS-CoV-2 genome in individuals may be transferable (50)(51)(52), and given that a minuscule percent of population is responsible for most of the local transmission of SARS-CoV-2 infection (53)(54)(55), there arises a possibility of iSNVs harbouring genome to be passed through a super-spreading event (56). This, in an isolated population, might also lead to an altered haplotype with some unique variations leading to fixation or removal of alleles from the population (57). By temporally capturing iSNVs and the consequential variant appearance in the population, we could determine iSNV sites in the East Indian population that possibly transitioned into SNVs (Supplementary Figure S5). Analysing samples sequenced in India between November 2020 to May 2021, we found majority of Delta (B.1.617.2) and Kappa (B.1.617.1) lineagedefining positions to carry iSNVs before the incidence of the subsequent SNV in the population ( Figure 2B). This indicates that conjoint analysis of iSNVs with such gain and loss of variants in spatio-temporal scales might provide some clues of the evolvability of sites and may prove to be an important asset in genomic surveillance. Moreover, cooccurrence of such events in distinct isolated populations may result in the independent origin of lineages with similar properties.
Although the extent of variability differs amongst individuals, there seem to be preferred iSNV sites that are shared across populations ( Figure 1C, Supplementary Figure S4) (58,59). By the end of 2020, the B.1 lineage had become the most prevalent in most parts of the world with near disappearance of other lineages which were once the predominant strain in certain regions. For example, the B.6 lineage which was reported to be primarily endemic to India (41,42) in April 2020 had disappeared from the entire population by October 2020 (Figure 2A). Given that all B.1 lineage-defining positions (241, 3037, 14408 and 23403) are iSNV sites with C-to-T and A-to-G changes ( Figure 1C), host-mediated RNA editing might have fueled the positive selection of the B.1 lineage. The APOBEC and ADAR family of enzymes are extensively reported to edit different families of viral genomes (60)(61)(62)(63)(64) including the coronaviruses (65). SARS-CoV-2 being zoonotic and of recent origin, seems to harbour a large number of sites that could be potentially targeted by the host editing machinery. The signatures of these enzymatic activities seem to be reflected in the SARS-CoV-2 genomes ( Figure 1A) (14,15), supported by high expression of ADARB1 revealed by RT-qPCR of both uninfected and SARS-CoV-2 infected Vero cells (Supplementary Dataset S7). Although, A-to-G changes were overrepresented in a few samples, APOBEC-mediated Cto-T changes were more consistently observed in our samples ( Figure 1A) which may explain the overrepresentation of C-to-T SNVs in the SARS-CoV-2 populations (66,67). Besides, we also observed substantial abundance of G-to-T and C-to-A substitutions in some samples. These might be due to Reactive Oxygen Species (ROS), as hypothesized in a recent study (15). ROS activity oxidizes guanine to 7,8-dihydro-8-oxo-2 -deoxyguanosine (oxoguanine) that base pairs with adenine, leading to G-to-T transversions. This change on the negative strand would be reflected as a C-to-A transversion in the reference genome. ROS also has its role implicated in mutagenesis of many other virus families (68).
Low fidelity of RNA-dependent RNA polymerase (RdRp) is known to be a major driver of intra-host heterogeneity of RNA viruses. Correlating the number of iSNVs against mutations in RdRp revealed differential prevalence of iSNVs between the wild-type and C14408T and C13730T mutant RdRp in samples from India ( Figure 1B), indicating positive selection of a more error-prone RdRp to maintain an active buffer of minor variants. The difference in the prevalence of iSNVs could also arise out of varied RNA editing across populations ( Figure 1B), highlighting the active role of the host. The editing enzymes have been shown to display genetic variability in populations, perhaps a consequence of selection based on pathogenic loads (69,70). Transmission of viruses through different host populations could thus shape the mutational landscape and govern the evolution of the viral genomes.
Ongoing selection in the host may contribute to the fixation of some of the sites as variants in SARS-CoV-2 genomes with potential consequences on the functioning of the viral proteins. Structural analysis of iSNVs that occur in Spike shows that >60% of these variants result in a substantial change in the property of these amino acids (Supplementary Table S1). Changes in specific amino acids, especially mutations close to functional sites, may lead to drastic changes in protein structure as they can either introduce significant change in the sidechain or charge of the residue. Also, it has been shown that APOBEC influenced C-to-T substitutions elevate the frequency of hydrophobic amino acid coding codons in SARS-CoV-2 peptides (71). We also mapped 140 iSNVs onto the protein structure by utilizing a full-length Spike protein model of closed conformation ( Figure 3F). Some iSNVs are located on the surface of the receptor-binding domain (RBD) which is responsi-ble for binding to the ACE2 receptor in the host cell. These would directly impact the binding of Spike protein to the receptor and might have a putative effect on the binding interface residues shown in surface representation ( Figure 3G). Noteworthy, we observed iSNV sites at 20 out of 23 mutations in the Spike protein that have been recently reported to confer antibody resistance (Supplementary Dataset S10). These mutations can have major implications in vaccine response as they could alter the immunogenicity of the antigenic RBD peptide leading to differential antibody titers in infected individuals (72). Overall, our findings indicate a comprehensive survey for one of the SARS-CoV-2 proteins and we propose that some of these distinct patterns of sites in hyper-variable samples may directly interfere with specific functional output.
In conclusion, temporally tracking within-host variability of the virus in individuals and populations might provide important leads to the sites that are favourable or deleterious for virus survival. This information would be of enormous utility for predicting the spread and infectivity of viral strains in the population. Conjoint analysis with the intrahost variability of the SARS-CoV-2 genome should be the next step.

DATA AVAILABILITY
The commands used in the pipeline, automated codes, detailed individual sample information and statistics are available at the following GitHub repository --https://github. com/pxthxk/iCoV19.