Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Sequence Characteristics of Coronavirus Disease 2019 (COVID-19) Persistence and Reinfection

Abstract
Background. Both severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) reinfection and persistent infection have been reported, but sequence characteristics in these scenarios have not been described. We assessed published cases of SARS-CoV-2 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection.
Methods. A systematic review of PubMed was conducted to identify cases of SARS-CoV-2 reinfection and persistence with available sequences. Nucleotide and amino acid changes in the reinfecting sequence were compared with both the initial and contemporaneous community variants. Time-measured phylogenetic reconstruction was performed to compare intrahost viral evolution in persistent SARS-CoV-2 to community-driven evolution.
Results. Twenty reinfection and 9 persistent infection cases were identified. Reports of reinfection cases spanned a broad distribution of ages, baseline health status, and reinfection severity, and reinfection occurred as early as 1.5 months or more than 8 months after the initial infection. The reinfecting viral sequences had a median of 17.5 nucleotide changes with enrichment in the ORF8 and N genes. The number of changes did not differ by the severity of reinfection, and reinfecting variants were similar to the contemporaneous sequences circulating in the community. Patients with persistent coronavirus disease 2019 (COVID-19) demonstrated more rapid accumulation of sequence changes than seen with community-driven evolution, with continued evolution during convalescent plasma or monoclonal antibody treatment.
Conclusions. Reinfecting SARS-CoV-2 viral genomes largely mirror contemporaneous circulating sequences in that geographic region, while persistent COVID-19 has been largely described in immunosuppressed individuals and is associated with accelerated viral evolution.

infection, especially in immunosuppressed individuals. Like reinfection cases, persistent COVID-19 can also span the range of disease severity, from asymptomatic to severe disease, and recurrent symptoms can last for months [8][9][10][11]. Differentiating between persistence and reinfection can be challenging, and little is known about differences in the location and quantity of SARS-CoV-2 mutations in these scenarios. We performed an analysis of SARS-CoV-2 sequences from published cases of COVID-19 reinfection and persistence, characterizing the hallmarks of reinfecting sequences and the rate of viral evolution in persistent infection.

Data Search and Selection Criteria
We conducted a systematic literature review in PubMed through 8 March 2021 for cases of persistent COVID-19 using the search term "((covid or sars-CoV-2) AND (persistent or persistence or prolonged)) AND (sequence or evolution)." A search for COVID-19 reinfection reports was made using the terms "(covid or sars-CoV-2) AND (reinfection)." Both peer-reviewed and preprint results were evaluated. We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) for reviewing literature and for reporting search results. Additional preprints that appeared through Google search and that met our criteria were also included. For cases of reinfection, papers were included if the authors described it as a case of reinfection diagnosed more than 30 days after the initial infection and if whole-genome SARS-CoV-2 sequences or sites of mutations relative to a reference sequence (eg, Wuhan-Hu-1) from both infection time points were available. Of the 291 results from the search, 14 articles met the inclusion criteria and were included in the present report along with 2 additional preprints that were identified (Supplementary Figure 1A).
Persistent cases were included if the authors described it as a case of persistent COVID-19 infection and if longitudinal whole-genome SARS-CoV-2 sequences were available. The search returned 129 results, 7 of which met the inclusion criteria and were included in the present report along with 1 other preprint (Supplementary Figure 1B). Only sequences obtained directly from patient respiratory tract samples were included in our analysis to exclude the possibility of sequence changes during the ex vivo culture process. Three cases were excluded due to uncertainty in their classification as either reinfection or persistent infection cases (Supplementary Methods, Supplementary Table 1, Supplementary Figure 2).
Sequences were analyzed for mutations using Nextclade (https://clades.nextstrain.org/) and snp-sites (https://github.com/sanger-pathogens/snp-sites). The degree of reinfection severity, either more or less severe compared with the first infection, was classified based on an explicit determination by the authors of each article or by comparing symptoms, duration of illness, and hospitalization status between both episodes.
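As a rough illustration of the kind of comparison that underlies such mutation calls, the sketch below lists nucleotide substitutions between two sequences that are assumed to be already aligned to the same coordinates. It is not the algorithm used by Nextclade or snp-sites. The function name `nucleotide_changes` and the toy sequences are hypothetical, and the choice to skip positions where either sequence has a gap or an ambiguous base is an assumption, so deletions are not counted here.

```python
def nucleotide_changes(first: str, second: str) -> list[tuple[int, str, str]]:
    """List substitutions between two pre-aligned sequences.

    Returns (1-based position, base in first, base in second) tuples.
    Positions with a gap or an ambiguous base in either sequence are
    skipped (an assumption of this sketch, not a rule from the study).
    """
    if len(first) != len(second):
        raise ValueError("sequences must be aligned to the same length")
    unambiguous = set("ACGT")
    changes = []
    for pos, (a, b) in enumerate(zip(first.upper(), second.upper()), start=1):
        if a in unambiguous and b in unambiguous and a != b:
            changes.append((pos, a, b))
    return changes


# Toy example with made-up sequences (not real SARS-CoV-2 data).
initial_seq = "ACGTN-A"
reinfecting_seq = "ATGTAAC"
for pos, ref_base, new_base in nucleotide_changes(initial_seq, reinfecting_seq):
    print(f"{ref_base}{pos}{new_base}")
```

In practice the two sequences would first be aligned to a common reference, as was done with MAFFT in this study.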

Sequencing Dataset Compilation and Phylogenetic Tree Construction
The sequencing dataset contained a total of 262 globally representative SARS-CoV-2 genomes selected from GISAID (https://www.gisaid.org/) and sequences from the reinfection and persistence cases (Supplementary Methods, Supplementary Data 1). The sampled sequences were chosen to be representative of global sequence diversity throughout the time course of the pandemic. Sequences of variants of concern B.1.1.7 and B.1.351 were also included. Nucleotide sequence alignment was performed using MAFFT (Multiple Alignment using Fast Fourier Transform) [12]. The best-fit nucleotide substitution model was identified by model selection, followed by maximum likelihood (ML) phylogenetic tree construction using IQ-Tree with 1000 bootstrap replicates [12].

Mutation Analysis
For reinfection cases, mutations were determined in 2 ways. First, nucleotide and amino acid changes were identified for the reinfection sequences relative to the first infection sequence. The frequency of nucleotide or amino acid changes within each gene was compared with the frequency of changes in the remainder of the genome by Fisher's exact tests with a Bonferroni correction for multiple comparisons. The relationship between disease severity and number of nucleotide or amino acid changes in the genome was assessed using a Mann-Whitney test. Second, to identify unique characteristics of reinfecting viruses, both the first and reinfection sequences were compared with circulating sequences in the community, defined as sequences of the same Nextstrain clade that were sampled within 1 month, obtained from the same geographic location, and uploaded to GISAID (Supplementary Table 2, Supplementary Methods, Supplementary Data 2). Rare mutations were defined as polymorphisms that were present only in the reinfecting sequence (not the initial variant) and found in less than 1% of contemporaneous community sequences. Mutation locations are graphically represented in Circos plots [13].
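The two reinfection analyses described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the layout of the 2×2 table (changed versus unchanged sites, inside versus outside a gene) is one plausible reading of the comparison, and the names `fisher_exact_two_sided`, `bonferroni`, and `rare_mutations`, together with all counts and mutation labels in the example, are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * n_tests)


def rare_mutations(reinfecting: set, initial: set, community: list, threshold: float = 0.01) -> set:
    """Mutations found only in the reinfecting sequence and in fewer than
    `threshold` of the contemporaneous community sequences."""
    if not community:
        raise ValueError("community must contain at least one sequence")
    rare = set()
    for mutation in reinfecting - initial:
        frequency = sum(mutation in seq for seq in community) / len(community)
        if frequency < threshold:
            rare.add(mutation)
    return rare


# Illustrative counts only (hypothetical): 5 changed and 295 unchanged sites
# in one gene versus 20 changed and 9680 unchanged sites elsewhere.
p_raw = fisher_exact_two_sided(5, 295, 20, 9680)
print(p_raw, bonferroni(p_raw, n_tests=10))
```

Representing each sequence as a set of mutation labels keeps the rare-mutation rule to a single set difference followed by a frequency filter.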
For persistent infections, sequence changes were assessed at 2 time intervals: before or after convalescent plasma or monoclonal antibody treatment. Sequences sampled before convalescent plasma or antibody treatment were compared with the first sequence sampled. For sequences sampled after convalescent plasma or antibody treatment, sequence changes (both nucleotide and amino acid) were determined relative to the last pretreatment sequence. Linear regression was used to estimate the rate of viral changes in each of the 2 intervals. The slope of the trendline was compared with the latest global clock rate (29 March 2021) as estimated by Nextstrain (https://nextstrain.org/ncov/global/).
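A minimal sketch of this rate estimate is shown below: an ordinary least-squares slope of cumulative sequence changes against days since the first sample, scaled to changes per year so it can be set beside a global clock rate. The function names, the toy time points, and the value passed as `global_rate_per_year` are all hypothetical. In particular, the placeholder rate is not the Nextstrain estimate used in the study.

```python
def least_squares_slope(xs: list, ys: list) -> float:
    """Ordinary least-squares slope of ys regressed on xs (with intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sampling times are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def changes_per_year(days: list, changes: list) -> float:
    """Slope in changes per day, converted to changes per year."""
    return least_squares_slope(days, changes) * 365


def fold_over_global(days: list, changes: list, global_rate_per_year: float) -> float:
    """Ratio of the intrahost rate to a supplied global clock rate."""
    return changes_per_year(days, changes) / global_rate_per_year


# Hypothetical patient: cumulative nucleotide changes relative to the
# first sampled sequence, at four pretreatment time points.
sample_days = [0, 30, 60, 90]
cumulative_changes = [0, 3, 6, 9]

# Placeholder global rate for illustration only (an assumption).
print(changes_per_year(sample_days, cumulative_changes))
print(fold_over_global(sample_days, cumulative_changes, global_rate_per_year=25.0))
```

With only a handful of time points per patient, the slope is sensitive to any single sample, which is consistent with the uncertainty the study reports for these estimates.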

Time-Measured Phylogenetic Analysis
The temporal signal of the ML tree was examined in TempEst [14] by regressing root-to-tip divergence on sampling date, and the distribution of residuals was inspected for outliers. A high degree of clock-like behavior in the whole dataset was observed (R² = 0.721) in root-to-tip regression analysis, with a slope (rate) of 8.26 × 10⁻⁴, and the rough ancestral time of the sample was calculated as 2019.84. This suggests that the whole dataset has a realistic temporal signal and is appropriate for an estimation of temporal parameters. No outliers were found in this sample. Separate root-to-tip regression analyses of the sequences from persistent patients (especially those with >2 sequences) also supported a temporal signal for a time-measured phylogeny. To compare the evolutionary rates between the reported persistent infections and the general population infections, time-measured phylogenetic reconstruction was conducted in Bayesian Evolutionary Analysis Sampling Trees (BEAST) version 1.10.4 [15]. Nine partitions, including 8 persistent patients and the global sequences, were used as separate groups of taxa to estimate separate evolutionary rates. Due to large uncertainties with small samples, persistent patients with only 2 viral sequences were excluded from this analysis. A general time reversible (GTR) model was applied with gamma-distributed rate variation among sites. A log-normal relaxed molecular clock was used with an initial mean of 0.0008 and a uniform prior ranging from 0.0 to 1.0. A logistic growth tree prior was applied. Four independent Bayesian Markov Chain Monte Carlo (MCMC) chains of 100 million generations were performed with a sampling step every 10 000 generations to yield 10 000 trees per run. The convergence of 3 runs was diagnosed in Tracer version 1.7.1 (http://tree.bio.ed.ac.uk/software/tracer/) for all parameters, requiring an effective sample size greater than 200.
LogCombiner version 1.10.4 as part of the BEAST software package was used to combine the multiple runs to generate log and tree files after appropriate removal of the burn-in from each MCMC chain. The comparison of the evolutionary rates from the combined log file was analyzed and visualized in R version 4.0.2 (https://www.r-project.org/).
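The root-to-tip check described in this section can be illustrated with a short regression sketch. It is not TempEst's implementation: it takes root-to-tip divergences and decimal sampling dates as given and returns the slope (an approximate clock rate), the x-intercept (an approximate ancestral time), and R². The function name and the toy values, which were constructed to lie exactly on a line, are hypothetical.

```python
def root_to_tip_regression(dates: list, divergences: list) -> tuple:
    """Regress root-to-tip divergence on sampling date.

    Returns (slope, x_intercept, r_squared): the slope approximates the
    clock rate, and the x-intercept approximates the time at which
    divergence from the root was zero.
    """
    n = len(dates)
    if n < 3 or n != len(divergences):
        raise ValueError("need at least three paired observations")
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    stt = sum((t - mean_t) ** 2 for t in dates)
    sdd = sum((d - mean_d) ** 2 for d in divergences)
    std = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    if stt == 0 or sdd == 0 or std == 0:
        raise ValueError("no usable variation for a regression")
    slope = std / stt
    # Computed around the means to avoid subtracting large intercepts.
    x_intercept = mean_t - mean_d / slope
    r_squared = std * std / (stt * sdd)
    return slope, x_intercept, r_squared


# Toy data (hypothetical): divergences chosen to lie exactly on a line.
sampling_dates = [2020.0, 2020.25, 2020.5, 2020.75, 2021.0]
tip_divergences = [2e-4, 4e-4, 6e-4, 8e-4, 1e-3]

rate, ancestral_time, r2 = root_to_tip_regression(sampling_dates, tip_divergences)
print(rate, ancestral_time, r2)
```

Real data scatter around the line, so R² falls below 1. The reported value of 0.721 is the kind of figure this statistic summarizes.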

Statistical Analysis
Nonparametric Wilcoxon rank-sum or matched-pairs signed-rank tests were used to compare the number of amino acid changes between sequences. Statistical analyses were performed using GraphPad Prism 9 (GraphPad Software, San Diego, CA).
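For readers without access to Prism, the rank-sum comparison can be sketched in standard-library Python. This is a substitute illustration, not the software used in the study: it computes an exact two-sided P value by enumerating every possible assignment of the pooled ranks, which is practical only for small groups, and it does not cover the matched-pairs signed-rank test. The function names and the example values are hypothetical.

```python
from itertools import combinations


def midranks(values: list) -> list:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_exact(group_a: list, group_b: list) -> tuple:
    """Exact two-sided Wilcoxon rank-sum test.

    Returns (rank sum of group_a, P value). The P value is the share of
    all equally likely rank assignments whose rank sum lies at least as
    far from its expectation as the observed one.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    pooled = list(group_a) + list(group_b)
    ranks = midranks(pooled)
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    expected = n_a * (len(pooled) + 1) / 2
    observed_deviation = abs(observed - expected)
    extreme = total = 0
    for subset in combinations(ranks, n_a):
        total += 1
        if abs(sum(subset) - expected) >= observed_deviation - 1e-9:
            extreme += 1
    return observed, extreme / total


# Hypothetical counts of amino acid changes in two small groups.
more_severe = [9, 12, 15]
less_severe = [8, 10, 24]
print(rank_sum_exact(more_severe, less_severe))
```

For larger groups the enumeration grows combinatorially, and a normal approximation or a dedicated statistics package would be the usual choice.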

Sequence Analysis of Reinfection Cases
A total of 20 cases from 16 reports were included in this analysis (Table 1) [2][3][4][5][6][7][16][17][18][19][20][21][22][23][24][25]. A broad range of age groups were represented and 90% were under the age of 70 years. Most (80%) of the cases had no reported comorbidities, and while 1 patient had diabetes and end-stage renal disease, none had high-level immunosuppression. The interval between diagnosis of the first infection and the second infection ranged from 44 days to 282 days with a median of 113.5 days. Five patients had more severe illness during the second infection, while 6 had less severe symptoms on reinfection, including 2 who were asymptomatic on reinfection. Two cases were asymptomatic in both infections, 5 cases reported the same severity for both infections, and no information on infection severity was available for 2 cases (Table 1). Six cases reported reinfection with a virus from the same clade. Phylogenetic analysis demonstrated distinct branching for the 2 sequences in each of the reinfection cases, corroborating results discussed in the original reports (Figure 1). We compared nucleotide and amino acid changes in the reinfecting viral sequence with the initial sequence and found a median of 17.5 nucleotide changes (range: 9-37) and 9 amino acid changes (range: 6-24) compared with the original sequence (Figure 2A). The nucleotide changes between the initial and reinfecting sequences were distributed across the SARS-CoV-2 genome, with significantly higher frequencies of changes in open reading frame (ORF) 8 (ORF8) (P < .001) and N (P = .001) (Figure 2B). A similar pattern was observed with amino acid changes (Supplementary Figure 3A). All but 2 reinfection cases had at least 1 substitution or deletion in the S gene (Supplementary Table 3). Next, we assessed whether reinfection with a more divergent second virus resulted in more severe disease.
We found no significant differences in the number of nucleotide or amino acid changes in the reinfecting virus compared with the original viral variant when categorized by the severity of the reinfection (Figure 2C, Supplementary Figure 3B). Both the initial and reinfecting SARS-CoV-2 variants were similar to the sequences circulating in the community at the time of reinfection. The initial infecting variant harbored a median of only 2 rare nucleotide mutations compared with contemporaneous circulating variants in the community and the reinfecting variant contained a median of only 1 rare nucleotide mutation (Figure 2D-E, Supplementary Figure 3C).

Sequence Analysis of Persistent COVID-19 Cases
A total of 9 cases from 7 reports describing persistent infection were retrieved from our literature search. Of these 9 cases, all but one had B-cell immunodeficiency [8-11, 26-28]. Four were treated with B-cell-depleting therapy for lymphoma or autoimmune disorders, while 4 had B-cell lymphomas treated with chemotherapy (Table 2). One patient had advanced human immunodeficiency virus (HIV) infection with a CD4+ count of 0 cells/mm³ and diminished CD19+ cell counts. The median length of infection was 154 days and 33% of the cases ended in death. One patient had asymptomatic disease throughout [9]. Four patients were treated with convalescent plasma at least once during their illness [9,10,11,27], and 1 patient was treated with the monoclonal antibodies casirivimab and imdevimab [8].
Phylogenetic analysis revealed that, for each of the 9 patients, sequences formed a distinct cluster, confirming what was found in the original reports (Figure 1). New mutations emerging over time were detected in all of the patients with persistent COVID-19, with further changes identified after treatment with convalescent plasma or monoclonal antibodies (Supplementary Figure 4). Mutations occurred with significantly higher frequency in S (P < .001) and ORF7a (P = .02) and lower frequency in ORF1a (P = .02) (Figure 3A, Supplementary Figure 5A). The rate of viral evolution was plotted for each patient both for the interval before and after convalescent plasma/antibody treatment. Before antiviral treatment, the rate of sequence changes over time appeared faster than the Nextstrain estimate for the global rate of SARS-CoV-2 evolution (dotted purple line in Figure 3B, Supplementary Figure 5B). Treatment with convalescent plasma or antibody cocktail was insufficient to halt intrahost viral evolution (Figure 3C, Supplementary Figure 5C).
We also performed time-measured phylogenetic reconstruction with the pretreatment persistent sequences to compare the rate of intrahost viral evolution in persistent COVID-19 with the rate of community-driven evolution. This analysis provided further evidence that SARS-CoV-2 evolution appeared faster in these persistently infected individuals compared with the rate in the general population, although substantial uncertainties are shown in these estimates given the limited sequence sampling in each patient (Figure 3D, Supplementary Table 4).

DISCUSSION
We conducted a systematic review and pooled analysis of sequences from reports of COVID-19 reinfection and persistent infection. Reports of reinfection cases demonstrate a wide range of situations, spanning a broad distribution of ages, baseline health status, and reinfection severity compared with the initial infection. Reinfection occurred as early as 1.5 months or more than 8 months after the initial infection. Common explanations for the presence of reinfection involve either waning SARS-CoV-2 antibodies or the presence of viral escape mutations [29,30]. While most cases of SARS-CoV-2 reinfection did involve infection with a different clade (including the variants of concern B.1.1.7 and P.1), it is noteworthy that mutations were identified throughout the genomes and the frequency of mutations within the S gene was not elevated relative to the rest of the genome. In addition, individuals with more severe reinfections did not have significantly greater frequency of S gene mutations. Interestingly, the genes with the highest frequency of mutations were ORF8 and N. ORF8 is a rapidly evolving accessory protein that may antagonize host immune function [31], while the nucleocapsid is a vital structural protein that also serves as a target for both humoral and cell-mediated immune responses [32].
Finally, the presence of rare mutations was uncommon in the reinfecting virus, which largely mirrored the contemporaneously circulating variants in the region of infection. However, the reinfecting variants generally contained a substantial number of mutations compared with the initial variant, including frequent changes in the S gene, and additional studies are needed to assess whether these changes may have contributed to the risk of repeat infection.
While the number of immunosuppressed individuals with available sequences remains limited, the results suggest that the rate of viral evolution (measuring both synonymous and nonsynonymous changes) is accelerated within immunosuppressed individuals. In addition, treatment with convalescent plasma or monoclonal antibody cocktails was insufficient to fully halt viral evolution, and the emergence of viral escape with treatment has been documented [11,33]. Mutations associated with immune escape and/or more efficient replication kinetics, including E484K, S494P, N501Y, and N-terminal spike deletions, have been observed in both immunosuppressed individuals and the novel variants of concern [34,35]. The results raise the possibility that novel variants, including those harboring escape mutations against current treatments, could arise from immunosuppressed individuals and suggest that immunosuppressed individuals should be a focus of public health efforts. Among the current reports of persistent COVID-19, B-cell dysfunction appears to be a common thread, including in reports that were not included in this analysis due to a lack of available full-length sequences [36][37][38][39][40]. It is important to note, however, that T-cell function may also play a role in protection against SARS-CoV-2 [41] and a subset of these patients also had concurrent suppression of other aspects of the immune response.
Additional studies are needed to fully define the type and intensity of immunosuppression that would place patients at greatest risk of persistent COVID-19.
Two factors generally differentiated between reinfection and persistent infection scenarios. First, reinfections have so far been largely described in immunocompetent individuals, while the majority of persistent COVID-19 cases have been in immunosuppressed patients. Second, phylogenetic analysis can generally differentiate between reinfection and persistent infection, especially in cases where persistent infection allowed the longitudinal collection of more than 2 sequences. However, given the slow rate of SARS-CoV-2 evolution and limited viral diversity [42], it can be challenging to differentiate between reinfection and persistent infection, especially in situations with limited sampling and/or duration between samples.
A limitation of this work is that it relies on case reports, which can be influenced by publication bias and limit our statistical power. However, to date, there have been no systematic, large-scale, sequence-based studies of COVID-19 reinfection or persistent infections. This is partly due to the rarity of these types of cases and to the fact that initial infecting sequences are frequently unavailable for comparison with reinfecting or persistently infecting variants. Overall, our results demonstrate the need to further explore factors that increase the risk of breakthrough reinfections and persistent COVID-19. This line of investigation will have important implications for the durability of currently available vaccines and for preventing the rise of novel variants.

Supplementary Data
Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.