Abstract

Background

A novel coronavirus (CoV), severe acute respiratory syndrome (SARS)–CoV-2, has infected >75 000 individuals and spread to >20 countries. It is still unclear how fast the virus evolved and how it interacts with other microorganisms in the lung.

Methods

We have conducted metatranscriptome sequencing for bronchoalveolar lavage fluid samples from 8 patients with SARS–CoV-2, and also analyzed data from 25 patients with community-acquired pneumonia (CAP), and 20 healthy controls for comparison.

Results

The median number of intrahost variants was 1–4 in SARS–CoV-2–infected patients, ranged from 0 to 51 in different samples. The distribution of variants on genes was similar to those observed in the population data. However, very few intrahost variants were observed in the population as polymorphisms, implying either a bottleneck or purifying selection involved in the transmission of the virus, or a consequence of the limited diversity represented in the current polymorphism data. Although current evidence did not support the transmission of intrahost variants in a possible person-to-person spread, the risk should not be overlooked. Microbiotas in SARS–CoV-2–infected patients were similar to those in CAP, either dominated by the pathogens or with elevated levels of oral and upper respiratory commensal bacteria.

Conclusion

SARS–CoV-2 evolves in vivo after infection, which may affect its virulence, infectivity, and transmissibility. Although how the intrahost variant spreads in the population is still elusive, it is necessary to strengthen the surveillance of the viral evolution in the population and associated clinical changes.

(See the Editorial Commentary by Wertheim on pages 721–2.)

Since the outbreak of a novel coronavirus, severe acute respiratory syndrome (SARS)– coronavirus (CoV) 2, in Wuhan, China, the virus had spread to >20 countries, resulting in >75 000 cases and >2300 deaths (until 22 February 2020) [1, 2]. The basic reproduction number was estimated to range from 2.2 to 3.5 at the early stage [3], making it a severe threat to public health. Recent studies have identified bats as the possible origin of SARS–CoV-2, and the virus likely uses the same cell surface receptor as SARS–CoV [4], namely, angiotensin-converting enzyme 2. These studies have advanced our understanding of SARS–CoV-2. However, our knowledge of this virus is still limited.

The virus undergoes a strong immunologic pressure in humans, and it may thus accumulate mutations to outmaneuver the immune system [5]. These mutations could result in changes in viral virulence, infectivity, and transmissibility [6]. Therefore, it is imperative to investigate the pattern and frequency of mutations occurred. Aside from the pathogen, the lung microbiota is associated with disease susceptibility and severity [7]. Alterations in lung microbiota could potentially modify immune response against the viral and secondary bacterial infection [8, 9]. Thus, understanding the microbiota, which comprises bacteria that could cause secondary infection or exert effects on the mucosal immune system, might help predict the outcome and reduce complications.

In the current study, we conducted metatranscriptome sequencing on bronchoalveolar lavage fluid (BALF) samples from 8 patients with CoV disease 2019 (COVID-19, the disease caused by SARS–CoV-2). We found that the number of intrahost variants ranged from 0 to 51 with a median number of 4, suggesting a high evolution rate of the virus. By investigating a possible person-to-person spread event, we found no evidence for the transmission of intrahost variants. Meanwhile, we found no specific microbiota alteration in BALF samples from patients with COVID-19, compared with patients with community-acquired pneumonia (CAP) with other suspected viral causes.

METHODS

Patients, Controls, and Sample Collection

Eight COVID-19 pneumonia samples were collected from hospitals in Wuhan on December 30, 2019 through January 1, 2020; 25 viruslike CAP samples were collected from Beijing Peking University People’s Hospital, the Shenzhen Third People’s Hospital, Fujian Provincial Hospital, and the First Affiliated Hospital of Xi’an Jiaotong University between 2014 and 2018. CAP was diagnosed following the guidelines of the Infectious Diseases Society of America and the American Thoracic Society [10]. Patients with pneumonia who have chronic pulmonary diseases were excluded. Meanwhile, BALF samples from 20 healthy volunteers were collected and used as healthy controls. Demographic information and clinical information were included in Supplementary Table 1.

For each patient, BALF samples were collected using a bronchoscope as part of normal clinical management. The volume of the samples ranged from 5 to 30 mL; most were used for bacterial culture, and the rest were aliquoted and stored at ‒80ºC before processing.

Metatranscriptome Sequencing

A 200-μL aliquot of each SARS–CoV-2–infected whole-BALF sample was used to extract RNA, using a Direct-zol RNA Miniprep kit (Zymo Research) and Trizol LS (Thermo Fisher Scientific) in a biosafety level 3 laboratory, and the rest of the samples were treated according to the same protocol in a biosafety level 2 laboratory. The RNA was then reverse-transcribed, amplified using an Ovation Trio RNA-Seq library preparation kit (NuGEN). and was sequenced on an Illumina HiSeq 2500/4000 platform (Illumina).

Data Availability

The viral sequencing data reported in this article have been deposited in the Genome Warehouse in the National Genomics Data Center [11] (under project PRJCA002202, publicly accessible at https://bigd.big.ac.cn/gsa). Meanwhile, the data have also been submitted to the National Center for Biotechnology Information Sequence Read Archive database (project PRJNA605907). The data of CAP and healthy controls was kindly shared by our collaborators, the microbiota composition data is available from the corresponding author on request.

Data Processing and Taxonomic Assignments

Quality control processes included adapter trimming, low-quality reads removal, and short reads removal with fastp software (-l 70, -x, -cut-tail, -cut_tail_mean_quality 20; version 0.20.0) [12], low-complexity reads removal with Komplexity software (-F, -k 8, -t 0.2; version November 2019) [13], host removal with bmtagger software (ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/bmtagger) [14], and ribosomal reads removal with SortMeRNA software (version 2.1b) [15].

The resultant reads were mapped against the National Center for Biotechnology Information nucleotide database (version 1 July 2019) using BLAST+ software (version 2.9.0) (-task megablast, -evalue 1 × 10-10, -max_target_seqs 10, -max_hsps 1, -qcov_hsp_perc 60, -perc_identity 60) [16]. Taxonomic assignment was done with MEGAN software (version 6.11.0), using a lowest-common-ancestor algorithm (-ms 100, -supp 0, -me 0.01, -top 10, -mrc 60) [17]. After an overall principal coordinates analysis and permutational multivariate analysis of variance, samples and microorganisms were filtered for further analyses. Samples with <5000 microbial reads were discarded. Microorganisms satisfying the following criteria were considered in the microbiota analysis: (1) archaea, bacteria, fungi, or virus; (2) relative abundance ≥1% in the raw data and filtered data; (3) support by ≥100 reads; (4) abundance >10-fold that in the negative control (NC); (5) no batch effect; (6) abundance not negatively correlated with bacteria titer; and (7) no known contamination.

Detection of Intraindividual Variants

Clean reads were mapped to the reference genome of SARS–CoV-2 (GenBank MN908947.3), using BWA mem software (version 0.7.12) [18]. Duplicate reads were removed using Picard software (http://broadinstitute.github.io/picard; version 2.18.22). The mpileup file was generated using samtools software (version 1.8) [19], and intrahost variants were identified using VarScan software (version 2.3.9) [20] and in-house scripts.

All variants had to satisfy the following requirements: (1) sequencing depth ≥50, (2) minor allele frequency (MAF) ≥5%, (3) MAF≥2% on each strand, (4) minor allele count ≥5 on each strand, (5) minor allele supported by the inner part of the read (excluding 10 base pairs on each end), and (6) both alleles identified in ≥3 reads that specifically mapped to the genome of Betacoronavirus. For comparison with the polymorphism in the population, we obtained 110 sequences from the Global Initiative on Sharing All Influenza Data (GISAID) (www.gisaid.org) [21, 22]. The accession numbers and acknowledgments are included in Supplementary Table 2.

Statistical Analysis

Pearson × 2 or Fisher exact tests were used for categorical variables, and Mann-Whitney U or Kruskal-Wallis rank sum tests for continuous variables that do not follow a normal distribution. Microbiotas were compared using permutational multivariate analysis of variance.

Ethical Approval and Consents

This study was approved by the National Health Commission of China and Ethics Commission of the Jin Yin-tan Hospital of Wuhan (No. KY-2020-01.01). The requirement for written informed consent was waived given the context of emerging infectious diseases. The study of the CAP and healthy controls was approved by the Institutional Review Board of the Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College (IPB-2018-3). Written informed consent was obtained from all pneumonia patients and healthy controls.

RESULTS

Data Summary

By metatranscriptome sequencing, >20 million reads were generated for each BALF sample from patients with COVID-19 (nCoV) as well as an NC (nuclease-free water). For comparison, the metatranscriptome sequencing data with similar number of reads from 25 patients with viruslike CAP (determined by ≥100 viral reads and 10-fold higher than those in the NC), 20 healthy controls without any known pulmonary diseases, and 2 extra NCs (2 saline solutions passing through the bronchoscope) were used in this study. Demographic and clinical information was collected and is summarized in Supplementary Table 1.

After quality control, a median number of 55 571 microbial reads was generated for each sample. nCoV samples had the highest proportion of microbial reads (median proportion, 7% for nCoV vs 0.8% for CAP and 0.1% for healthy controls; P < .001) (Figure 1A), and 49% of the total number of microbial reads could be mapped to SARS–CoV-2, which did not differ from the viral proportion in CAP (Figure 1B). SARS-CoV-2 was the only Betacoronavirus identified in nCoV samples. Moreover, besides the detection of human coronavirus OC43 (HCoV-OC43) in 1 healthy control and human coronavirus HKU1 (HCoV-HKU1) in 1 patient with CAP, no other samples showed any signal of Betacoronavirus, which proved the authenticity of the data and methods used in our analysis.

Overview of sequencing data. A, Proportions of microbial reads in different groups. *P < .05; †P < .01; ‡P < .001; §P < .0001. B, Proportions of viral reads in patients infected with different viruses. Abbreviations: CAP, community acquired pneumonia; HAdVs, human adenovirus; HKU1, human coronaviruse HKU1; HRV, human rhinovirus; MeV, measles virus; NC, negative control; RSV, respiratory syncytial virus; SARS–CoV-2, severe acute respiratory syndrome–coronavirus 2.
Figure 1.

Overview of sequencing data. A, Proportions of microbial reads in different groups. *P < .05; †P < .01; ‡P < .001; §P < .0001. B, Proportions of viral reads in patients infected with different viruses. Abbreviations: CAP, community acquired pneumonia; HAdVs, human adenovirus; HKU1, human coronaviruse HKU1; HRV, human rhinovirus; MeV, measles virus; NC, negative control; RSV, respiratory syncytial virus; SARS–CoV-2, severe acute respiratory syndrome–coronavirus 2.

Intrahost Variants in the Genome of Patients With SARS–CoV-2

The sequencing depth of SARS–CoV-2 ranged from 18-fold in patient nCoV5 to 32 291-fold in patient nCoV1, with >80% of the genome covered by ≥50-fold in 5 samples (Figure 2A and Supplementary Table 3). In total, 84 intrahost variants were identified with MAF >5%, and 25 with MAF >20% (Supplementary Table 4 and Figure 2B); the sample from patient nCoV5 was excluded from the analysis owing to large gaps in its genome coverage.

Intrahost variants in the severe acute respiratory syndrome–coronavirus 2 (SARS–CoV-2) genome. A, Genome coverage for SARS–CoV-2. Dashed line indicates the sequencing depth of 50. B, Frequency distribution of all intrahost variants; the frequencies of different mutations in polymorphism data are shown on the right. C, Distribution of the intrahost variations and polymorphisms on the SARS–CoV-2 genome. The outer ring displays the structure of the genome, followed by the distribution of polymorphisms on the genome. The length of each line represents the number of sequences with this mutation. Owing to a large variation in the number (1–27), 5% of the line length was added for each additional sequence. Inner rings represent the distribution of intrahost variants in different patients (rings are labeled with patient numbers). Red bars indicate synonymous mutations; blue bars, nonsynonymous mutations. D, Frequency of the mutant allele at each high-level (frequency ≥20%) intrahost variant position. Nucleotides at the position (reference allele/alternative allele), mutation type (nonsynonymous, synonymous, noncoding), gene name, amino acid change are shown to the right of the heat map, and the total number of variants (with frequency ≥20%) in each sample is shown above it. The patient/sample numbers for the 5 samples with >80% of the genome covered by ≥50-fold are highlighted in blue. Open circles represent samples with a sequencing depth <50-fold at this position.
Figure 2.

Intrahost variants in the severe acute respiratory syndrome–coronavirus 2 (SARS–CoV-2) genome. A, Genome coverage for SARS–CoV-2. Dashed line indicates the sequencing depth of 50. B, Frequency distribution of all intrahost variants; the frequencies of different mutations in polymorphism data are shown on the right. C, Distribution of the intrahost variations and polymorphisms on the SARS–CoV-2 genome. The outer ring displays the structure of the genome, followed by the distribution of polymorphisms on the genome. The length of each line represents the number of sequences with this mutation. Owing to a large variation in the number (1–27), 5% of the line length was added for each additional sequence. Inner rings represent the distribution of intrahost variants in different patients (rings are labeled with patient numbers). Red bars indicate synonymous mutations; blue bars, nonsynonymous mutations. D, Frequency of the mutant allele at each high-level (frequency ≥20%) intrahost variant position. Nucleotides at the position (reference allele/alternative allele), mutation type (nonsynonymous, synonymous, noncoding), gene name, amino acid change are shown to the right of the heat map, and the total number of variants (with frequency ≥20%) in each sample is shown above it. The patient/sample numbers for the 5 samples with >80% of the genome covered by ≥50-fold are highlighted in blue. Open circles represent samples with a sequencing depth <50-fold at this position.

Notably, the number of variants was not associated with the sequencing depth (Supplementary Figure 1). The overall Ka/Ks ratio (The ratio of the rate of nonsynonymous mutations to the rate of synonymous mutations) was significantly smaller than 1, which was similar for intrahost variants and the polymorphisms observed in the population data, suggesting a purifying selection acting on both types of mutations (Table 1). The numbers of variants observed in the gene were proportional to gene lengths (correlation between the gene length and the number of mutations in the gene, 0.950 for the intrahost variant and 0.957 for the polymorphisms; both P < .001). Although only a small fraction of the variants were observed in multiple patients (2 of 84 variants; Figure 2C), some positions were more prone to mutate or variants were transmitting in the population, such as position 10 779, where the mutant allele A was observed in all 7 patients, with frequencies ranging from 15% to 100% (Figure 2D).

Table 1.

The Number of Intrahost Variants and Polymorphisms in the Genome of Severe Acute Respiratory Syndrome–Coronavirus 2

Intrahost VariantsPolymorphisms
GeneGene Length (Base Pairs)NSSKa/KsaNSSKa/KsP Valueb
orf1a13 2033090.67634140.493c.63
orf1b8088850.3551080.277c>.99
S3822830.5991060.375.69
ORF3a828210.561420.561>.99
E22800NA00NA>.99
M66920NA110.318>.99
ORF618610NA00NA>.99
ORF7a36620NA30NA>.99
ORF8366220.228210.456>.99
N1260721.048570.214c.18
ORF1011700NA10NA>.99
Total29 13362220.578c70390.368c.16
Intrahost VariantsPolymorphisms
GeneGene Length (Base Pairs)NSSKa/KsaNSSKa/KsP Valueb
orf1a13 2033090.67634140.493c.63
orf1b8088850.3551080.277c>.99
S3822830.5991060.375.69
ORF3a828210.561420.561>.99
E22800NA00NA>.99
M66920NA110.318>.99
ORF618610NA00NA>.99
ORF7a36620NA30NA>.99
ORF8366220.228210.456>.99
N1260721.048570.214c.18
ORF1011700NA10NA>.99
Total29 13362220.578c70390.368c.16

Abbreviation: NA, not applicable.

aKa/Ks ratios (the ratio of the rate of nonsynonymous mutations to the rate of synonymous mutations) were calculated using KaKs_Calculator2.0 software (MS model) [23].

bSignificance of difference between Ka/Ks ratios for 2 types of mutation in the gene (Fisher exact test).

cKa/Ks differs significantly from 1 (P < .05).

Table 1.

The Number of Intrahost Variants and Polymorphisms in the Genome of Severe Acute Respiratory Syndrome–Coronavirus 2

Intrahost VariantsPolymorphisms
GeneGene Length (Base Pairs)NSSKa/KsaNSSKa/KsP Valueb
orf1a13 2033090.67634140.493c.63
orf1b8088850.3551080.277c>.99
S3822830.5991060.375.69
ORF3a828210.561420.561>.99
E22800NA00NA>.99
M66920NA110.318>.99
ORF618610NA00NA>.99
ORF7a36620NA30NA>.99
ORF8366220.228210.456>.99
N1260721.048570.214c.18
ORF1011700NA10NA>.99
Total29 13362220.578c70390.368c.16
Intrahost VariantsPolymorphisms
GeneGene Length (Base Pairs)NSSKa/KsaNSSKa/KsP Valueb
orf1a13 2033090.67634140.493c.63
orf1b8088850.3551080.277c>.99
S3822830.5991060.375.69
ORF3a828210.561420.561>.99
E22800NA00NA>.99
M66920NA110.318>.99
ORF618610NA00NA>.99
ORF7a36620NA30NA>.99
ORF8366220.228210.456>.99
N1260721.048570.214c.18
ORF1011700NA10NA>.99
Total29 13362220.578c70390.368c.16

Abbreviation: NA, not applicable.

aKa/Ks ratios (the ratio of the rate of nonsynonymous mutations to the rate of synonymous mutations) were calculated using KaKs_Calculator2.0 software (MS model) [23].

bSignificance of difference between Ka/Ks ratios for 2 types of mutation in the gene (Fisher exact test).

cKa/Ks differs significantly from 1 (P < .05).

The number of intrahost variants per individual showed a large variation (0–51 [median, 4] vs 0–19 [median, 1] for variants with MAF ≥5% vs MAF ≥20%), which could not be explained by the batch effect, coverage variance, or contamination (Supplementary Figure 1) (samples from patients nCoV1–4 were in one batch, with nCoV5–8 in another; most mutations were not observed in the population data). We also noted that the number of variations was not relevant to the days after symptom onset or the age of patients (Supplementary Figure 2). Collectively, we did not find any reason for the high level of variants in the sample from patient nCoV6 (51 variants). A larger population size is needed to investigate how frequent such outliers are, and whether they are associated with the level of host immune response or the viral replication rate. We also noted similar outliers for other viruses [24]. Of note, the origin of variants could be either mutation occurring in vivo after infection or multiple transmitted SARS–CoV-2 strains.

No Evidence for Transmission of Intrahost Variants Between Samples

Among the 8 patients with COVID-19, patients nCoV4 and nCoV7 were from the same household, with dates of symptom onset differing by 5 days; thus, transmission from patient nCoV4 to patient nCoV7 is highly suspected, especially considering that only patient nCoV4 had been to the Huanan seafood market in Wuhan, the starting point of the outbreak and the suspected source. First, the consensus sequence of the virus was the same for 2 samples, and none of the 4 intrahost variants meeting the selection criteria in patient nCoV4 were detected in patient nCoV7 (Table 2). We further expanded the investigation to all variants with MAF ≥2% and supported by ≥3 reads. By doing so, we detected 7 variants (of 25) shared between the 2 samples. However, the MAF in both patients nCoV4 and nCoV7 were similar to those in other samples, suggesting that these positions were either error prone or mutation prone; hence, they cannot support the transmission of these variants.

Table 2.

Allele Frequency Changes in Transmission of Severe Acute Respiratory Syndrome–Coronavirus 2 From Patient nCoV4 to Patient nCoV7

Patient nCoV4Patient nCoV7
PositionReferenceaAlternativebFrequencyReferenceaAlternativebFrequencyP Valuec
376d1191770.598900.000<.001
769777170.0211600.000>.99
20371496330.022800.000>.99
329022491120.0471700.000>.99
330615231370.0831700.000.39
33211232290.0231600.000>.99
4511685260.0371100.000>.99
4518710160.022800.000>.99
10 77114161850.1162430.111>.99
10 7731467480.0322410.040.56
10 7799874010.2891880.308.83
10 814581530.0841510.063>.99
11 387653150.022400.000>.99
13 6931237380.030900.000>.99
15 6821321460.034810.111.27
15 6851342470.034810.111.27
18 49917831080.0571900.000.62
18 6991013460.0431200.000>.99
21 641520240.044500.000>.99
22 2702282550.0242720.069.15
23 127d11511760.1331900.000.16
26 177492110.022600.000>.99
27 493d153515540.5034000.000<.001
28 253d36004870.1193400.000.03
29 39846711270.0264700.000.64
Total34 84239680.102421170.028<.001
Patient nCoV4Patient nCoV7
PositionReferenceaAlternativebFrequencyReferenceaAlternativebFrequencyP Valuec
376d1191770.598900.000<.001
769777170.0211600.000>.99
20371496330.022800.000>.99
329022491120.0471700.000>.99
330615231370.0831700.000.39
33211232290.0231600.000>.99
4511685260.0371100.000>.99
4518710160.022800.000>.99
10 77114161850.1162430.111>.99
10 7731467480.0322410.040.56
10 7799874010.2891880.308.83
10 814581530.0841510.063>.99
11 387653150.022400.000>.99
13 6931237380.030900.000>.99
15 6821321460.034810.111.27
15 6851342470.034810.111.27
18 49917831080.0571900.000.62
18 6991013460.0431200.000>.99
21 641520240.044500.000>.99
22 2702282550.0242720.069.15
23 127d11511760.1331900.000.16
26 177492110.022600.000>.99
27 493d153515540.5034000.000<.001
28 253d36004870.1193400.000.03
29 39846711270.0264700.000.64
Total34 84239680.102421170.028<.001

aNumber of reads supporting reference allele.

bNumber of reads supporting alternative (mutant) allele.

cSignificance of difference in allele frequency between patient nCoV4 and patient nCoV7 (Fisher exact test).

dIntrahost variant positions with minor allele frequency ≥5% and meeting selection criteria.

Table 2.

Allele Frequency Changes in Transmission of Severe Acute Respiratory Syndrome–Coronavirus 2 From Patient nCoV4 to Patient nCoV7

Patient nCoV4Patient nCoV7
PositionReferenceaAlternativebFrequencyReferenceaAlternativebFrequencyP Valuec
376d1191770.598900.000<.001
769777170.0211600.000>.99
20371496330.022800.000>.99
329022491120.0471700.000>.99
330615231370.0831700.000.39
33211232290.0231600.000>.99
4511685260.0371100.000>.99
4518710160.022800.000>.99
10 77114161850.1162430.111>.99
10 7731467480.0322410.040.56
10 7799874010.2891880.308.83
10 814581530.0841510.063>.99
11 387653150.022400.000>.99
13 6931237380.030900.000>.99
15 6821321460.034810.111.27
15 6851342470.034810.111.27
18 49917831080.0571900.000.62
18 6991013460.0431200.000>.99
21 641520240.044500.000>.99
22 2702282550.0242720.069.15
23 127d11511760.1331900.000.16
26 177492110.022600.000>.99
27 493d153515540.5034000.000<.001
28 253d36004870.1193400.000.03
29 39846711270.0264700.000.64
Total34 84239680.102421170.028<.001
Patient nCoV4Patient nCoV7
PositionReferenceaAlternativebFrequencyReferenceaAlternativebFrequencyP Valuec
376d1191770.598900.000<.001
769777170.0211600.000>.99
20371496330.022800.000>.99
329022491120.0471700.000>.99
330615231370.0831700.000.39
33211232290.0231600.000>.99
4511685260.0371100.000>.99
4518710160.022800.000>.99
10 77114161850.1162430.111>.99
10 7731467480.0322410.040.56
10 7799874010.2891880.308.83
10 814581530.0841510.063>.99
11 387653150.022400.000>.99
13 6931237380.030900.000>.99
15 6821321460.034810.111.27
15 6851342470.034810.111.27
18 49917831080.0571900.000.62
18 6991013460.0431200.000>.99
21 641520240.044500.000>.99
22 2702282550.0242720.069.15
23 127d11511760.1331900.000.16
26 177492110.022600.000>.99
27 493d153515540.5034000.000<.001
28 253d36004870.1193400.000.03
29 39846711270.0264700.000.64
Total34 84239680.102421170.028<.001

aNumber of reads supporting reference allele.

bNumber of reads supporting alternative (mutant) allele.

cSignificance of difference in allele frequency between patient nCoV4 and patient nCoV7 (Fisher exact test).

dIntrahost variant positions with minor allele frequency ≥5% and meeting selection criteria.

Meanwhile, among all 84 intrahost variants, only 3 were found to be polymorphic in the population data (position 7866 G/T, 27 493 C/T, 28 253 C/T). This small number of overlap also suggests that intrahost variants were rarely transmitted to other samples. However, we cannot rule out the possibility that the sequence diversity in the population is underestimated by the current database.

Missing Microbiota Signature Associated With SARS–CoV-2 Infection

Metatranscriptome data also enabled us to profile the transcriptionally active microbiota in different types of pneumonia, which is associated with the immunity response in the lung [25, 26]. In general, a significant difference in microbiota composition was observed among the nCoV, CAP, and healthy groups (R2 = 0.07; P = .001) (Figure 3A). However, the clustering of some samples with the NC indicated a barren microbiota in some samples. After removing the problematic samples and ambiguous components, we still found that nCoV and CAP samples both differed from the healthy control samples (nCoV vs healthy controls, R2 = 0.45 and P = .001; CAP vs healthy controls, R2 = 0.10 and P = .002), implying that dysbiosis occurred in their lung microbiota.

Microbiota in bronchoalveolar lavage fluid samples from patients with coronavirus disease 2019 (COVID-19), patients with community-acquired pneumonia (CAP), and healthy controls. A, Principal coordinates analysis of all samples. B, Heat map of microbiota composition after quality control filtering (filters were described in Methods). The CAP samples were labeled as virus names followed by numbers; patients with COVID-19 were represesnted by black rectangles, and 2 co-occurring bacterial clusters, by red rectangles. The names of all viruses are in blue, with contaminant genera reported by Salter and colleagues [27] in red. Abbreviations: EV, enterovirus; Flu, influenza; HAdVs, human adenovirus; HC, healthy control; HRV, human rhinovirus; MeV, measles virus; SARS–CoV-2, severe acute respiratory syndrome–coronavirus 2; PERMANOVA, permutational multivariate analysis of variance; RSV, respiratory syncytial virus.
Figure 3.

Microbiota in bronchoalveolar lavage fluid samples from patients with coronavirus disease 2019 (COVID-19), patients with community-acquired pneumonia (CAP), and healthy controls. A, Principal coordinates analysis of all samples. B, Heat map of microbiota composition after quality control filtering (filters were described in Methods). The CAP samples were labeled as virus names followed by numbers; patients with COVID-19 were represesnted by black rectangles, and 2 co-occurring bacterial clusters, by red rectangles. The names of all viruses are in blue, with contaminant genera reported by Salter and colleagues [27] in red. Abbreviations: EV, enterovirus; Flu, influenza; HAdVs, human adenovirus; HC, healthy control; HRV, human rhinovirus; MeV, measles virus; SARS–CoV-2, severe acute respiratory syndrome–coronavirus 2; PERMANOVA, permutational multivariate analysis of variance; RSV, respiratory syncytial virus.

Microbiotas could be classified into 3 types (Figure 3B). In particular, the microbiota in cluster I was dominated by the possible pathogens, whereas the microorganisms in other clusters were more diverse. By further inspecting the species belonging to each cluster (Supplementary Tables 5 and 6), we found that bacteria in type III were mainly commensal species frequently observed in the oral and respiratory tract, whereas bacteria in type II were mostly environmental organisms; contamination was therefore highly suspected. Therefore, the microbiota was either pathogen enriched (type I) or commensal enriched (type III) or undetermined owing to low microbial load (type II).

The microbiotas in 6 nCoV samples were pathogen enriched, and the other 2 were commensal enriched (Figure 3B). Moreover, 2 nCoV samples [2, 6] with an excess number of intrahost SARS–CoV-2 variants both had pathogen-enriched microbiotas. The overwhelming proportion of the virus may associate with a higher replication rate and could also potentially stimulate the intense immune response against the virus, under which circumstance, an excess number of intrahost mutations would be expected. However, because only 8 patients with SARS-CoV-2 were included in this analysis, and the absolute microbial load was unknown, more data are needed for further investigation.

DISCUSSION

RNA viruses have a high mutation rate owing to the lack of proofreading activity of polymerases. Consequently, RNA viruses are prone to evolve resistance to drugs and escape from immune surveillance. The mutation rate of SARS–CoV-2 is still unclear. However, considering that the median number of pairwise sequence differences was 4 (interquartile range, 3–6) for 110 sequences collected between 24 December 2019 and 9 February 2020, the mutation rate should be at the same order of magnitude in SARS-CoV (0.80–2.38 × 10-3 nucleotide substitution per site per year) [28]. The high mutation rate also results in a high level of intrahost variants in RNA viruses [24, 29]. The median number of intrahost variant in patients with COVID-19 was 4 for variants with frequency ≥5%, and this incidence did not differ significantly from that reported in a study on Ebola (655 variants with frequency ≥5% in 134 samples; P > .05) [24].

An exoribonuclease has been proposed to provide proofreading activity in SARS–CoV [30, 31], and we noted that all 3 key motifs in the gene were identical in SARS–CoV and SARS–CoV-2 (Supplementary Figure 3). In addition, neither polymorphism nor intrahost variant was detected in these motifs, suggesting that the gene is highly conserved. Although we did not find any mutation hot-spot genes in either polymorphism or intrahost variants, the observation of shared intrahost variants among different individuals implied the possibility of convergent or adaptive evolution of the virus in patients, which could potentially affect the antigenicity, virulence, and infectivity of the virus [6].

It is worth noting that the SARS–CoV-2 genome in patients could be highly diverse, which was also observed in other viruses [24]. The high diversity could potentially increase the fitness of the viral population, making it hard to be eliminated [29]. Further studies are needed to explore how this may influence the immune response toward the virus and whether there is a selection acting on different strains in the human body or during the transmission. In a single possible transmission event investigated in this study, we found no evidence for the transmission of multiple strains. However, it is unclear whether these intrahost variants occurred before the transmission or after the transmission, which would result in different conclusions. In addition, a bottleneck may be involved in the transmission, which could also decrease diversity [32]. Nevertheless, the observation of high mutation burden in some patients emphasized the possible rapid evolution of this virus, despite that its biological significance is largely unknown.

Recent studies have shown that the microbiota in the lung contributed to the immunological homeostasis and potentially altered susceptibility to viral infection. Meanwhile, the lung microbiota could also be regulated by invading viruses [9, 33]. However, besides finding that microbial diversity was significantly lower in patients with pneumonia than that in healthy controls (Figure 3B), we did not identify any specific microbiota pattern shared among patients with COVID-19, nor among patients with CAP. A possible reason for this could be the use of antibiotics in patients with pneumonia. However, this was not true for all pneumonia samples, because some samples had a substantial proportion of bacteria, including 2 samples from patients with COVID-19.

It is well known that secondary bacterial infection—a common complication of viral infection, especially for respiratory viruses—often significantly increases morbidity rates [34]. Thus, the elevated level of bacteria in the BALF samples from some patients with COVID-19 might increase the risk of secondary infection. In the clinical data, the secondary infection rate for COVID-19 was between 1% and 10% [2, 35]. However, the quantitative relationship between bacterial relative abundance/titer and infection is unclear.

Overall, our study has revealed the evolution of SARS–CoV-2 in the patient, a common feature shared by most RNA viruses. How these variants influence the fitness of viruses and genetic diversity in the population awaits further investigation. Currently, only limited sequences are shared in public databases (Supplementary Table 2); hence, there is an urgent need to accumulate more sequences to trace the evolution of the viral genome and associate the changes with clinical symptoms and outcomes.

Supplementary Data

Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.

Notes

Acknowledgments. The authors thank Prof. Xue Yongbiao and colleagues from the National Genomics Data Center for helpful discussion and computational resource support. We thank Prof. Huang Yanyi (Peking University, Beijing, China) and Prof. Wang Jianbin (Tsinghua University, Beijing) for providing the sequencing platform. We also thank Prof. Huang Chaolin for assistance with sample collection. We gratefully acknowledge the researchers who generated and shared the sequencing data through the Global Initiative on Sharing All Influencza Data (GISAID, https://www.gisaid.org/), on which some of our analysis is based. (A full list of all submitters is provided in Supplementary Table 2.)

Financial support. This work was supported by grants from the Chinese Academy of Medical Sciences (CAMS) Innovation Fund for Medical Sciences (grant 2016-I2M-1-014), the National Major Science & Technology Project for Control and Prevention of Major Infectious Diseases in China (grants 2017ZX10103004, 2018ZX10305409, 2018ZX10301401, and 2018ZX10732401), the National Natural Science Foundation of China (grant 31670169), and the Open Project of Key Laboratory of Genomic and Precision Medicine, Chinese Academy of Sciences.

Potential conflicts of interest. The authors: No reported conflicts of interest. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest.

References

1.

Zhu
N
,
Zhang
D
,
Wang
W
, et al.
A novel coronavirus from patients with pneumonia in China, 2019
.
N Engl J Med
2020
;
382
:
727
33
.

2.

Huang
C
,
Wang
Y
,
Li
X
, et al.
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
.
Lancet
2020
;
395
:
497
506
.

3.

Zhao
S
,
Lin
Q
,
Ran
J
, et al.
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: a data-driven analysis in the early phase of the outbreak
.
Int J Infect Dis
2020
;
92
:
214
7
.

4.

Zhou
P
,
Yang
X
,
Wang
X
, et al.
A pneumonia outbreak associated with a new coronavirus of probable bat origin
.
Nature
2020
; 579:270–3.

5.

Lucas
M
,
Karrer
U
,
Lucas
A
,
Klenerman
P
.
Viral escape mechanisms–escapology taught by viruses
.
Int J Exp Pathol
2001
;
82
:
269
86
.

6.

Berngruber
TW
,
Froissart
R
,
Choisy
M
,
Gandon
S
.
Evolution of virulence in emerging epidemics
.
PLoS Pathog
2013
;
9
:
e1003209
.

7.

O’Dwyer
DN
,
Dickson
RP
,
Moore
BB
.
The lung microbiome, immunity, and the pathogenesis of chronic lung disease
.
J Immunol
2016
;
196
:
4839
47
.

8.

Hanada
S
,
Pirzadeh
M
,
Carver
KY
,
Deng
JC
.
Respiratory viral infection-induced microbiome alterations and secondary bacterial pneumonia
.
Front Immunol
2018
;
9
:
2640
.

9.

Huffnagle
GB
,
Dickson
RP
,
Lukacs
NW
.
The respiratory tract microbiome and lung inflammation: a two-way street
.
Mucosal Immunol
2017
;
10
:
299
306
.

10.

Mandell
LA
,
Wunderink
RG
,
Anzueto
A
, et al. ;
Infectious Diseases Society of America; American Thoracic Society
.
Infectious Diseases Society of America/American Thoracic Society consensus guidelines on the management of community-acquired pneumonia in adults
.
Clin Infect Dis
2007
;
44(suppl 2)
:
S27
72
.

11.

National Genomics Data Center Members and Partners
.
Database resources of the national genomics data center in 2020
.
Nucleic Acids Res
2020
;
48
:
D24
D33
.

12.

Chen
S
,
Zhou
Y
,
Chen
Y
,
Gu
J
.
fastp: An ultra-fast all-in-one FASTQ preprocessor
.
Bioinformatics
2018
;
34
:
i884
90
.

13.

Clarke
EL
,
Taylor
LJ
,
Zhao
C
, et al.
Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments
.
Microbiome
2019
;
7
:
46
.

14.

Wang
J
,
Wang
W
,
Li
R
, et al.
The diploid genome sequence of an Asian individual
.
Nature
2008
;
456
:
60
5
.

15.

Kopylova
E
,
Noé
L
,
Touzet
H
.
SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data
.
Bioinformatics
2012
;
28
:
3211
7
.

16.

Camacho
C
,
Coulouris
G
,
Avagyan
V
, et al.
BLAST+: architecture and applications
.
BMC Bioinformatics
2009
;
10
:
421
.

17.

Huson
DH
,
Beier
S
,
Flade
I
, et al.
MEGAN community edition - interactive exploration and analysis of large-scale microbiome sequencing data
.
PLoS Comput Biol
2016
;
12
:
e1004957
.

18.

Li
H
.
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
.
arXiv
2013
;
3
:
13033997
.

19.

Li
H
,
Handsaker
B
,
Wysoker
A
, et al. ;
1000 Genome Project Data Processing Subgroup
.
The sequence alignment/map format and SAMtools
.
Bioinformatics
2009
;
25
:
2078
9
.

20.

Koboldt
DC
,
Zhang
Q
,
Larson
DE
, et al.
VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
.
Genome Res
2012
;
22
:
568
76
.

21.

Elbe
S
,
Buckland-Merrett
G
.
Data, disease and diplomacy: GISAID’s innovative contribution to global health
.
Glob Chall
2017
;
1
:
33
46
.

22.

Shu
Y
,
McCauley
J
.
GISAID: Global Initiative on Sharing All Influenza Data—from vision to reality
.
Euro Surveill
2017
;
22
:
30494
.

23.

Zhang
Z
,
Li
J
,
Zhao
XQ
,
Wang
J
,
Wong
GK
,
Yu
J
.
KaKs_Calculator: calculating Ka and Ks through model selection and model averaging
.
Genomics Proteomics Bioinformatics
2006
;
4
:
259
63
.

24.

Ni
M
,
Chen
C
,
Qian
J
, et al.
Intra-host dynamics of Ebola virus during 2014
.
Nat Microbiol
2016
;
1
:
16151
.

25.

Ren
L
,
Zhang
R
,
Rao
J
, et al.
Transcriptionally active lung microbiome and its association with bacterial biomass and host inflammatory status
.
mSystems
2018
;
3
:
e00199-18
.

26.

Segal
LN
,
Clemente
JC
,
Tsay
JC
, et al.
Enrichment of the lung microbiome with oral taxa is associated with lung inflammation of a Th17 phenotype
.
Nat Microbiol
2016
;
1
:
16031
.

27.

Salter
SJ
,
Cox
MJ
,
Turek
EM
, et al.
Reagent and laboratory contamination can critically impact sequence-based microbiome analyses
.
BMC Biol
2014
;
12
:
87
.

28.

Zhao
Z
,
Li
H
,
Wu
X
, et al.
Moderate mutation rate in the SARS coronavirus genome and its implications
.
BMC Evol Biol
2004
;
4
:
21
.

29.

Domingo
E
,
Sheldon
J
,
Perales
C
.
Viral quasispecies evolution
.
Microbiol Mol Biol Rev
2012
;
76
:
159
216
.

30.

Minskaia
E
,
Hertzig
T
,
Gorbalenya
AE
, et al.
Discovery of an RNA virus 3’→5’ exoribonuclease that is critically involved in coronavirus RNA synthesis
.
Proc Natl Acad Sci U S A
2006
;
103
:
5108
13
.

31.

Smith
EC
,
Blanc
H
,
Surdel
MC
,
Vignuzzi
M
,
Denison
MR
.
Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics
.
PLoS Pathog
2013
;
9
:
e1003565
.

32.

Sigal
D
,
Reid
JNS
,
Wahl
LM
.
Effects of transmission bottlenecks on the diversity of influenza A virus
.
Genetics
2018
;
210
:
1075
88
.

33.

Tsang
TK
,
Lee
KH
,
Foxman
B
, et al.
Association between the respiratory microbiome and susceptibility to influenza virus infection
.
Clin Infect Dis
2019
;ciz968. doi:10.1093/cid/ciz968.

34.

Hendaus
MA
,
Jomha
FA
,
Alhammadi
AH
.
Virus-induced secondary bacterial infection: a concise review
.
Ther Clin Risk Manag
2015
;
11
:
1265
71
.

35.

Chen
N
,
Zhou
M
,
Dong
X
, et al.
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
.
Lancet
2020
;
395
:
507
13
.

Author notes

Z. S., Y. X., L. K., and W. M. contributed equally to this manuscript.

L. R. and M. L. contributed equally to this manuscript.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)