Emanuel Schmid-Siegert, Sophie Richard, Amanda Luraschi, Konrad Mühlethaler, Marco Pagni, Philippe M Hauser, Expression Pattern of the Pneumocystis jirovecii Major Surface Glycoprotein Superfamily in Patients with Pneumonia, The Journal of Infectious Diseases, Volume 223, Issue 2, 15 January 2021, Pages 310–318, https://doi.org/10.1093/infdis/jiaa342
Abstract
The human pathogen Pneumocystis jirovecii harbors 6 families of major surface glycoproteins (MSGs) encoded by a single gene superfamily. MSGs are presumably responsible for antigenic variation and adhesion to host cells. The genomic organization suggests that a single member of family I is expressed at a given time per cell, whereas members of the other families are simultaneously expressed.
We analyzed RNA sequences expressed in several clinical samples, using specific weighted profiles for sorting of reads and calling of single-nucleotide variants to estimate the diversity of the expressed genes.
A number of different isoforms of at least 4 MSG families were expressed simultaneously, including isoforms of family I, for which confirmation was obtained in the wet laboratory.
These observations suggest that every single P. jirovecii population is made of individual cells with distinct surface properties. Our results enhance our understanding of the unique antigenic variation system and cell surface structure of P. jirovecii.
The fungal genus Pneumocystis belongs to the subphylum Taphrinomycotina of the Ascomycota [1]. It encompasses extracellular parasites that colonize the lungs of mammals [2]. The species infecting humans is Pneumocystis jirovecii, whereas rats and mice harbor Pneumocystis carinii and Pneumocystis murina, respectively. Should the human immune system decline, P. jirovecii can become an opportunistic pathogen that causes severe pneumonia, which can be lethal if not treated (Pneumocystis pneumonia [PCP]). An in vitro method of long-term culture for Pneumocystis species is still not available.
The lack of a culture method complicates the study of P. jirovecii pathogenicity. Its major surface glycoproteins (MSGs) potentially constitute a crucial factor in colonization and/or virulence. These proteins are thought to generate surface antigenic variation allowing escape from the human immune system [3–5]. In addition, the MSGs are thought to be involved in adhesion to host cells [6, 7]. MSGs are encoded by 6 families of hypervariable genes located at all subtelomeric regions of the approximately 20 chromosomes of P. jirovecii [8–10]. These families form the largest surface protein superfamily known among fungi with approximately 160 msg genes per genome (families I, II, III, IV, V, and VI include, respectively, approximately 85, 20, 10, 20, 20, and 5 genes encoding isoforms; family IV is the only one potentially not anchored in the cell surface because of the lack of a glycosylphosphatidylinositol signal).
MSGs in family I are thought to be the most abundant at the P. jirovecii cell surface [11], although probably only 1 of the approximately 80 msg-I genes present in the genome is expressed at a given time in each cell. Two observations support this mutually exclusive expression. First, it would rely on the expression of a single gene under the control of a transcription promoter that is present at a single copy per genome (within an upstream conserved sequence [UCS]), whereas the other genes of family I have no promoter. Second, Pneumocystis cells are mostly haploid, except transiently during the sexual cycle [12–15]. However, at the population level, several different msg-I isoforms linked to the UCS at the DNA level are observed, and thus presumably expressed [9, 16]. In P. carinii, the diversity of msg-I isoforms expressed has also been observed at the RNA level [17], as well as at the protein level, which revealed a focal distribution of the epitopes within the lung [18]. The exchange of the expressed msg-I isoform would occur through recombination at a 33-base pair (bp) sequence that is present both at the end of the UCS, which includes the promoter, and at the beginning of each msg-I gene (the conserved recombination junction element).
By contrast, each gene of the other 5 MSG families II–VI possesses its own promoter, allowing potentially independent and simultaneous expression [9]. In addition to the exchange of the msg-I isoform expressed, antigenic variation is thought to rely on recombinations between the genes of each MSG family that generate gene mosaicism [3, 9, 19]. We propose a model for the surface antigenic variation system of P. jirovecii, consisting of the continuous segregation of new subpopulations that are antigenically different (Figure 1).

Figure 1. Model for the antigenic variation system of Pneumocystis jirovecii. Only 4 chromosome ends of approximately 40 are shown in each cell. The fungus continuously segregates new cells, each expressing a new msg-I isoform, as well as all mosaic isoforms of the other major surface glycoprotein families, except possibly family VI (see text). Consequently, each P. jirovecii population is subdivided into several subpopulations that are antigenically different and that could multiply or not. The single msg-I isoform is expressed at a high level, whereas the isoforms of the other families are transcribed at a low level. The proximal location of the msg-I genes within the subtelomeres, closest to the telomeres, suggests that the exchange of the expressed isoform might be facilitated by the concomitant exchange of the telomere through a single recombination between 2 conserved recombination junction elements (CRJEs). The recombinations between the msg genes of each family that generate gene mosaicism contributing to antigenic variation are not shown. Abbreviation: UCS, upstream conserved sequence.
The aim of the present study was to challenge this model by investigating the expression of the different P. jirovecii MSG families at the RNA level in clinical samples of patients with PCP. However, the classic approach of mapping RNA sequencing (RNAseq) reads onto genomic sequences was inadequate, for the following reasons. First, ambiguous mapping of the RNAseq reads and thus erroneous signals could result from both the repetitive nature of these genes because they are made of conserved domains [8–10] and the presence of identical sequences in several msg genes due to the recombinations creating mosaicism by gene conversion. Second, the set of subtelomeres present in the cells of each subpopulation probably varies considerably, because a different msg-I gene may be linked to the UCS and because recombinations may occur frequently between msg genes. Consequently, only the most stable part of the subtelomeres that is present in the majority of the cells can be assembled, at least until single-cell sequencing is adapted to P. jirovecii. Thus, a complete set of subtelomeres that could be used as a reference for mapping RNAseq reads does not exist, even when one is analyzing the genome assembly and RNAseq reads from the same P. jirovecii isolate. To circumvent these limitations, we developed dedicated bioinformatics procedures to analyze the expression of the MSG genes.
MATERIAL AND METHODS
RNA Extraction and Whole-Transcriptome Amplification
Total RNAs were extracted from the bronchoalveolar lavage fluid (BALF) specimens (see Supplementary Material) using the RiboPure yeast kit (Ambion). The whole transcriptome was amplified from each RNA preparation using the SeqPlex RNA Amplification Kit (Sigma). The procedure resulted in complementary DNAs (cDNAs) with a mean size of 100–150 bp, as revealed using a 2100 Bioanalyzer system (Agilent Technologies). According to the manufacturer, such a size (ie, below 200–400 bp) indicates degradation of input RNA. Only patient 2’s BALF specimen led to a mean cDNA size of 250 bp. RNA degradation was expected because of the uncontrolled period between collection of the BALF specimens from the patient and arrival in our laboratory, as well as the complex and varying microbiota present in these samples. Our group previously observed varying RNA degradation in BALF specimens by amplifying specific transcripts [20, 21].
Because the SeqPlex RNA Amplification Kit generated cDNAs that were too small for most samples (100–150 bp), the absence of genomic DNA in the RNA preparations was checked on larger cDNAs (approximately 800 bp) obtained from the same RNA preparations using the REPLI-g WTA Single Cell Kit (Qiagen), which, like the SeqPlex RNA Amplification Kit, involves random amplification. This check included (1) the lack of amplification in the absence of reverse transcription (ie, directly on RNA) and (2) the absence of an intron in the polymerase chain reaction (PCR) product from the unrelated gene encoding β-tubulin, as we performed previously [20, 21].
Probe Design
The procedure of enrichment in P. jirovecii RNA using bait probes was derived from that described for Candida albicans [22, 23]. Our SureSelect capture library included a total of 43 793 biotinylated 120-nucleotide bait probes that were designed using eArray software and the 1× tiling option (Agilent Technologies; https://earray.chem.agilent.com/earray/). The probes were head to tail and nonoverlapping. They completely covered the length of each open reading frame (ORF). However, no probe was placed when <120 nucleotides remained at the end of the ORF, so that a small bias was introduced at the 3’ end that was more pronounced for small genes. The probes covered a total of 4135 P. jirovecii ORFs, that is, all 3772 of the reference assembly of the P. jirovecii genome Pneu-jiro_RU7_V2, including 181 msg genes ([8]; 8.4-Mb haploid genome; approximately 4.0-Mb ORFeome), 83 genes and 25 pseudogenes of all 6 MSG families described elsewhere [9], as well as 255 msg genes from previous publications (Supplementary Table 1).
The msg genes and pseudogenes were added to ensure enrichment in msg cDNAs (see text). The UCS including the single copy promoter of MSG family I was also included (locus T551_00002). The mitochondrial ORFs and the single copy ribosomal DNA genes were not covered, whereas the ORFs encoding ribosomal proteins were. The probes that mapped onto the human transcriptome present in the resources of the University of California, Santa Cruz, using the BLAT Search Genome tool (https://genome.ucsc.edu/) were discarded. There was an average of 10.0 probes for each ORF or pseudogene.
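The tiling rule described in this section (120-nucleotide probes placed head to tail without overlap, and no probe when fewer than 120 nucleotides remain) can be illustrated with a minimal sketch. The function names and the 0-based half-open coordinates are assumptions made for illustration; this is not the eArray implementation.

```python
def tile_probes(orf_length, probe_length=120):
    """Return 0-based half-open (start, end) coordinates of nonoverlapping,
    head-to-tail probes. A trailing stretch shorter than probe_length
    receives no probe."""
    n_probes = orf_length // probe_length
    return [(i * probe_length, (i + 1) * probe_length) for i in range(n_probes)]


def uncovered_3prime(orf_length, probe_length=120):
    """Number of nucleotides left without a probe at the 3' end of the ORF."""
    return orf_length % probe_length


if __name__ == "__main__":
    # Hypothetical ORF lengths, chosen only to show the effect of gene size.
    for length in (300, 3050):
        gap = uncovered_3prime(length)
        print(length, len(tile_probes(length)), gap, round(gap / length, 3))
```

With these hypothetical lengths, the uncovered 3’ stretch is a larger fraction of the short ORF than of the long one, which is the bias noted above for small genes.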
Preparation of RNAseq Libraries With Enrichment in P. jirovecii RNA
RNA libraries for RNAseq were prepared using the Agilent SureSelectXT targeted cDNA enrichment kit for multiplexed Illumina sequencing (Agilent Technologies; manufacturer’s reference G9611A), using the 200-ng sample preparation procedure without shearing of DNA, and 16 PCR cycles for the “indexing and sample processing for multiplex sequencing” step. The only change to the manufacturer’s instructions was that the samples were dried for 3–5 minutes at room temperature rather than at 37°C after all purifications with AMPure XP beads. Briefly, double-stranded cDNA was produced with adapters ligated to both ends of the cDNAs, allowing subsequent amplification using primers matching the adapters.
The addition of primers to the cDNAs during the procedure was checked using the Agilent Bioanalyzer system. Each library received a different index that allowed several libraries to be sequenced together (multiplexing). Amplified double-stranded cDNA was incubated at 65°C for 24 hours with our capture library of biotinylated probes, described above. The hybridized sequences were captured with magnetic streptavidin beads. They were next linearly amplified using provided primers and indexed in a new PCR. The nonenriched sample 1NE was sequenced directly after the SeqPlex RNA Amplification Kit, without applying the Agilent SureSelectXT targeted cDNA enrichment kit.
RNAseq Protocol
Libraries resulting from the Agilent SureSelectXT targeted cDNA enrichment kit were sequenced on an Illumina MiSeq system with a Micro Reagent Kit v2 (300 cycles). Sequencing data were processed using Illumina bcl2fastq2 conversion software v2.20. RNAseq paired Illumina reads were merged using BBMap software (version 37.82). Merged reads were mapped onto the human reference genome (GRCh38) using HiSat2 software (version 2.26.0); reads that mapped were discarded, and nonmapping reads were extracted using a combination of samtools (version 1.8, parameters: view -h -b -f 4) and bedtools (version 2.26.0, parameters: bamtofastq). The obtained human-filtered merged reads were deduplicated using the program cd-hit-dup from cd-hit software (version 4.6.8). The proportion of P. jirovecii sequences versus human ones in each sample of RNAseq reads was determined using a splice-aware mapper with standard settings (STAR 2.6.0c). The other procedures are described in the Supplementary Material.
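The human-read filter above relies on the samtools option `-f 4`, which keeps only alignment records whose FLAG field has bit 0x4 (read unmapped) set, while `-h` retains the header. The following minimal sketch reproduces that selection on text SAM lines. It is an illustration rather than the pipeline used in the study: it does not read or write BAM (the `-b` option), and the function names and example records are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(sam_record):
    """True if the FLAG (second tab-separated field) has bit 0x4 set."""
    flag = int(sam_record.rstrip("\n").split("\t")[1])
    return bool(flag & UNMAPPED)


def keep_unmapped(sam_lines):
    """Yield header lines (starting with '@') and unmapped records only,
    mirroring the selection made by `samtools view -h -f 4`."""
    for line in sam_lines:
        if line.startswith("@") or is_unmapped(line):
            yield line


def make_record(name, flag, ref="*", pos=0):
    """Build a minimal 11-field SAM record (hypothetical example data)."""
    fields = [name, str(flag), ref, str(pos), "0", "*", "*", "0", "0", "ACGT", "IIII"]
    return "\t".join(fields)


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6",
        make_record("read_human_fwd", 0, "chr1", 100),   # mapped, discarded
        make_record("read_human_rev", 16, "chr2", 200),  # mapped (reverse), discarded
        make_record("read_nonhuman", 4),                 # unmapped, kept
    ]
    for line in keep_unmapped(sam):
        print(line.split("\t")[0])
```

Only the header and the record with FLAG 4 are retained, which corresponds to the reads passed on to deduplication.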
RESULTS
RNAseq of Clinical Samples With Enrichment in P. jirovecii RNA
To study the expression of the P. jirovecii MSG superfamily, we analyzed total RNAs extracted from the BALF specimens of 6 patients with PCP. Because the proportion of P. jirovecii RNA in such samples is low (approximately 3%) [24], an enrichment step was required. To that end, we used the Agilent SureSelectXT targeted cDNA enrichment kit, which relies on hybridization to bait probes covering the whole P. jirovecii ORFeome (complete set of protein coding sequences). To ensure enrichment in msg cDNAs (complementary DNA synthesized from RNA), probes were also derived from published msg gene sequences (Supplementary Table 1). The latter approach was based on the presence of conserved motifs within msg genes and the fact that the kit allows mismatches in order to detect sequence variants.
The enrichment procedure increased the proportion of P. jirovecii RNA to 20%–60% (Table 1; see enriched and nonenriched samples from patient 1). After elimination of the human reads, the merged Illumina RNAseq paired sequence reads (150–250 nucleotides) were deduplicated to avoid biases due to the PCR amplification steps included in the sample preparation. These samples of reads, with characteristics provided in Table 1, were subsequently analyzed.
Table 1. Characteristics of the RNA Sequencing Reads From Bronchoalveolar Lavage Fluid Specimens of 6 Patients With Pneumocystis Pneumonia
Characteristic | Sampleᵃ | | | | | | | |
---|---|---|---|---|---|---|---|---|
 | 1E | 1NE | 2Ea | 2Eb | 3E | 4E | 5E | 6E |
Underlying disease | HIV | HIV | KT | KT | ALL | PNET | MM | HIV |
Pneumocystis jirovecii by real-time PCR, ×10⁶ copies/mL | 22.9 | 22.9 | 0.7 | 0.7 | 1111 | 3.1 | 4.2 | 57.4 |
Read pairs, ×10⁶ | 12.6 | 8.1 | 2.7 | 1.0 | 1.0 | 11.0 | 0.5 | 2.7 |
Merged read pairs, ×10⁶ | 5.5 | 4.3 | 2.0 | 0.9 | 0.9 | 1.0 | 0.4 | 2.5 |
P. jirovecii reads in merged read pairs, % | 55 | 0.2 | 21 | 23 | 62 | 26 | 52 | 57 |
Human reads in merged read pairs, % | 23 | 89 | 50 | 57 | 7 | 39 | 13 | 5 |
Deduplicated merged read pairs without human reads, ×10³ | 1499 | 464 | 703 | 126 | 183 | 153 | 73 | 383 |
Abbreviations: ALL, acute lymphocytic leukemia; BALF, bronchoalveolar lavage fluid; HIV, human immunodeficiency virus; KT, kidney transplantation; MM, multiple myeloma; PCR, polymerase chain reaction; PNET, primitive neuroectodermal tumor.
ᵃ Sample names denote the patient number (1–6), followed by E for enriched (in P. jirovecii complementary DNA, using the Agilent SureSelect enrichment kit) or NE for nonenriched; a and b indicate duplicates from the same sample (from patient 2).
Assignment of RNAseq Reads and MSG Expression Analysis
Specific weighted profiles (similar to consensus sequences) based on published msg gene sequences were generated for each of the 6 MSG families. Specific profiles were also generated for 8 control genes that can be considered housekeeping genes (except superoxide dismutase), although only a single gene sequence exists for each of them. These profiles were used to assign RNAseq reads to 1 MSG family or control gene, using a conservative best hit approach (Table 2; see Methods). The reproducibility of the procedure was assessed according to the similarity of the results obtained for 2 independent analyses of patient 2’s BALF specimens (samples 2Ea and 2Eb). Importantly, the results of the enriched and nonenriched samples from patient 1 were also similar (samples 1E and 1NE). This latter finding validated the procedure of enrichment in P. jirovecii RNA, including in msg transcripts.
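The idea of a conservative best hit assignment can be sketched as follows. The actual scoring and thresholds are described in the Supplementary Material and are not reproduced here: the score values, the `min_score` and `min_margin` cutoffs, and the function names below are all hypothetical and serve only to illustrate the principle that a read is assigned to a profile when its best score is both high enough and clearly better than the runner-up.

```python
from collections import Counter


def assign_best_hit(scores, min_score=50.0, min_margin=10.0):
    """Return the name of the best-scoring profile for one read, or None.

    scores: dict mapping profile name -> alignment score (hypothetical units).
    A read stays unassigned when its best score is below min_score or when
    the best score does not exceed the second best by at least min_margin.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_name, best_score = ranked[0]
    if best_score < min_score:
        return None
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        return None
    return best_name


def proportions(assignments):
    """Percentage of assigned reads per profile (unassigned reads ignored)."""
    counts = Counter(a for a in assignments if a is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    reads = [
        {"MSG-I": 120.0, "MSG-III": 60.0},  # clear best hit
        {"MSG-I": 100.0, "MSG-II": 95.0},   # ambiguous, left unassigned
        {"MSG-III": 80.0},                  # single hit above threshold
        {"MSG-I": 30.0},                    # score too low
    ]
    calls = [assign_best_hit(r) for r in reads]
    print(calls, proportions(calls))
```

Leaving ambiguous reads unassigned trades sensitivity for specificity, which is one plausible reading of "conservative" in this context.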
Table 2. Proportion of RNA Sequencing Read Samples Assigned to 1 Major Surface Glycoprotein Family or Control Geneᵃ
MSG Family or Control Gene | Proportion of Sample, %ᵇ | | | | | | | |
---|---|---|---|---|---|---|---|---|
 | 1E (182 389)ᶜ | 1NE (1213) | 2Ea (61 248) | 2Eb (10 641) | 3E (7324) | 4E (7106) | 5E (8948) | 6E (49 497) |
MSG familyᵈ | | | | | | | | |
I (A1) | 79.4 | 87.4 | 82.5 | 78.3 | 95.0 | 77.1 | 87.6 | 86.1 |
II (A3) | 3.3 | 2.5 | 3.1 | 2.6 | 1.1 | 2.1 | 3.4 | 3.3 |
III (A3) | 6.0 | 5.4 | 8.2 | 14.5 | 2.8 | 17.7 | 7.1 | 6.1 |
IV (B) | 1.6 | 1.1 | 2.1 | 2.4 | 0.2 | 3.0 | 0.8 | 4.3 |
V (D) | 0.3 | 0.2 | 1.4 | 0.1 | 0.2 | 0 | 0 | 0 |
VI (E) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Actin 1 | 0.3 | 0.3 | 0.1 | 0 | 0 | 0 | 0 | 0 |
α-Tubulin | 4.2 | 1.8 | 1.8 | 1.4 | 0.2 | 0 | 0 | 0 |
β-Tubulin | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Dihydrofolate reductase | 1.8 | 0.4 | 0.5 | 0.7 | 0.5 | 0.1 | 0 | 0.1 |
Dihydropteroate synthase | 0.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Elongation factor 2 | 1.4 | 0.5 | 0.3 | 0.7 | 0 | 0 | 1.2 | 0 |
Elongation factor 3 | 1.3 | 0.5 | 0.2 | 0 | 0 | 0 | 0 | 0.1 |
Superoxide dismutase | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Abbreviation: MSG, major surface glycoprotein.
ᵃ RNA sequencing reads were assigned to 1 MSG family or gene using specific weighted profiles.
ᵇ Sample names denote the patient number (1–6), followed by E for enriched (in P. jirovecii complementary DNA, using the Agilent SureSelect enrichment kit) or NE for nonenriched; a and b indicate duplicates from the same sample (from patient 2).
ᶜ Parenthetical numbers in column heads represent the total number of merged reads (without human, deduplicated) assigned to 1 MSG family or control gene.
ᵈ The MSG family nomenclature of Ma et al [8] appears in parentheses.
All 8 samples of reads from the 6 patients gave comparable results. The vast majority of the assigned reads were from MSG family I (77.1%–95.0%), the second most important population was from MSG family III (2.8%–17.7%), and all other MSG families and control genes were less represented (0.1%–4.2%). Transcripts of MSG family VI, superoxide dismutase, and β-tubulin were not detected. These observations demonstrated that, at the population level, genes of all MSG families are expressed, except possibly those of family VI. Genes of families I and III are expressed at a higher level than the other families and all control genes. The level of expression of the genes of family I was the highest, that is, 0–50 times higher than the housekeeping genes investigated.
In Silico Estimation of the Diversity of the MSG Genes Expressed
The diversity of the genes expressed of each MSG family was estimated by calling single-nucleotide variants within a window sliding along the alignment of the RNAseq reads with the specific weighted profile. The optimal size of the window to count haplotypes for all MSG families was determined to be 30 bp (Supplementary Figure 1A). Because the RNAseq reads were deduplicated, the number of haplotypes (sequences with specific single-nucleotide variants) obtained can be considered a surrogate for the number of different msg isoforms expressed. This number depended on the minimum proportion of the reads covering the window that was required to support each haplotype, especially for family I (Supplementary Figure 1B). To avoid detecting sequencing errors while remaining sufficiently sensitive, we chose a proportion of 0.01 for all of our analyses, because it is the usual error rate within Illumina reads [25, 26].
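The windowed haplotype count described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure of the Supplementary Material: reads are assumed to be already aligned to the profile and padded to its length, with `.` marking positions a read does not cover, and the function names are hypothetical. Only the 30-bp window and the 0.01 minimum proportion come from the text.

```python
from collections import Counter


def count_haplotypes(aligned_reads, start, window=30, min_fraction=0.01):
    """Count distinct sequences (haplotypes) in one window.

    aligned_reads: strings in profile coordinates, '.' where a read gives
    no coverage. Only reads covering the whole window are considered.
    A haplotype is counted when it is supported by at least min_fraction
    of those reads. Returns (n_haplotypes, n_covering_reads).
    """
    segments = (read[start:start + window] for read in aligned_reads)
    covering = [s for s in segments if len(s) == window and "." not in s]
    if not covering:
        return 0, 0
    threshold = min_fraction * len(covering)
    counts = Counter(covering)
    n_haplotypes = sum(1 for n in counts.values() if n >= threshold)
    return n_haplotypes, len(covering)


def haplotype_profile(aligned_reads, profile_length, window=30, min_fraction=0.01):
    """Slide the window one position at a time along the profile."""
    return [
        count_haplotypes(aligned_reads, start, window, min_fraction)
        for start in range(profile_length - window + 1)
    ]


if __name__ == "__main__":
    length = 60
    major = "A" * length
    minor = "A" * 10 + "C" + "A" * (length - 11)  # variant at position 10
    rare = "A" * 10 + "G" + "A" * (length - 11)   # below the 1% support level
    reads = [major] * 150 + [minor] * 49 + [rare]
    print(count_haplotypes(reads, 0))    # window spans the variant position
    print(count_haplotypes(reads, 11))   # window lies past the variant
```

In this toy data set the variant carried by a single read out of 200 falls below the 0.01 threshold and is treated as a likely sequencing error, whereas the variant carried by 49 reads defines a second haplotype.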
For all MSG families in all samples of RNAseq reads, the number of haplotypes identified was most often proportional to the read coverage along the profile (eg, family I in sample 3E; Figure 2A). The only exceptions were for family I in samples 1E and 2Ea, which had more reads than the other samples (Figure 2B and 2C). The peaks of coverage at 3’ and especially 5’ regions in Figure 2 are likely to result from RNA degradation, as well as from a gene-specific degradation pattern [27]. Interestingly, the 2 samples with sufficient coverage, 1E and 2Ea, provided drastically reduced numbers of msg-I haplotypes at the same 4 locations along the profile, approximately at positions 100, 2000, 2300, and 2500 bp. These positions might correspond to conserved regions between protein domains, where recombinations between these genes occur preferentially.

Figure 2. In silico estimation of the diversity of msg-I isoforms present among samples of RNA sequencing (RNAseq) reads. Single-nucleotide variants called within a 30-base pair (bp) sliding window defined 30-bp haplotypes. Black lines show the number of haplotypes identified along the msg gene; streaked lines, the number of reads analyzed in each window. A, Sample 3E. B, Sample 1E. C, Sample 2Ea.
We calculated the median of the numbers of haplotypes obtained along each MSG profile for all samples of RNAseq reads (Table 3; some values could not be obtained because of insufficient read coverage). All of these values should be considered minimal because of the conservative parameters used and the dependence on coverage. The observed number of haplotypes varied from 3 to 21 for MSG family I, and from 1 to 4 for the other families. The 2 samples (1E and 2Ea) with sufficient coverage for family I both provided a value of approximately 20. These results suggested (1) that family I presents the highest diversity of isoforms expressed during P. jirovecii infection and (2) that a number of different isoforms of each MSG family are expressed, except possibly for families V (only 1 haplotype detected) and VI (no reads detected).
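The summary statistic used here, the median of the per-window haplotype counts with no value reported when coverage is insufficient, can be sketched as follows. The coverage cutoff `min_coverage` and the function name are hypothetical; the text does not state the threshold that was applied.

```python
from statistics import median


def median_haplotypes(per_window, min_coverage=100):
    """Median haplotype count over windows with enough reads.

    per_window: list of (n_haplotypes, n_reads) tuples, one per window.
    Returns None (reported as ND) when no window reaches min_coverage.
    """
    usable = [n_hap for n_hap, n_reads in per_window if n_reads >= min_coverage]
    return median(usable) if usable else None


if __name__ == "__main__":
    # Hypothetical per-window values, not data from the study.
    windows = [(3, 500), (5, 400), (4, 50), (21, 1000)]
    print(median_haplotypes(windows))    # the window with 50 reads is skipped
    print(median_haplotypes([(2, 10)]))  # no usable window
```

Because windows with low coverage tend to yield fewer haplotypes, a median computed this way is best read as a lower bound, in line with the caveat above.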
Table 3.
MSG Familyᵃ | Haplotypes per Sample, Medianᵇ | | | | | | | |
---|---|---|---|---|---|---|---|---|
 | 1E (165 244) | 1NE (1172) | 2Ea (59 594) | 2Eb (10 418) | 3E (7273) | 4E (7099) | 5E (8850) | 6E (49 398) |
I (A1) | 18 | NDᶜ | 21 | 4 | 3 | ND | 2 | 6 |
II (A3) | 4 | ND | 4 | ND | ND | ND | ND | ND |
III (A3) | 1 | ND | 4 | ND | ND | ND | ND | ND |
IV (B) | 2 | ND | 3 | ND | ND | ND | ND | ND |
V (D) | ND | ND | ND | 1 | ND | ND | ND | ND |
VI (E) | ND | ND | ND | ND | ND | ND | ND | ND |
Abbreviations: MSG, major surface glycoprotein; ND, not determined.
ᵃ The MSG family nomenclature of Ma et al [8] appears in parentheses.
ᵇ The median number of expressed haplotypes was determined for each major MSG family, using single-nucleotide variant calling within the RNA sequencing reads within a sliding window. Sample names denote the patient number (1–6), followed by E for enriched (in P. jirovecii complementary DNA, using the Agilent SureSelect enrichment kit) or NE for nonenriched; a and b indicate duplicates from the same sample (from patient 2). Parenthetical numbers in column heads represent the total number of MSG RNA sequencing reads analyzed.
ᶜ Not determined because of insufficient read coverage (or no reads for family VI).
In Vitro Assessment of the Diversity of msg-I Isoforms Expressed
We assessed, at the DNA level, the diversity of the msg-I isoforms expressed in the P. jirovecii population infecting each of the 6 patients. To that end, the repertoire of these genes was amplified from the genomic DNA of each BALF specimen, using primers located in the UCS, which contains the single-copy promoter, and at the end of the genes. The PCR product was subcloned, and several subclones were sequenced. The samples from all 6 patients showed a substantial diversity of expressed msg-I isoforms; that is, 27%–80% of the 10–15 subclones sequenced per patient were unique (Table 4). Only 2 patients (patients 1 and 2) shared a single sequence.
Variable | Patient 1 | Patient 2 | Patient 3 | Patient 4 | Patient 5 | Patient 6 |
---|---|---|---|---|---|---|
Subclones sequenced, no. | 12 | 11 | 10 | 10 | 10 | 15 |
Unique subclones, % (no.) | 58 (7) | 27 (3) | 80 (8) | 70 (7) | 30 (3) | 33 (5) |
Mean identity of subclones, % (range) | 69 (66–81) | 72 (68–75) | 68 (63–80) | 69 (63–78) | 98 (97–98) | 73 (65–88) |
aThe repertoire of expressed msg-I genes (ie, linked to and thus under the control of the single copy promoter present in the upstream conserved sequence) was amplified by means of polymerase chain reaction (PCR) from the genomic DNA of the patient’s bronchoalveolar lavage fluid specimen; the PCR product was subcloned in a plasmid, and approximately 850 base pairs of the 5’ end of each subclone were sequenced.
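As a quick check of the arithmetic behind Table 4, the percentage of unique subclones can be recomputed from the counts in the table. This is a minimal sketch; the variable names are illustrative, and rounding to the nearest integer is an assumption about how the published percentages were obtained.

```python
# Counts taken from Table 4, keyed by patient number.
sequenced = {1: 12, 2: 11, 3: 10, 4: 10, 5: 10, 6: 15}
unique = {1: 7, 2: 3, 3: 8, 4: 7, 5: 3, 6: 5}

# Percentage of unique subclones per patient, rounded to the nearest integer.
percent_unique = {
    patient: round(100 * unique[patient] / sequenced[patient])
    for patient in sequenced
}

lowest = min(percent_unique.values())
highest = max(percent_unique.values())
```

The recomputed values reproduce the percentages in the table and the 27%–80% range quoted in the text.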
DISCUSSION
The human pathogenic fungus P. jirovecii most probably harbors a system of surface antigenic variation, which presumably ensures both escape from host immunity and adhesion to target cells. This system involves 6 families of hypervariable surface glycoproteins, the MSGs, with family I being under mutually exclusive expression at the level of the individual cell. In the present study, we analyzed the expression pattern of the genes encoding these proteins at the RNA level. The results were similar in 6 patients with PCP. The msg transcripts included members of ≥5 families. Expression levels of families I and III were higher than those of the other MSG families and of housekeeping genes. Family I was expressed at by far the highest level. A number of different isoforms of ≥4 of the 6 families were expressed.
Importantly, the 6 patients whom we investigated were each coinfected with several P. jirovecii strains (see Methods), so the results are averages over several strains. Nevertheless, the similarity of the results from the 6 patients suggests that all strains expressed the MSGs similarly. We did not detect transcripts of MSG family VI. In P. murina, the proteins of this family are present only at the surface of ascospores, either within asci or recently released from them [28]. This observation suggested that this family is subject to a particular regulation during the cell cycle. Because asci generally represent a minority of approximately 5% of the infecting P. jirovecii populations [29], it is possible that the transcripts encoding these proteins were present in amounts too low to be detected by our procedure. Interestingly, the genes of this family are all localized in the distal region of the subtelomeres, most distant from the telomeres and closest to the genomic genes. Moreover, recombinations between them are less frequent than between the genes of the other MSG families [9]. A relationship between chromosomal location, expression in ascospores, and low frequency of mosaicism is likely to exist.
Our observations support several aspects of our model for the P. jirovecii antigenic variation system (Figure 1). The observed expression of several msg-I isoforms within each infecting population is consistent with the hypothesis of a continuous segregation of subpopulations, each expressing a single, different isoform. The value of approximately 20 different isoforms that we observed in 2 patients (medians of 18 and 21 haplotypes in samples 1E and 2Ea) is close to the single value of 18 reported so far [9]. Because the latter value and those reported in the present work are all minimum estimates, it is likely that a higher diversity of family I is actually expressed. This hypothesis is also supported by the peaks of up to approximately 30–35 haplotypes that we observed for this family (Figure 2B and 2C, at position 227).
For the other MSG families, the mean number of haplotypes observed was always dependent on the read coverage. Nevertheless, peaks of up to 10–20 haplotypes were observed for all families in sample 1E, which had the highest coverage, suggesting that all families might also present a substantial diversity of expressed genes. The expression of several isoforms of all MSG families that we observed, except possibly for families V and VI, is compatible with the postulated independent expression enabled by the promoter that each gene possesses. However, our analyses did not allow us to assess whether all genes of each family were transcribed. Thus, it remains to be determined whether these latter genes are constitutively expressed, subject to regulation during the cell cycle, and/or silenced owing to the proximity of the telomeres (by the “telomere position effect”) [30].
The very high level of expression of msg-I isoforms that we observed in P. jirovecii is consistent with our model and with previous studies at the protein level [8, 11]. This high expression might be due to transcription enhancement driven by the intron present in the UCS, which is larger than that present in the other MSG families [9, 10]. On the other hand, the high expression of family III is a new feature. It may not be due to the presence of 2 introns of common size for P. jirovecii (40–60 bp) close to the promoter of msg-III genes, because a similar arrangement is present in msg-II genes, which are not overexpressed.
We analyzed immunocompromised patients with active PCP. However, the surface antigenic variation system of P. jirovecii has probably evolved in immunocompetent humans without PCP and thus is above all a colonization factor. Colonized individuals potentially include several categories of humans, for example, infants experiencing primary infection, transient carriers (eg, healthcare workers in contact with patients with PCP), patients with chronic lung diseases, pregnant women, and elderly people [31]. In these colonized individuals, the number of subpopulations expressing a different msg-I isoform could be reduced by the functional immune system. This hypothesis deserves to be tested to help elucidate the surface antigenic variation system of P. jirovecii.
In conclusion, our results enhance the understanding of the mechanisms involved in the surface antigenic variation of P. jirovecii, as well as of its cell surface structure. The postulated strategy of continuously producing antigenically distinct subpopulations would be unique among human pathogens and might be associated with the nonsterile niche within the lungs [9, 32]. By contrast, pathogens that occupy sterile niches (blood, tissue), such as Plasmodium and Trypanosoma, rely on cell populations that are antigenically homogeneous. Further work is needed to decipher the unique surface antigenic variation system of P. jirovecii.
Supplementary Data
Supplementary materials are available at The Journal of Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.
Notes
Acknowledgments. We thank Dr Michael Walter of the Agilent Diagnostic and Genomics group for his help in the design of the probes and RNA extraction protocol. We also thank Thierry Schuepbach for help in the initial phase of the analyses of the haplotype numbers. Sequencing was performed at the Lausanne Genomic Technologies Facility, University of Lausanne, Switzerland. Computations were performed at the Vital-IT Center for High-Performance Computing of the Swiss Institute of Bioinformatics (http://www.vital-it.ch).
Disclaimer. The funder had no role in any steps of the study.
Financial support. This work was supported by the Swiss National Science Foundation (grant 310030_165825).
Potential conflicts of interest. All authors: No potential conflicts. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Conflicts that the editors consider relevant to the content of the manuscript have been disclosed.
References
Author notes
M. P. and P. M. H. contributed equally to this work.