Impact of pathogen genetics on clinical phenotypes in a population of Talaromyces marneffei from Vietnam

Abstract Talaromycosis, a severe and invasive fungal infection caused by Talaromyces marneffei, is difficult to treat and impacts those living in endemic regions of Southeast Asia, India, and China. While 30% of infections result in mortality, our understanding of the genetic basis of pathogenesis for this fungus is limited. To address this, we apply population genomics and genome-wide association study approaches to a cohort of 336 T. marneffei isolates collected from patients who enrolled in the Itraconazole vs Amphotericin B for Talaromycosis trial in Vietnam. We find that isolates from northern and southern Vietnam form two distinct geographical clades, with isolates from southern Vietnam associated with increased disease severity. Leveraging longitudinal isolates, we identify multiple instances of disease relapse linked to unrelated strains, highlighting the potential for multistrain infections. In more frequent cases of persistent talaromycosis caused by the same strain, we identify variants arising over the course of patient infections that impact genes predicted to function in the regulation of gene expression and secondary metabolite production. By combining genetic variant data with patient metadata for all 336 isolates, we identify pathogen variants significantly associated with multiple clinical phenotypes. In addition, we identify genes and genomic regions under selection across both clades, highlighting loci undergoing rapid evolution, potentially in response to external pressures. With this combination of approaches, we identify links between pathogen genetics and patient outcomes and identify genomic regions that are altered during T. marneffei infection, providing an initial view of how pathogen genetics affects disease outcomes.


Introduction
The thermally dimorphic fungal pathogen, Talaromyces marneffei, causes talaromycosis, a severe fungal infection that affects immunocompromised people living in, or traveling to, endemic regions spanning Southeast Asia, India, and China (Qin et al. 2020;Narayanasamy, Dougherty, et al. 2021). People with advanced HIV disease are particularly affected, in whom the prevalence of talaromycosis can reach 26.5% in some regions (Qin et al. 2020). The impact of this disease is an estimated 4,900 (95% CI 2,500-7,300) deaths annually (Narayanasamy, Dat, et al. 2021).
Talaromycosis is difficult to treat, with mortality rates of up to 30% (Le et al. 2011;Jiang et al. 2019), and rates of infection increase throughout the rainy season, disproportionately impacting farmers and agricultural workers (Chariyalertsak et al. 1997;Narayanasamy, Dougherty, et al. 2021).
T. marneffei infection originates primarily through inhalation (Narayanasamy, Dougherty, et al. 2021). Once in the lungs, T. marneffei can disseminate to multiple organs, including the liver, spleen, lymph nodes, blood, bone marrow, bone, and skin (Supparatpinyo et al. 1994;Vanittanakom et al. 2006;Wong and Wong 2011;Narayanasamy, Dougherty, et al. 2021). The infectious propagule includes aerosolized conidia, which may transition to the pathogenic yeast form upon inhalation (Pruksaphon et al. 2022). Effective containment by lung-resident primary immune cells is critical to preventing dissemination, as T. marneffei possesses multiple virulence traits that enable its persistence within the human host. These virulence factors include the ability to grow at 37°C (Vanittanakom et al. 2006), the production of reactive oxygen species detoxifying enzymes that may promote survival within host macrophages (Pongpom et al. 2013), the secreted galactomannoprotein Mp1 that suppresses the proinflammatory host immune responses (Lam et al. 2019), and multiple laccases important for resisting external stressors and killing by phagocytes (Sapmak et al. 2016). The recommended treatment for Talaromycosis includes amphotericin B and itraconazole. Whilst there are no established breakpoints described for antifungal resistance, minimum inhibitory concentrations of azoles (0.008-16 μg/ ml) and amphotericin B (0.12-1 μg/ml) have been reported against T. marneffei (Stott et al. 2021;Fang et al. 2022). Further identification of genes important for virulence and resistance in T. marneffei is still needed to enable a better understanding of how Talaromyces adapts and survives within the host.
Recently generated complete genome assemblies for representative isolates of T. marneffei from north and south Vietnam (Cuomo et al. 2020) will enable further discovery of genes essential for virulence, as well as the evaluation of genetic variability across clinical isolates. Genetic variation in populations of many pathogenic fungi has been surveyed (Desjardins et al. 2017;Rhodes, Desjardins, et al. 2017Chow et al. 2020;O'Brien et al. 2021), and our initial understanding of variation across populations of T. marneffei has been guided by work that uncovered the population delineation of country-linked clades, based on a multilocus sequence typing (MLST) study (Henk et al. 2012). This prior work supplied an overview of the population structure of T. marneffei and provided a foundation for further analysis, based on whole genome sequencing, to better understanding of the genetic variation within clinical populations of T. marneffei. Characterizing genetic heterogeneity in populations of Aspergillus, Cryptococcus, and Candida has enabled a better understanding of resistance mechanisms and virulence traits through the study of aneuploidy (Selmecki et al. 2006;Poláková et al. 2009;Stone et al. 2019), copy number variation (Chow et al. 2020), and variant-phenotype association studies (Desjardins et al. 2017;Gerstein et al. 2019;Rhodes et al. 2022;Sephton-Clark et al. 2022). Population studies leveraging whole genome data have also allowed for the discovery of clade specific adaptation and evolution, such as the loss of subtelomeric adhesins in the only non-outbreak causing Candida auris clade (clade II) (Muñoz et al. 2021). In addition to understanding heterogeneity across populations, these approaches have enabled the interrogation of variability evolving in isolates over time, through the study of patient longitudinal and relapse isolates (Ormerod et al. 2013;Chen et al. 2017;Helmstetter et al. 2022), and microevolution of isolates passaged through infection models (Ene et al. 2018;Fu et al. 2021;Sephton-Clark et al. 2023).
In this study, we leverage a cohort of 336 clinical T. marneffei isolates from patients living with HIV in northern and southern Vietnam (Le et al. 2017) to explore population diversity, determine levels of recombination, identify genomic regions under selection, and assess genomic predictors of clinical phenotypes. With this approach, we identify two geographically distinct populations showing similar levels of recombination within and between clades. Notably, we find that clinical aspects of disease severity differ between isolates from these two clades. We employ longitudinal samples isolated from blood and skin lesions to identify variants arising over the course of invasive infection. We observe five instances of disease recrudescence/relapse that are linked to the independent introduction of an unrelated isolate, in addition to the more frequent detection of persistent infection by the same isolate. We identify multiple genomic regions that appear to be under selection across these two populations, with copy number variation arising throughout the population, and multiple genetic variants showing a significant association with relevant clinical parameters including a high initial fungal burden in the blood, mortality, relapse, immune reconstitution inflammatory syndrome (IRIS), and poor treatment response.

Sample collection and antifungal susceptibility testing
The 336 clinical T. marneffei isolates were obtained from cultures of blood and/or skin lesions of 272 patients from five hospitals across Vietnam who participated in the Itraconazole vs Amphotericin B for Talaromycosis (IVAP) trial conducted between October 2012 and December 2016 in Vietnam (Le et al. 2017). The in vitro antifungal susceptibility of all 336 T. marneffei yeast isolates against itraconazole and amphotericin B was determined using the M27-A3 reference method for testing yeasts as per the clinical and laboratory standards institute (CLSI) (Rex et al. 2008). The minimum inhibitory concentration (MIC) was determined by both visual and optical endpoints with growth inhibition of 90% for amphotericin B and 50% for itraconazole according to the CLSI. The study was approved by the ethics and scientific committees of the University of Oxford, all five participating hospitals in Vietnam, and Duke University. All patients gave informed consent for their T. marneffei clinical isolates to be stored and used for this research.

Clinical metadata
De-identified clinical metadata was collected and shared by the IVAP trial investigators. These data included a range of clinical measures collected during the IVAP trial and deemed to be nonidentifiable. These measures include presence of fever, measures of baseline fungal burden in colony forming units (CFUs) per ml of blood, rates of fungal clearance in CFUs/ml/day, 24-week mortality, time to clinical resolution (defined as resolution of fever, skin lesions, and positive blood culture), incidence of relapse and IRIS, and respiratory failure over 24 weeks of follow-up. These patient outcome variables were originally defined in the IVAP trial (Le et al. 2017). All statistical analyses were performed in R 3.6.0.

Genomic DNA isolation and sequencing
DNA was purified using the yeast genomic DNA purification kit (VWR). Briefly, cells were disrupted with sodium dodecyl sulphate (SDS) (VWR) at 65°C. The pellet was then treated with ammonium acetate and RNase A (VWR). DNA was precipitated with isopropanol and stored in Tris-EDTA buffer at a pH of 8.0 (VWR). DNA purity was assessed via Nanodrop (Thermo Fisher Scientific) and 1% agarose gel electrophoresis. The DNA concentration was measured via Qubit fluorometer (Thermo Fisher Scientific). Initial library construction for the isolates entailed tagmentation via the Nextera XT DNA library preparation protocol (Illumina). Isolates were initially sequenced on a HiSeq 3000 (Illumina), generating 100-bp paired reads. Samples with poor (<10X) or biased coverage resulting from this first method then underwent additional sequencing, with libraries generated in accordance with the NEBNext Ultra II DNA Library Prep protocol (New England Biolabs). Libraries were sequenced on a HiSeq X10 (Illumina), generating 150-bp paired reads (minimum coverage 10X per sample).

Population genomics and genome-wide association studies
A maximum likelihood phylogeny was estimated using 281,773 segregating SNP sites present in one or more isolates, allowing ambiguity in a maximum of 10% of samples, with RAxML version 8.2.12 GTRCAT rapid bootstrapping (1,000 replicates) (Stamatakis 2014). The phylogeny generated was visualized with ggtree . These SNP sites were also used to compute an unrooted phylogenetic network with splitstree4 (Huson and Bryant 2006). To assess support for clades identified via phylogenetic methods, MavericK performed estimation of K. Due to compute and convergence constraints, 100 iterations of rMavericK MCMC (Verity and Nichols 2016) was run for all samples, with a random subset of 1,000 variants selected per iteration. To assess recombination, linkage disequilibrium (LD) decay was estimated with vcftools version 0.1.16, based on all variants passing the filtering steps outlined above, for all isolates included in the population being measured (northern clade, southern clade, and all isolates). To do this, we calculated LD for 1000-bp windows, with a minimum minor allele frequency of 0.1, and the -hap-r2 option (Danecek et al. 2011). LD50 values were identified based on the number of bases at which the R 2 values, for each population, halved from the highest initial values at 1 kb. To identify regions of the genome under selection within these populations, PopGenome 2.7.5 was used to perform composite likelihood ratio (CLR) analysis, assess nucleotide diversity and divergence, and calculate Tajima's D and F-Statistic scores (per chromosome, by 5-kb windows) (Pfeifer et al. 2014). Copy number variation was calculated with CNVnator v0.3 (Abyzov et al. 2011). Association analysis between clinical data, in vitro phenotypes, and variants was carried out using PLINK v1.08p formatted files and Gemma version 0.94.1 (Zhou and Stephens 2012) (options: centered relatedness matrix gk 1, linear mixed model), as previously described (Desjardins et al. 2017).

Centromere identification
To identify centromere locations, the GC% was calculated across individual chromosomes with a sliding window approach in R 3.6.0. Repeat motifs across the proposed centromere regions were identified with RepeatModeler 2.0 (Flynn et al. 2020). Recombination rates across individual chromosomes were estimated with a random subset of 50 isolates spanning north and southern clades, via Ldhelmet 1.10 per the best practices workflow, which recommends using a window size of 50 SNPs, computing Padé coefficients, and calculating a mutation matrix (Chan et al. 2012).

Characterization of genome structure
For this study, we leveraged a complete T. marneffei genome generated from isolate 11-CN-03-130. This assembly consists of eight chromosomes, with telomeric sequences present at both ends of all chromosomes excepting one ending in ribosomal DNA repeats (Cuomo et al. 2020). This haploid genome encodes 10,025 genes, including rRNA content on chromosome 3, with a BUSCO completeness score of 97.7% (Cuomo et al. 2020). To determine the location of the centromeric regions of these chromosomes, we calculated GC% across all chromosomes and detected single regions displaying notable reductions in GC% for each chromosome, corresponding to likely centromeric positions (Fig. 1a). To determine the composition of these proposed centromeric regions, we identified repetitive sequences based both on similarity to known elements and de novo self-alignments, identifying long interspersed nuclear elements (LINE) and DNA/TcMar-Fot1 transposons throughout these putative centromeric regions (Fig. 1b). The repetitive content across these regions is consistent with the repetitive nature of centromere regions observed across other species of filamentous fungi (Smith et al. 2012). These proposed regions range from 15.5 to 28 kb in length, with an average centromere length for all chromosomes of 23 kb. To assess whether rates of recombination across these proposed centromeric regions are lower when compared to the non-centromeric regions, we calculated recombination rates per chromosome, with variant information from a subset of 50 isolates sequenced in this study. We saw similar rates of recombination across chromosomes (Supplementary Figure 1), and while we did not observe any reduction in recombination rate that might be associated with centromeric locations, the proposed centromeric regions displayed rates that were in line with those observed across the entire chromosome.

Population structure and geographical clades
To finely characterize genetic variation of clinical populations, we compared the genomes of T. marneffei isolated from HIV-positive patients enrolled in the IVAP trial across northern and southern Vietnam (Le et al. 2017). All individuals were HIV-positive and had T. marneffei isolates obtained from blood cultures taken at enrollment (Day 0). For a subset of individuals, T. marneffei isolates were obtained from both cultures taken at enrollment (Day 0), and from longitudinal time points for both blood and skin sites. Longitudinal samples were collected between Day 4 and Day 240, with the median longitudinal sample collection time being 1 month. We performed whole genome sequencing on all isolates collected, requiring a minimum genome coverage of 10X. We then called variants for these isolates, using a reference genome assembly (Cuomo et al. 2020) of isolate 11CN-03-130, also collected from southern Vietnam as part of the IVAP trial.
To assess population structure, we generated a maximum likelihood phylogeny from segregating SNP site information (Fig. 2). Isolates clearly separate into two clades based on bootstrap support, SplitsTree analysis, and K = 2 determination via MavericK for this population. There is a significant difference between evidence scores for K = 1 and K = 2, but overlapping distributions (IQR) of -log 2 evidence scores for K = 2 and K = 3 (K = 2: Q1 −18347, Q3 −17,632. K = 3: Q1 −17901, Q3 −17,337). These two clades represent the northern and southern regions of Vietnam from which they were isolated. This pattern is consistent with prior geographical substructure with clade separation based on region of collection (Henk et al. 2012). Clinical and in vitro susceptibility phenotypes appear well distributed across the phylogeny for these isolates (Fig. 2, Supplementary Figure 2). Of these isolates, 168 harbor the MAT1-2 locus, found across both northern and southern clades. Northern isolates show slightly higher levels of diversity (Table 1) and possess significantly longer terminal branch lengths when compared to southern clade isolates (Mann-Whitney U-Test P < 0.00001).

Fig. 2.
Phylogeny of patient isolates reveals no genetic clusters of severe outcomes. A maximum likelihood phylogeny was estimated using segregating SNP sites. Isolates separate distinctly into northern and southern clades, with both clades having 100% bootstrap support. The colored circles correspond to clade and sample type. Metadata for the clinical isolates includes initial fungal burden for a patient, indicated by the Log 10 CFU/ml heatmap, and patient outcomes that are indicated by colored squares in the outer perimeter of the phylogeny.
To assess recombination levels between and across these two clades, we calculated LD decay for the northern clade, the southern clade, and all isolates combined (Supplementary Figure 3 and Table 1). We see similar rates of decay, and identical LD50 values (12 kb) across all groups, indicating that neither clade is recombining exclusively within their group.

Clinical outcomes and longitudinal samples
To assess whether clinical phenotypes or in vitro antifungal susceptibility values significantly differed by clade, we performed a series of statistical analyses to assess associations between clade and phenotype. We did not observe statistically significant differences between clades in considering 24-week mortality, rates of relapse, IRIS, or MICs of itraconazole and amphotericin B. However, some striking differences were detected. While isolates from the southern clade have comparable rates of fungal clearance to isolates from the northern clade ( Supplementary  Figure 4a), the southern isolates exhibit significantly higher baseline fungal burden (Supplementary Figure 4b) (Wilcoxon rank sum test, P = 0.024), significantly higher prevalence of fever (Fisher exact probability test, P = 0.0002), and significantly decreased incidence rates of respiratory failure (Fisher exact probability test, 0.0021) ( Table 2). Given the increased presence of fever and baseline fungal burden across isolates from the southern clade, we sought to assess whether additional factors including patient age, antiretroviral (ARV) treatment status, or CD4 counts were confounders in these analyses. For patients included in this analysis, the median age was 33, the median CD4 count was 10, and 44% of patients had previously received ARV treatment. Neither age, ARV treatment status, or CD4 counts were significantly correlated with the presence of fever or baseline fungal burden, indicating that isolate clade is a contributing factor to some aspects of disease severity in patients from Vietnam.
To assess the genetic relationships between isolates sampled from different body sites (skin vs blood) of a patient at the same time point, and isolates sampled at different time points from the same patient (pre vs post-treatment), we calculated the number of SNP differences between each isolate. Over 80% of isolates sampled at longitudinal time points show a small number of SNP differences (≤10) when compared to their baseline counterparts (Tables 3 and 4), indicating that these longitudinal samples represent the original infecting isolates, having undergone a small number of genetic changes during infection. Some of the genes that were impacted by SNPs arising during infection include a predicted Myb-like binding domain (EYB25_007616), a transcription factor (EYB25_005954), an AMP-binding protein (EYB25_000740), an ORC DNA replication protein (EYB25_004486), and a cytochrome P450 (EYB25_003095). We performed additional tests to assess differences in the MICs between incident and post-drug exposure isolates. We observed a general trend of longitudinal isolates maintaining (62%) or increasing (12%) MIC values over time when compared to their baseline counterparts.
We also observed that a subset of longitudinal isolates from four patients appear unrelated to the primary infecting isolate, based on phylogenetic placement and SNP differences. In total, we observed four longitudinal blood isolates and two contemporaneous skin isolates ( Table 5) that appeared unrelated to the initial blood isolate, or matched time point blood isolate, respectively. With SNP differences between these isolates and their initial or blood counterparts ranging from 1,072 to 2,810 (Table 5), it is likely that these cases represent novel strain introductions or pre-existing multistrain infections, as opposed to a persistent infection of the same isolate over time, suggesting that patients may be exposed to multiple strains of Talaromyces prior to and during active infections, resulting in multistrain infections.

Genomic variants associated with clinical outcomes
Given the range of clinical phenotypes and outcomes associated with isolates from both clades, we leveraged this variability to assess significant associations between clinical phenotypes and pathogen genetics. We took a genome-wide association study (GWAS) approach to test for significance between loss-of-function (LoF) (frameshift and stop-codon gain) mutations and the clinical metadata suggestive of severe infections. Four patients with multiple infecting strains were excluded from this analysis. We saw LoF variants associate significantly (P < 5 × 10 −6 , gemma score test) with poor clinical outcomes including mortality, relapse, IRIS, poor clinical response, and high baseline fungal burden (CFU/ml) ( Table 6). These variants impacted genes with predicted roles regulating gene expression (methyltransferase and GAL4 transcription factor homolog), metabolism and fermentation (alcohol acetyltransferase, GTPase activating protein, and malate synthase), cellular signaling (calcineurin-like phosphoesterase), protein processing (oligosaccharyltransferase, ubiquitin-protein ligase, and exocyst complex component Sec3), and transport (HCO 3 transporter, sugar transporter, and major facilitator superfamily transporter). These isolates also varied in the MIC of itraconazole and amphotericin B, with a small number of isolates displaying notably elevated MIC values (Supplementary Figure 5). While a GWAS did not identify any LoF variants significantly associated with higher MICs to these antifungals, there was very limited power in this analysis due to the low number of isolates with high MIC values. Genetic variation across a population may not only contribute to virulence but may also be selected for through environmental pressures. The variants identified here may give some advantage to survival within the environment of the human host and may be working in tandem with other forms of genetic variation, such as copy number, to elicit more severe clinical phenotypes.

Population diversity and copy number variation
To investigate larger scale genetic variation, we performed copy number variation analysis. We observed no evidence of chromosomal aneuploidy across these isolates but identified multiple genomic regions that appeared duplicated and deleted within this population, relative to the reference genome. Many of these regional duplications and deletions appeared unique to individual isolates ( Fig. 3) with most isolates possessing both deletions and duplications across different loci; however, common deletions were also detected. The most commonly deleted region encodes a predicted afadin-and alpha-actinin-binding region that is deleted across 141 isolates from the northern clade and 125 isolates from the southern clade, relative to the reference strain. The largest number of isolates sharing a duplication across a single locus is 17, duplicated in isolates from the southern clade (this region encodes two reverse transcriptases) and a GAG-pre-integrase domain, likely corresponding to a retrotransposon integration. Of those regions deleted relative to the reference strain that are shared across more than 50 isolates, these encode genes predicted to function as transcription factors (six genes), REDOX proteins (five genes), protein kinases (four genes), methyltransferases (three genes), aspartic proteases (two genes), dehydrogenases (two genes), a cytochrome P450, a glycosyl hydrolase, a glycosyl transferase, a phospholipase, a phosphorylase, an adenylosuccinate lyase, an actin binding protein, a decarboxylase, an aminotransferase, an apoptosis regulating protein, an autophagy    These unrelated isolates did not co-localize across the phylogeny.
regulating protein, an arrestin, and a meiotic nuclear division protein. The range in functionality of genes lost across more than 50 isolates highlights the diversity within this clinical dataset. Of those duplicated regions shared across more than two isolates, these can be characterized by their predicted functions as heat shock and stress response genes, including heat shock protein 70; cell membrane and wall modulating genes, including a chitin synthase and a phosphoylcholine transferase; metabolic regulatory genes, including a cytochrome P450, a thiolase, a glycosyl hydrolase, an aspartyl protease, and a citrate synthase; and gene expression regulatory genes, including a transcription factor and an mRNA capping protein. These results suggest that genes with roles in heat shock, cell wall structure, and metabolic response could provide an advantage to T. marneffei in this setting.   Fig. 3. Frequencies of regional genomic duplications and deletions per chromosome. Frequency of deleted regions (left of chromosome axis) and duplicated regions (right of chromosome axis) assessed across all isolates, with frequency of deletion or duplication represented as distance from axis of the chromosomes (bars). 1000-bp windows were used to assess copy number variation. GC content across chromosomes represented by black (GC 17-40%) and red (GC 50-57%) bars plotted on the chromosome bars.
We did not observe any duplications of the CYP51 ortholog within these isolates, suggesting alternate mechanisms may be responsible for the range in MIC values seen to itraconazole.

Population genomics identify regions under selection
To assess regions that may be under selection within these populations, we performed a population genetics analysis to identify rapidly evolving regions across these isolates. We noted strong signals present across many telomeric and sub-telomeric regions (Fig. 4a), in tests for selective pressure (CLR) (Nielsen et al. 2005), nucleotide diversity (Pi), and population divergence (Dxy). To determine if these regions encode common functions, we performed enrichment analysis on all genes within the sub-telomeric regions. The sub-telomeric regions displaying strong signals for selective pressure were enriched (hypergeometric test, P < 0.05) for genes predicted to play roles in thiamine diphosphate metabolism and energy derivation by oxidation of organic compounds (Supplementary Table 1). When northern and southern clades were assessed independently via CLR analysis, non-sub-telomeric regions exhibiting strong signals for selective pressure across both populations were enriched (hypergeometric test, P < 0.05) for genes predicted to function in ion and organic anion transport, respiration, regulation of gene expression, and energy derivation by oxidation of organic compounds (Supplementary Table 1). Non-sub-telomeric regions displaying signals of population divergence between clades were enriched (hypergeometric test, P < 0.05) for genes predicted to function in anaerobic respiration (Supplementary Table 1).
To assess specific genes under positive selective pressures across these clades, we performed a CLR analysis and found multiple regions with strong signals of selection across northern and southern clades, of which we will highlight four regions of interest (Fig. 4b). Region 1 is under selection across isolates in both clades, but with relatively lower nucleotide diversity and CLR scores in the smaller northern clade (Fig. 4a, region 1). This region shows lower levels of divergence between clades (Fst and Dxy), and appears to be rapidly evolving across all isolates with little clade specificity. This region encodes genes predicted to function as transcription factors (three genes), protein kinases (three genes), a ferric reductase, a methyltransferase, an adenylosuccinate lyase, and a phospholipase. Two additional regions displayed strong selective scores (CLR) across both clades in addition to elevated scores of divergence between clades (Fst and Dxy) (Fig. 4a,  regions 2 and 3). While region 2 shows high levels of nucleotide diversity for both clades, region 3 shows increased nucleotide diversity restricted to the northern clade, accompanied by an increased Tajima's D score for this population. When region 2 and region 3 are interrogated further (Fig. 4b), these two regions divide based on diversity within clade, with tracts of low diversity (Pi) within clades but high divergence (Dxy and Fst) observed between clades, indicating the potential for clade-specific adaptations in these regions. Region 2 encodes genes predicted to function as a cytochrome P450 and a glycosyl hydrolase. Region 3 encodes genes predicted to function as a transcription factor, a protein kinase, an aspartyl protease, and a pyridoxal-dependent decarboxylase.
To determine whether any of these signals may be due to a recent selective sweep, or balancing selection, we assessed scores for Tajima's D. We identified one region that shared a strong signal in both our CLR analysis and our Tajima's D analysis (Fig. 4a, region 4), indicating that this region may have recently undergone a selective sweep. We note that this region also shows an elevated Dxy score, but no increase in Fst between the two populations, indicating increased levels of divergence between the two populations at this region, accompanied by high levels of within-population diversity, as reflected in our measurements of nucleotide diversity (Pi) for each clade at this region. This region encodes 14 genes and multiple ankyrin repeats (EYB25_006920-006938) in total. The predicted functions of these genes include a transcription factor, an endoplasmic reticulum membrane protein, an arrestin, a meiotic nuclear division protein, a glycosyl transferase, a cell surface protein, an alcohol dehydrogenase, and an autophagy protein. These regions feature genes that may be playing important roles across and between clades, highlighting candidate genes for future functional analysis.

Discussion
This large-scale genomic study of T. marneffei has begun to shed light on the population variability that exists within a large endemic area of talaromycosis. Phylogenetic analysis of 336 clinical isolates revealed a population of two major clades, consistent with a previous MLST study that identified geographically structured clades of T. marneffei (Henk et al. 2012). The clades representing both northern and southern Vietnam identified through MLST are recapitulated here using WGS based SNP phylogenies. Across both clades, the MAT1-2 locus is present, occurring in 50% of isolates; however, we are unable to confirm the presence of the MAT1-1 locus in the non-MAT1-2 isolates, due to the reference genomes used in this study (Cuomo et al. 2020) possessing the MAT1-2 locus. We calculate similar rates of LD decay for both clades individually as for together, indicating that recombination is taking place at similar levels within and between clades. Southern isolates show slightly increased clonality, harboring alleles with fewer SNPs and displaying significantly shorter terminal branches than isolates from the northern clade, highlighting the possibility of a more recent introduction.
Leveraging the clinical metadata available for these isolates, we investigated the possibility of clade-based differences with regards to clinical phenotypes and outcomes. We did not observe significant differences between the northern and southern clades with respect to 24-week mortality, rates of relapse, IRIS, clinical response to treatment, or MICs of itraconazole and amphotericin B. However, we detected significant associations between the southern clade and aspects of disease severity, including higher baseline fungal burden and presence of fever. In addition, we identified multiple loss-of-function variants significantly associated with mortality, relapse, IRIS, poor clinical response, and high baseline fungal burden. We noted that many of these clinical phenotypes may occur at low rates for the group sizes, likely impacting our ability to detect both significant differences between clades, and significant associations between variants and phenotypes. In addition, there are many confounding factors that contribute to these clinical outcomes, including host immune status, host response, timeliness of treatment, and antifungal treatment regimen. While we did not find that age, CD4 count, or ARV treatment status confounded associations between the southern clade and higher baseline fungal burden or presence of fever, larger studies of talaromycosis outcomes should be completed to provide increased power for estimations of clinical differences between clades to confirm these findings. Further work using larger group sizes for genotype-phenotype association studies will also be essential for building on our GWAS findings.
Applying a range of population genetics approaches with the variant data generated for these clinical isolates, we interrogated signals of selective pressure, increased nucleotide diversity, and population divergence for these clades. We observed shared signals of rapidly evolving regions present in northern and southern clades, with the strongest shared signals spanning regions containing genes with roles in gene expression regulation, iron sequestration, cellular signaling, and metabolism. Regions exhibiting signals of divergence between populations encoded genes with roles in metabolism and adaptation to environmental niches, sugar metabolism, transcriptional regulation, and protein cleavage. Of note, we observed strong signals of selection at the telomeric and subtelomeric regions, in line with observations for other fungal pathogens such as Cryptococcus neoformans (Desjardins et al. 2017;Sephton-Clark et al. 2022). While sugar transporters are known to be enriched in these regions of C. neoformans, the subtelomeric regions displaying signals of selection in T. marneffei are enriched for genes involved in energy derivation by oxidation of organic compounds, and thiamine diphosphate metabolism.
The phylogenetic analysis of both longitudinal samples and samples obtained from multiple body sites allowed for the inference of infections resulting from single, or multiple infecting isolates. We observed six instances in which longitudinal isolates obtained from a patient appeared unrelated to the primary infecting isolate, hinting at novel strain introductions, or the presence of multiple infecting strains. This is the first time this approach has been applied to T. marneffei to determine whether patients with protracted talaromycosis or infection recrudescence have been re-infected with novel isolates. This approach could also enable the identification of patients that have been chronically exposed, resulting in multiple contained primary infections that break up in the event of becoming immunocompromised. Sequencing of additional isolates at each time point would enable increased resolution to explore this possibility. The majority of patients that underwent longitudinal isolate collection showed continued infection with the basal isolate, in line with similar observations for C. neoformans Rhodes, Beale, et al. 2017).
Combining whole genome sequencing data and patient data for this clinical cohort has enabled a better understanding of the relatedness of infecting isolates, within individual patients, and within the broader population. This approach enables identification of pathogen loci that appear under selection across both geographic clades, and highlights multiple genes that have been repeatedly, and independently, lost or duplicated across these isolates. Patient metadata allowed for association analyses that have given us insights into genetic signatures of the pathogen that are significantly associated with specific patient outcomes. Combining data types such as these, on an ever-larger scale, will afford a better understanding of the genetic basis of pathogenicity in fungal pathogens such as T. marneffei.

Data availability
All sequencing data are available in NCBI via the accession PRJNA949141.
Supplemental material available at GENETICS online.