Genomic epidemiology and evolutionary dynamics of respiratory syncytial virus group B in Kilifi, Kenya, 2015–17

Abstract Respiratory syncytial virus (RSV) circulates worldwide, occurring seasonally in communities, and is a leading cause of acute respiratory illness in young children. There is paucity of genomic data from purposively sampled populations by which to investigate evolutionary dynamics and transmission patterns of RSV. Here we present an analysis of 295 RSV group B (RSVB) genomes from Kilifi, coastal Kenya, sampled from individuals seeking outpatient care in nine health facilities across a defined geographical area (∼890 km2), over two RSV epidemics between 2015 and 2017. RSVB diversity was characterized by multiple virus introductions into the area and co-circulation of distinct genetic clusters, which transmitted and diversified locally with varying frequency. Increase in relative genetic diversity paralleled seasonal virus incidence. Importantly, we identified a cluster of viruses that emerged in the 2016/17 epidemic, carrying distinct amino-acid signatures including a novel nonsynonymous change (K68Q) in antigenic site ∅ in the Fusion protein. RSVB diversity was additionally marked by signature nonsynonymous substitutions that were unique to particular genomic clusters, some under diversifying selection. Our findings provide insights into recent evolutionary and epidemiological behaviors of RSVB, and highlight possible emergence of a novel antigenic variant, which has implications on current prophylactic strategies in development.


Introduction
Respiratory syncytial virus (RSV) is the most common cause of acute lower respiratory tract infection in children aged <5 years worldwide, with an estimated associated mortality of up to 199,000 deaths per year mostly in developing countries (Scheltema et al. 2017; Pneumonia Etiology Research for Child Health Study Group 2019). RSV is also an important cause of community-acquired pneumonia among hospitalized adults of all ages (Dowell et al. 1996). RSV has an enveloped, nonsegmented, single-stranded, negative sense RNA genome of approximately 15,000 nucleotides encoding eleven proteins: NS1, NS2, N, P, M, SH, G, F, M2-1, M2-2, and L (Sullender 2000). RSV clinical isolates are classified into two groups (RSVA and RSVB) based on antigenic and genetic variability (Melero et al. 1997). Distinct genotypes of RSV circulate locally and globally suggestive of rapid global transmission (Bose et al. 2015). The available therapeutic modalities are chiefly supportive, and prophylactic treatment with neutralizing antibodies is effective in reducing morbidity in infants (Neuzil 2016). There is no licensed vaccine for routine use in immunization, however, vaccine candidates and monoclonal antibodies (mAbs) are in advanced clinical trials (Gerretsen and Sande 2017).
We have previously characterized RSV dynamics in coastal Kenya, using the G glycoprotein gene and using whole-genome sequences of RSVA genotype ON1, almost exclusively from samples from pneumonia patients admitted to the Kilifi County Hospital (Agoti et al. 2015a;Otieno et al. 2016Otieno et al. , 2018. From these studies, RSV displays high genetic diversity of locally circulating strains, within and between consecutive epidemics. Furthermore, recurrent RSV epidemics in Kilifi are depicted by sequential replacement of genotypes, over the long term, and high turnover of variants within genotypes in the short term (Agoti et al. 2015a;Otieno et al. 2016). In the current study, samples arise from a design aimed to limit temporal, age-related, illness severity, geographical, and health care access bias. Recruitment was carried throughout a study location, from representative health facilities, simultaneously, and of patients of any age with mild acute respiratory symptoms (Nyiro et al. 2018).
Phylodynamic methods have been used to study molecular epidemiology and evolutionary dynamics of RNA viruses including Ebola, Zika, influenza, and coronaviruses (Faria et al. 2013(Faria et al. , 2017Lemey et al. 2014;Dudas et al. 2017;Zehender et al. 2017;Sironi et al. 2020). However, despite the importance of RSV to pneumonia hospitalization and mortality among children (Pneumonia Etiology Research for Child Health Study Group 2019), there are few equivalent genome-scale studies to examine RSV transmission and evolution particularly within a community setting (Agoti et al. 2015b(Agoti et al. , 2017(Agoti et al. , 2019Otieno et al. 2018;Trovao et al. 2019). While most studies on RSV focus on the G glycoprotein gene because of its high genetic diversity and utility as a phylogenetic marker, genome-wide genetic signatures additionally inform on diversity and the adaptive mechanisms following introduction into the population (Otieno et al. 2018).
We measured genomic diversity, spatial and temporal circulation of RSVB in rural Kilifi, coastal Kenya, from samples collected through outpatient surveillance, analogous to studying community RSV epidemics. We present estimates of rate of evolution and time since the most recent common ancestor (tMRCA) and infer viral population dynamics over two consecutive RSV epidemics in coastal Kenya. In particular, we identify emergence of a novel RSVB variant carrying distinct amino acid (aa) signatures.

Study design and sample testing
RSV is highly seasonal in Kilifi, Kenya, starting from November through May, with a peak around January and an average duration of 18 weeks (Nokes et al. 2009). This study was carried out within the Kilifi Health and Demographic Surveillance System (KHDSS) area (Scott et al. 2012) and used samples collected from December 2015 to July 2017, a period covering two RSV seasonal epidemics (2015/16 and 2016/17). Nine public outpatient health facilities in KHDSS were purposively selected (Matsangoni (MAT), Ngerenya (NGE), Mtondia (MTO), Sokoke (SOK), Mavueni (MAV), Jaribuni (JAR), Chasimba (CHA), Pingilikani (PIN), and Junju (JUN)) to provide a broad representation covering major road networks and variation in population density ( Fig. 1) (Nyiro et al. 2018). Participant recruitment and specimen collection was integrated within the routine patient care as detailed in Nyiro et al. (2018). Written individual informed consent was sought from adult patients and parents/guardians of patients below 18 years. Nasopharyngeal swabs (NPS) were screened for RSVA and RSVB using a multiplex real-time PCR assay system (Hammitt et al. 2011;Kamau et al. 2017). RSV positives were defined as samples with a cycle threshold (Ct) <35.0.
The study was approved by the Kenyan KEMRI-Scientific and Ethical Review Unit (SERU# 3103) and the UK University of Warwick Biomedical and Scientific Research Ethics Committee (BSREC# REGO-2015-6102).

RSVB whole genome sequencing and data assembly
Whole genome amplification and sequencing was attempted for all RSVB positive samples. Reverse transcription and PCR amplification were performed with a six-amplicon, six-reaction strategy presented in detail in Agoti et al. (2015b). Methods for quality checking of the sequence reads, depletion of human reads, consensus genome assemblies and calculation of coverage depth, were as described in Otieno et al. (2018).

Data compilation and sequence alignment
A global dataset was prepared by retrieving RSVB complete genomes from GenBank sampled between 2012 and 2016. Sequences published without date or location of sampling were excluded. For all analyses, sequence alignment was done using MAFFT v.7.221 (Katoh and Standley 2013) and the parameters 'localpair -maxiterate 1,000'. Because of sparse data at the genome termini and in the noncoding regions, only the coding genomic regions were used for all analyses.

Tests of temporal signal
A maximum-likelihood tree of the Kilifi dataset was estimated using IQ-TREE 1.6 (Nguyen et al. 2015). The best-fit substitution model was tested and chosen during the tree building process and the approximate likelihood ratio test was applied to assess the reliability of key branches of the trees (1,000 replicates). To examine the degree of temporal signal or signal of divergence accumulation over the sampling time interval, we first followed a standard exploratory linear regression approach. Root-to-tip distances were plotted as a function of sampling time according to a rooting that maximizes the Pearson product-moment correlation coefficient using TempEst (Rambaut et al. 2016).
We also explored an alternative test of temporal signal using a Mantel test (Murray et al. 2016) to identify whether genetically similar taxa were more likely to have been sampled around the same time. With this method, the significance of the correlation between the root-to-tip distances and sampling times was estimated against a null distribution obtained by reassigning the sampling dates to the taxa 1,000 times. We repeated the Mantel test incorporating a clustered permutation approach (Duchê ne et al. 2015b) where dates were randomly reassigned (1,000 data replicates) among defined clusters of taxa. To identify clusters for clustered permutation, we used the patristic distances generated by IQ-TREE above and plotted a distribution plot of the distances in R (R Core Team 2014) ( Supplementary Fig. S1). A cutoff value (0.0024 nucleotide substitutions per site) determined as the least value between the first and second peaks in the distribution (Supplementary Fig. S1) was used to define clusters as sequences within genetic distance threshold of <0.0024. The Mantel tests were implemented in R using Murray et al.'s scripts (Murray et al. 2016).
The significance of linear regression of sampling dates against root-to-tip distances in the clustered permutation analyses was assessed by a P-value: comparison of the observed data correlation coefficient (r) to the r estimates for the data replicates with dates randomly permuted over the defined clusters.
Further, we complimented the linear regression permutation tests with Bayesian dating permutations done in 100 million steps and sampling every 10,000. This involved creating ten datasets with randomly permuted tip dates as suggested in Murray et al. (2016). The significance of the Bayesian tip-date-informed evolutionary rate was determined by comparing the mean substitution rate estimate from the observed data with the 95 per cent HPDs estimated from the date randomized datasets for which no particular relationship between sampling time and root-to-tip divergence is expected while assuming the same evolutionary models in all the Bayesian dating analyses (Firth et al. 2010).

Bayesian analyses
Time-calibrated phylogenies were done using BEAST v1.10 (Suchard et al. 2018). To model the nucleotide substitution process, the codon positions were partitioned into 1st þ 2nd vs. 3rd positions. The HKY substitution model with a discretized gamma distribution was used to model rate heterogeneity across sites (Shapiro, Rambaut, and Drummond 2006). An uncorrelated lognormal relaxed molecular clock was used to accommodate variation in evolutionary rate among lineages ) and a nonparametric skyride demographic model with time-aware smoothing was selected. The molecular clock rate was set to use a noninformative continuous time Markov chain rate reference prior. The analyses were done in 200 million steps, sampling every 10,000. Stationarity and mixing (e.g. based on effective sample sizes >200 for the continuous parameters) were examined using Tracer version 1.7. The Bayesian dating permutation tests (described above) were done in 100 million steps, sampling every 10,000. Maximum clade credibility (MCC) trees were generated from the BEAST posterior tree sets using TreeAnnotator.

Phylogeny-trait association analysis
For the Kilifi dataset alone, we used the Bayesian Tipassociation Significance (BaTS) software (Parker, Rambaut, and Pybus 2008) to assess the strength of geographic clustering in the posterior tree distribution obtained from BEAST analyses. The overall statistical significance was determined by estimating the parsimony score (PS) and association index (AI) metrics, where the null hypothesis is that clustering by geographic location is not more than that expected by chance. In addition, the maximum clade (MC) size metric was used to compare the strength of clustering at each location by calculating the expected (null) and the observed mean clade size from each study location. A significance level of 0.05 was used in all cases. The PS, AI, and MC statistics were computed for a null distribution with 1,000 replicates.

Selection analyses
Gene-specific nonsynonymous to synonymous substitutions (dN/dS) ratios were estimated using the SLAC method (Weaver et al. 2018). We also investigated episodic positive or diversifying selection using MEME and FUBAR methods. MEME aims to detect sites evolving under positive selection in a proportion of branches, while FUBAR uses a Bayesian approach and assumes that selection pressure is constant along the entire phylogeny.

Sequence data availability
The sequencing reads are available in the NCBI BioProject database under the study accession PRJNA562116 and the genomes generated in this study are available in GenBank under accession numbers MN365302 to MN365600.

RSVB occurrence in Kilifi, 2015-17
Between December 2015 and July 2017, 8,127 NPS samples were tested for RSV, and 503 (6.2%) were positive (C t < 35). Among the RSV positive samples, 95 (18.9%) were RSVA and 408 (81.1%) were RSVB. The frequency and monthly pattern of occurrence of RSVB for each participating health facility are shown in Fig. 2. Overall, the proportion of RSV positive individuals differed by age (P value <0.001) and study location (P value ¼ 0.003) (Supplementary Table S1 Table S1). The peak period for RSV case detections occurred from November to May the following year.

Genome characteristics and relative genetic diversity
Sequencing and data assembly was successful for 299/408 (73.3%) RSVB positive samples. The remaining 26 per cent (109/ 408) were not sequenced at sufficient depth or the read quality was low. The final dataset of the Kilifi RSVB samples consisted of 295 coding-complete genomes. The median genome length was 15,025 (range 11,519-15,257 nt). All the sequenced RSVB viruses belonged to the BA genotype, characterized by the presence of 60-nt duplication in the C-terminal region of the G glycoprotein gene. Genome coverage did not vary by rRT-PCR C t value. Across the genome length, there were 838 consensus level single nucleotide polymorphisms: 554/838 (66%) were parsimony informative, 503/554 (91%) were located within coding regions, and 332/503 (66%) were nonsynonymous. Nonsynonymous changes were higher at the mucin-like domains of G gene; in the N-terminal of fusion (F) gene; as well as in the N-and C-terminals of RNA-dependent RNA polymerase (L) gene (Fig. 3A). Bayesian demographic reconstruction (Fig. 3B) showed seasonal periodicity in relative genetic diversity corresponding with RSVB incidence. Such temporal resolution of changes in the viral population size could imply sufficient sampling density (Otieno et al. 2016). A decline in relative genetic diversity was observed between the two epidemics indicating lineage or variant replacement.

Phylogenetic relationships and spatial structure
We identified six well supported clades based on the phylogenetic positioning of the Kilifi samples in the context of globally sampled RSVB genomes (Fig. 4A). Three clades (II, III, and IV) further segregated into defined temporal sub-clades. The absence of external sequences nested within the Kilifi clades might suggest local persistence and diversification, although we cannot exclude importation events from unsampled locations. Distinctively, clades V and VI solely contained viruses from the 2015/16 and 2016/17 epidemics, respectively, while the other clades contained samples from both RSV epidemics. For each clade, we estimated the duration or persistence based on sample collection dates and the time of divergence (Supplementary Table S2), but these inferences might be biased due to assorted sampling locally and globally.
The inferred clades were spatially disseminated (Fig. 4B), suggesting frequent mixing within Kilifi, and none was restricted to a single study location. We calculated AI and PS values statistics to assess the strength of spatial association with the RSVB transmission patterns in Kilifi. The analysis showed high AI and PS values (Table 1), suggesting strong and significant spatial structuring for each location (P < 0.001). Geographic clustering was significant in at least 8/9 study locations as shown by the MC size values (Table 1). Differences in the observed and expected MC values (Table 1) suggested that Mavueni exhibited the most spatial structure (difference of 8.7) and Mtondia had the least (difference of 0.3).

Temporal signal and molecular dating
The standard linear regression exploration of the Kilifi dataset showed an overall correlation between the root-to-tip distances  and time (correlation coefficient of 0.85), and a clear difference in root-to-tip distances between the two RSV epidemics (Fig. 5A). A Mantel test (Murray et al. 2016) applied to the Kilifi dataset however found evidence of significant confounding between temporal and genetic structures (P ¼ 0.001) and indicated that an alternative approach to date randomization (clustered permutation) should be used to test for temporal structure. For this, we grouped the Kilifi genomes into twenty clusters using the pairwise patristic distance threshold of 0.0024 (see Section 2) and repeated the Mantel test with sampling dates permuted over the twenty clusters (1,000 permutations). The Mantel test after clustering confirmed that our choice of clusters was sufficient to eliminate the confounding (P ¼ 0.98).
To assess the significance of the correlation between phylogenetic root-to-tip distance against sampling time, we performed a linear regression with a clustered date permutation test (100 permutations). The clustered permutation regression test reached significance (P ¼ 0.001), and the r estimate with the observed data was outside the range of r estimates obtained using date randomization ( Supplementary Fig. S2), indicating that the observed correlation between the root-to-tip distances and time differs significantly from what is expected in the absence of a temporal structure (Laenen et al. 2019).
We further evaluated the presence of temporal signal with Bayesian dating permutation implemented in BEAST. Sampling dates were permuted ten times over the clusters defined earlier and the substitution rate estimates from the date-randomized datasets compared to the estimates from the true ordering of dates. The substitution rate of the true observed data was estimated as 9.922 Â 10 À4 (95% HPD: 8.14 Â 10 À4-1.18 Â 10 À3 ) nucleotide substitutions/site/year and did not overlap with HPD intervals obtained from the date-randomized datasets (Fig. 5B). This indicates an appreciable presence of temporal signal and structure in the data. The tMRCA for the Kilifi samples was estimated to 2012 (95% HPD: 2011.1-2013.5).

Amino acid diversity
An alignment of aa sequences showed mutations characteristic to each clade (Supplementary Table S2). In the two RSV epidemics, there were at least 37 and 93 aa substitutions in F and G gene, respectively. In F gene, three aa substitutions (A103V, L172Q, and S173L) were found in antigenic site V in the 2015/16 viruses. These three substitutions were also circulating from 2015 to 2017 in the USA (Bin et al. 2019) and South Africa (Liu et al. 2020), and in both countries, the polymorphisms were detected continuously and in increasing frequency. However, none of the samples from the 2016/17 epidemic in Kilifi had A103V, L172Q, and S173L substitutions probably due to removal by purifying selection. It was suggested that the three substitutions likely emerged in 2014/15 as they were not present in RSV F sequences prior to 2013 (Bin et al. 2019).
Six distinct aa variants-K68Q in F gene; Y90H, L91F, T225N, T273I, and A301T in G gene-occurred together at a frequency >40 per cent in the 2016/17 epidemic and distinctively characterized the sixth clade (Fig. 4A). Importantly, the K68Q substitution is found at the antigenic site 1 of pre-fusion F protein conformation, a binding epitope of mAb MEDI8897 (Zhu et al. 2017

Selection pressure analyses
We estimated higher global nonsynonymous (dN)/synonymous (dS) substitution rate ratios for G and SH glycoproteins than other genes (Table 2). SLAC analyses identified three aminoacid sites (135, 217, and 285) in G gene under significant positive selection (P < 0.1). MEME analyses detected three diversifying codons in the F gene, and eleven in the G gene (P < 0.1) ( Table 2). The FUBAR method identified two codon sites in F gene and seven in G gene, under episodic positive selection with significant support (posterior probability >0.9) ( Table 2).

Discussion
This study provides insights into the genomic diversity of RSVB in Kilifi county, coastal Kenya. We obtained 295 complete genomes from representative sampling across the KHDSS area, over two consecutive RSV epidemics. The two epidemics comprised of multiple co-circulating virus clade introductions, which circulated in all the study locations, suggesting substantial spatial spread and transmission in a relatively short time.
Although RSV surveillance has improved globally, complete genomes data from recent years remains insufficient and may have limited our inference of spatial and temporal placement of RSVB in Kilifi. In the two epidemics, there was a strong spatial structure of the viral population indicating local transmission within the populations neighboring a health facility.
Tip-dated inferences are reliable only if the sequence data exhibits temporal signal Firth et al. 2010). Conventionally, this is based on the fit of linear regression between sampling time and root-to-tip distance, and assuming statistical independence, a significant positive correlation would indicate presence of evolutionary change within the dataset timescale (Drummond, Pybus, and Rambaut 2003;Drummond et al. 2006). However, linear regression tests alone can be misleading when there is substantial rate variation among lineages, non-uniform distributed sampling times (Rieux and Balloux 2016), imbalanced trees and confounding temporal and genetic structures (i.e. closely related sequences arising from sampling at similar times, e.g. during an outbreak) (Duchê ne, Duchê ne, and Ho et al. 2015a; Duchê ne et al. 2015b). Date randomization tests whether the observed tip-date informed estimates deviate from estimates obtained in absence of temporal structure (Duchê ne et al. 2015b). The Kilifi dataset was temporally and genetically confounded. However, the temporal signal in the data remained when the clustered date permutation approach was used confirming detectable temporal structure and assuring reliability of the observed tip-dated substitution rate and tMRCA inferences. Confounding may have arisen naturally from our clinical sampling protocol or from the evolutionary process itself, as suggested previously (Murray et al. 2016). According to Vrancken et al. (2020), intense sampling of closely related sequences produces a rapid succession of coalescent events just before sampling, reminiscent of a panmictic population that is declining in size and in turn biases the evolutionary rate estimate and results in misleadingly recent tMRCA (Vrancken et al. 2020). An evolutionary change in the genetic constitution of a virus population could lead to sequences sampled synchronously being more closely related, for instance in the 'ladderized' Influenza A genealogies, hence inherent temporal and genetic confounding (Murray et al. 2016).
A previous study showed that K68N substitution in F gene affected binding of MEDI8897, an RSV pre-F-specific human mAb under clinical evaluation as a passive immunization of all infants entering their first RSV season (Domachowske et al. 2018;Zhu et al. 2018). It is probable that the K68Q substitution identified in clade VI in Kilifi promoted evasion of pre-existing immune responses. Unexpectedly, in our data, the F gene aa position 68 was not under detectable selection pressure. An explanation would be the low rate of nonsynonymous evolution (conversely, high sequence conservation) at position 68 in our dataset, or immune driven positive selection could not be identified by methods here. In any case conventional approaches for measuring selection pressure consistently detect positive selection only at codon sites with high rates of nonsynonymous evolution (Kosakovsky Pond and Frost 2005).
Our study provides a novel sequence polymorphism (K68Q) within the MEDI8897 binding site with a frequency of nearly 50 per cent in our study population. Additionally, the viruses with the K68Q mutation carried five distinctive amino-acid mutations in G gene, including two consecutive codons (Y90H and L91F). We are not certain whether these codon replacements are due to nonselective epidemiological processes or are compensatory mutations that retain protein function, or hitchhikers carried along by chance (Smith et al. 2004). Still, we cannot exclude the possibility that these are relevant antigenic epitopes.
In conclusion, we present the utility of genomic analyses to investigate virus transmission and genetic diversity including detection of a novel antigenically distinct variant. Further studies are required to determine whether the K68Q mutation is adaptive and/or a result of escape from antibody-mediated Table 2. The predicted nature of selection pressures acting on each genomic region: first column shows the computed mean dN/dS rate ratio using SLAC and the second column shows amino-acid sites in F and G gene under episodic selection as identified by MEME analyses. selection and constitutes a naturally acquired antiviral resistance that disrupts neutralizing antibody recognition and binding. Our study underscores the need for continued genomic surveillance of F and G protein antigenic sites as this has implications on RSV therapeutic and vaccine development. An important future effort for us is to assess if the K68Q mutation has become more prevalent and gradually fixed since the 2016/17 epidemic. Additional sequencing of RSVB from other regions in Kenya and neighboring countries is also essential to refine evolutionary dynamics and draw better conclusions about geographic origins of viral introductions in Kilifi. The present study makes publicly available a large number of newly sequenced RSVB genomes useful for further molecular evolution studies.

Data Availability
The replication data and analysis scripts for this manuscript are available from the Harvard Dataverse: DOI: https://doi.org/10. 7910/DVN/RYISUE. Some of the clinical dataset contains potentially identifying information on participants and is stored under restricted access. Requests for access to the restricted dataset should be made to the Data Governance Committee (dgc@kemri-wellcome.org).

Supplementary data
Supplementary data are available at Virus Evolution online.