DISSEQT—DIStribution-based modeling of SEQuence space Time dynamics

Abstract Rapidly evolving microbes are a challenge to model because of the volatile, complex, and dynamic nature of their populations. We developed the DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics) for analyzing, visualizing, and predicting the evolution of heterogeneous biological populations in multidimensional genetic space, suited for population-based modeling of deep sequencing and high-throughput data. The pipeline is openly available on GitHub (https://github.com/rasmushenningsson/DISSEQT.jl, accessed 23 June 2019) and Synapse (https://www.synapse.org/#!Synapse:syn11425758, accessed 23 June 2019), covering the entire workflow from read alignment to visualization of results. Our pipeline is centered around robust dimension and model reduction algorithms for analysis of genotypic data with additional capabilities for including phenotypic features to explore dynamic genotype–phenotype maps. We illustrate its utility and capacity with examples from evolving RNA virus populations, which present one of the highest degrees of genetic heterogeneity within a given population found in nature. Using our pipeline, we empirically reconstruct the evolutionary trajectories of evolving populations in sequence space and genotype–phenotype fitness landscapes. We show that while sequence space is vastly multidimensional, the relevant genetic space of evolving microbial populations is of intrinsically low dimension. In addition, evolutionary trajectories of these populations can be faithfully monitored to identify the key minority genotypes contributing most to evolution. Finally, we show that empirical fitness landscapes, when reconstructed to include minority variants, can predict phenotype from genotype with high accuracy.

Genome sequencing began with microbes: MS2 in 1975 (Fiers et al. 1976), Haemophilus influenzae (Fleischmann et al. 1995), and Mycoplasma genitalium (Fraser et al. 1995). The study of microbial evolution by phylogenetics has since benefited from the hundreds to tens of thousands of consensus sequence genomes available for many microorganisms. More recently, high-throughput sequencing (HTS) technologies have added new depth to sequence data, capable of quantifying minority variants within the population that differ from the consensus sequence. For example, HTS studies of RNA viruses indicate that both experimental and clinical samples present hundreds to tens of thousands of low-frequency variants, constituting single nucleotide polymorphisms at nearly every nucleotide site along the genome (Acevedo, Brodsky, and Andino 2014; Bordería et al. 2015). Even before HTS, phenotypic differences between populations with the same consensus sequence had been observed and attributed to suspected differences in variant composition (Vignuzzi et al. 2006). However, characterization of these mutant 'swarms' has generally been limited to mean measures of overall diversity (e.g. Shannon entropy, mean variance, etc.). In a few cases, mixed populations of single nucleotide variants were shown to contribute significantly to virus pathogenesis (Vignuzzi et al. 2006), fitness, and phenotype (Bordería et al. 2015; Xue et al. 2016), but these studies focused on only a few variants. Since a mixed population can constitute an evolutionarily stable strategy (ESS) (Smith and Price 1973; Reiter et al. 2015), the population might aim for an equilibrium where multiple variants coexist.
The rapidly expanding field of single-cell sequencing illustrates how the role of heterogeneity can be studied in ever greater detail. The data are, however, complex and noisy, which presents new challenges in the development of algorithms and techniques for analysis, representation, and visualization (Bacher and Kendziorski 2016; Gawad, Koh, and Quake 2016; Svensson et al. 2016; Perkel 2017; Russell, Trapnell, and Bloom 2017). Although phylogenetic tools are well suited for understanding the evolutionary history of lineages and the relationships between lineages/individuals based on whole-genome consensus sequence data, they cannot take into account the variant composition hidden by the consensus. Higgins (1992) circumvents these issues by applying multidimensional scaling (MDS) for exploratory analysis, keeping distances between samples more in line with the measured quantities. PhyloMap (Zhang et al. 2011) superimposes phylogenetic trees on the MDS representation, trying to get the best from both worlds. Relying on consensus sequences only, these models are, however, not well suited for comparing populations that may be identical at the consensus level but differ in key aspects of their minority variant composition. Other tools are thus needed to adequately represent and visualize a microbial population in sequence space, focusing on where something is, rather than how it got there. Theoretical fitness landscape models, including Wright's (Wright 1932) and Kauffman's NK landscapes (Kauffman and Weinberger 1989), which use two parameters to model landscape ruggedness, paved the way for more recent advances in which landscape models are (partially) based on empirical data. One approach is to study the impact of mutations at a few loci only (Whitlock and Bourguet 2000; Collins et al. 2004), thus artificially enforcing a low dimension of sequence space. To expand the fitness landscape analysis to a higher dimensional setting, Kouyos et al.
(2012) utilized predictive models for in vitro fitness based on the amino acid sequence. For RNA viruses, the mathematical framework provided by quasispecies theory has been used to describe the population dynamics of these pathogens (Biebricher and Eigen 2006). Seifert et al. (2015) assumed that viral populations had reached mutation-selection equilibrium and applied the quasispecies equation to infer fitness values for the haplotypes in a swarm. However, it is generally accepted that mutation-selection balance is not reached throughout most stages of infection and under most experimental conditions. Nevertheless, it is tempting to think that a proper analysis of an evolving population may foretell whether and where the population will move in genotypic space (Stapleford et al. 2014). We believe that to attempt such an analysis, in an accurate and unbiased manner, we need to consider the mutant swarm (a population of closely related viral particles) in its entirety, rather than just the consensus sequence (the most frequent residue at each position) or other oversimplifications of the complex population structure. Minority variants (variants present in the swarm, but not in the consensus sequence) should be included in the analysis and not subjected to an arbitrary cutoff frequency. Viruses, the fastest mutators with small genomes, make ideal model organisms for studying short time-scale population dynamics and will thus be used to showcase the methods developed in this work.
Here, we present DISSEQT (DIStribution-based SEQuence space Time dynamics)-a pipeline for analyzing evolution of microbial populations. At the core is the ability to accurately represent the genetic heterogeneity of a mutant swarm, by modeling the population as a distribution over sequence space, thus making it possible to describe similarities and differences between populations down to the minority level, and to couple sequence space composition to phenotypic effects. We demonstrate the DISSEQT pipeline with examples from RNA virus evolution. First, we show how the DISSEQT sequence space model can uncover biologically relevant features. Second, we followed the evolutionary trajectories of longitudinal samples of experimentally evolved viral populations. Finally, by developing a fitness landscape model based on empirical fitness measurements, we demonstrate how phenotypic effects can be predicted from the population composition. Specifically, we show that the sequence space in which viral populations evolve are of relatively low dimension, and that biologically relevant signals can be readily captured and used to identify the key variants contributing most to phenotype. We confirm that minority variants contribute significantly to phenotype and must be taken into account for accuracy of genotype-phenotype prediction.

Overview of the DISSEQT pipeline
The DISSEQT pipeline (Fig. 1, top panel) is designed for reproducibility and openness, from the ground up, using modern software solutions. The source code is openly available on GitHub and all software dependencies are open source. The software can either be installed locally or run directly from Docker images with all required software preinstalled. Running from Docker images simplifies setup and improves reproducibility, since differences between local runtime environments are eliminated.
The overview described here is detailed in Section 4. The DISSEQT pipeline has three steps, serving different purposes: (1) establishing a model for sequence space; (2) reducing noise to make the model robust; and (3) using the model for visualization and phenotype prediction.
First, the raw reads for each sample (mutant swarm) were aligned iteratively until the consensus sequence converged and both automatic and manual quality controls were performed.
Maximum likelihood estimation was used to infer the codon frequencies for each position, using all reads overlapping that position, based on a multinomial model with noisy observations (due to sequencing errors). An initial sequence space representation was then constructed using the codon frequencies and a limit of detection estimated for each possible variant at each site, since sequencing error rates depend on the nucleotide context (Laehnemann, Borkhardt, and McHardy 2016) and the type of substitution (Fox et al. 2014). In this article, we focus on coding regions, which makes codons the natural basis for sequence space modeling, since they are closely connected to biological function and this choice does not impose any assumptions about the relative importance of synonymous versus nonsynonymous changes. All methods presented here are also applicable to non-coding regions, by basing the sequence space model on nucleotides rather than codons.
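As an illustration of this inference step, the following is a minimal sketch of maximum likelihood codon frequency estimation via expectation-maximization. It assumes a uniform error model over the 63 alternative codons and a single error probability per read; the actual pipeline uses context- and substitution-dependent error rates derived from per-base quality scores, and the function name here is our own:

```python
from collections import Counter

def estimate_codon_frequencies(observations, n_iter=200, tol=1e-10):
    """EM estimate of per-site codon frequencies from noisy read observations.

    observations: list of (observed_codon, error_prob) pairs, one per read
    overlapping the site; error_prob would come from the base quality scores.
    Sequencing errors are assumed uniform over the 63 alternative codons
    (a simplifying assumption, not the pipeline's actual error model).
    """
    codons = sorted({c for c, _ in observations})
    freqs = {c: 1.0 / len(codons) for c in codons}
    for _ in range(n_iter):
        expected = Counter()
        for obs, e in observations:
            # P(obs | true codon c): (1 - e) if c == obs, else e / 63
            weights = {c: freqs[c] * ((1 - e) if c == obs else e / 63)
                       for c in codons}
            total = sum(weights.values())
            for c, w in weights.items():
                expected[c] += w / total  # posterior responsibility
        new = {c: expected[c] / len(observations) for c in codons}
        converged = max(abs(new[c] - freqs[c]) for c in codons) < tol
        freqs = new
        if converged:
            break
    return freqs
```

With low per-read error probabilities, the estimates stay close to the raw observed proportions; the error model matters most near the limit of detection.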
Second, a dimension estimate of the data was obtained by generating a Talus plot (Fig. 1, bottom left panel, and Supplementary Material), after which noise reduction was performed by SubMatrix Selection Singular Value Decomposition (SMSSVD) (Henningsson and Fontes 2019). SMSSVD is ideal for situations where complex data containing a very large number of variables have signals spread out over different (possibly overlapping) subsets of variables, with the goal of recovering all signals that can be detected, rather than only the strongest one.
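The Talus plot and SMSSVD are defined in the cited papers; as a rough stand-in (not the published algorithms), a profile of consecutive singular value log-ratios illustrates how a dimension estimate can be read off from where the profile settles around a low mean. The settling criterion and function names below are illustrative assumptions:

```python
import numpy as np

def talus_profile(X):
    """Consecutive log-ratios of singular values, log(s_k / s_{k+1}).

    The profile stays high while signal dimensions are being peeled off
    and settles around a low mean once only noise remains (a rough
    stand-in for the Talus plot of Henningsson and Fontes 2019).
    """
    s = np.linalg.svd(X, compute_uv=False)
    return np.log(s[:-1] / s[1:])

def estimate_dimension(X, window=5):
    """Heuristic estimate: first index from which the profile stays within
    three standard deviations of the tail mean over a short window (an
    illustrative criterion, not the published one)."""
    prof = talus_profile(X)
    tail = prof[len(prof) // 2:]
    threshold = tail.mean() + 3 * tail.std()
    for k in range(len(prof)):
        if np.all(prof[k:k + window] < threshold):
            return k
    return len(prof)
```

On a matrix with a strong rank-3 signal plus Gaussian noise, the largest log-ratio occurs at the third singular value and the profile settles immediately afterwards.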
Finally, the resulting sequence space representation was used for visualization and phenotype prediction. The evolutionary trajectories of viral populations were followed through time, using sparse methods to find low-frequency minority variants arising and driving the movement of the population in sequence space. Empirical fitness values were used to create fitness landscapes for prediction, using the population representations from Step 2, and for visualization, after an additional nonlinear dimension reduction step vital for obtaining a useful representation in 2d. The sequence space model created by the DISSEQT pipeline is also intended to be used as input to other software packages, e.g. for clustering and regression.

Figure 1 (Representation): per-sample raw sequencing data is passed through automatic quality control and aligned to a reference genome. Codon frequencies are inferred using quality scores in the aligned data and the limit of detection is estimated for each codon at each site. These are combined to form the sequence space representation. Consensus change reports and read coverage plots aid manual quality inspection.

We generated three additional synthetic synonymous (SynSyn) virus lineages (Fig. 2), some of which were previously published (Moratorio et al. 2017), in which these 117 codons were changed to belong exclusively to only one of three codon categories. These lineages were designed to retain the initial functional neutrality (i.e. same protein sequence), while occupying different starting points and potentially different trajectories in sequence space. Indeed, the differences in fitness values and phenotypes are small in comparison to the differences we observe within the lineages in the experiments described below (see Supplementary Fig. S1). However, the lineages should behave differently as mutations accumulate at these codons, by accessing different mutational neighborhoods with differing impacts on virus fitness.
Next, to introduce changes in minority variant composition without significantly altering consensus sequences of the mutant swarms, we evolved these virus populations under different conditions. Wild type and SynSyn viruses were serially passaged five times in triplicate under normal conditions, as well as in five different mutagenic conditions that are known to increase this virus's mutation rate (Beaucourt et al. 2011): three base analogs (ribavirin, 5-fluorouracil, and 5-azacytidine), amiloride, and Mn2+. Low to moderate concentrations were used to accelerate evolution, while higher concentrations were employed to exacerbate fitness effects. We thus obtained 411 mutant swarms (301 passing strict quality controls) from these varied growth conditions, which were deep sequenced to obtain their entire variant compositions. Importantly, passaged samples in each lineage did not have significant consensus changes (in total across the samples, 144 substitutions at 4 different receptor binding sites and 35 substitutions at 12 other sites).

DISSEQT reveals that the sequence space occupied by evolving viral populations is of intrinsically low dimensionality
Theoretical sequence space is incredibly large: even for a small genome of length n = 10,000, the number of possible sequences is 4^n ≈ 4 × 10^6020, so large that the number of atoms in the universe is minuscule in comparison. The number of sequences reachable within just K = 10 mutations, Σ_{k=1}^{K} C(n,k)·3^k ≈ 1.6 × 10^38, is still vast, and it is unknown how much of sequence space is occupied by an evolving viral population. We generated Talus and Projection Score (Fontes and Soneson 2011) plots from the sequence data, which provide a visualization of how the contents of a data set spread out across different dimensions. These plots provide a qualitative estimate of the number of dimensions needed to capture all biologically relevant signals that stand out above the background noise. As shown in Fig. 1, bottom left panel, the Talus plot settles after thirteen dimensions, with small variations around a low mean, giving a dimension estimate of thirteen. In the Projection Score plot (Fig. 1, bottom right panel), SMSSVD has detected three signals, of dimensions 3, 5, and 5, where the variance filtering threshold for automatic noise reduction has been optimized for each signal. Next, we examined which biological signals were captured in each dimension and whether incorporating minority variants in the analysis allows for improved monitoring of evolving populations, compared to consensus sequence analysis. To do this, sequence space representations of the mutant swarms were generated after noise reduction, where the final SMSSVD step decomposed the samples by principal components. Since almost no consensus changes occurred during the experiment, the principal components found patterns essentially related to differences in minority variants between mutant swarms. As shown in Fig. 3, the strongest signal, described by the first three principal components, clearly separates the samples in sequence space according to lineage (see Rows 1-3, above the diagonal, in Fig. 3).
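The sequence space sizes quoted at the start of this section can be checked directly:

```python
import math

n, K = 10_000, 10

# Total number of possible sequences of length n over a 4-letter alphabet:
# 4^n has n * log10(4) ≈ 6020.6 decimal digits, i.e. 4^n ≈ 4 × 10^6020.
digits = n * math.log10(4)

# Number of sequences within K point mutations of a given genome:
# choose which k of the n sites mutate, times 3 alternative nucleotides
# per mutated site.
reachable = sum(math.comb(n, k) * 3**k for k in range(1, K + 1))
```

The dominant term is k = K, so the sum is close to C(10,000, 10) · 3^10 ≈ 1.6 × 10^38, matching the figure in the text.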
Importantly, further analysis of lower dimensions identified all biological treatments that were imposed on the viral populations. A complete separation in sequence space was observed for mutagenic treatment by 5-fluorouracil, ribavirin, and 5-azacytidine (see Rows 4, 5, and 7 below the diagonal, in Fig. 3), known to introduce specific nucleotide substitution biases. Even for treatment with Mn2+ and amiloride, which increase natural mutation rates without introducing nucleotide bias, a biological signal could be identified in most of the mutant swarms separating from other samples in Rows 9 and 11 (Fig. 3). Furthermore, these signals are detected despite the background noise and error introduced by the sample preparation and sequencing technology, which lie in even lower components. Finally, if the same analysis is performed using only each sample's consensus sequence, no patterns related to the mutagens are found (Supplementary Fig. S2). These results reveal an important feature of evolving mutant swarms: despite the fact that deep sequencing data is extremely high-dimensional, we showed here that evolving populations of RNA viruses, which present the highest mutation frequencies, are moving in an intrinsically low-dimensional domain within sequence space. Indeed, all five of the biological pressures placed on these viral populations could be captured within the first thirteen components.

DISSEQT can monitor evolutionary trajectories and identify the minority variants involved in adaptation
Recently, we studied the adaptation of Coxsackie virus to a new cell line. Long-term passages of experimentally evolved populations (120 generations per virus) were analyzed by deep sequencing. Lacking suitable computational tools, the original study focused on identifying variants in the structural protein-coding region of a wild type lineage that showed signs of positive selection in the final passages of adaptation (mutations appearing at >2% in more than one replicate, and only in the structural proteins known to be involved in adaptation to cell culture) (Bordería et al. 2015). In that study, we identified one consensus sequence change that occurred in all lineages during the first ten passages, followed by a cluster of minority variants that reached above 5 per cent in the last passage in each series. The data set, however, contained whole-genome sequencing for three lineages of this virus: wild type, a higher replication fidelity lineage, and a lower fidelity lineage. Using DISSEQT, we could obtain a more complete picture by monitoring the evolutionary trajectories of three biological replicates per lineage (Fig. 4), without biasing toward nonsynonymous mutations in the structural protein region. The top panel gives an overview based on nonlinear dimension reduction, showing how the evolutionary trajectories of the replicates relate to each other. For each pair of replicates, the time of bifurcation was computed, and this was extended to sample clusters using average linkage hierarchical clustering. Before the time of bifurcation, the replicates are close in sequence space and follow the same evolutionary trajectory. The splits in the panel show when the bifurcations occur. All replicates shared the same starting point. Around Passage 4, the low fidelity replicates (yellow-orange) split from the others and shortly thereafter (around Passage 5) the wildtype replicates (magenta-purple) split from the higher fidelity replicates (green-cyan).
These observations reflect what was expected, but could not be detected using classical approaches that monitored only a few positively selected alleles: that low fidelity, mutator strains generated more minority variants more rapidly compared to wildtype, and to high fidelity strains. The replicates within each lineage then followed similar trajectories until further bifurcating between Passages 7 and 19. As with the previous examples where lineages clustered together, these results also support the notion that although sequence space is theoretically huge, similar lineages will tend to travel along the same evolutionary trajectories during the initial periods of evolution.
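The pairwise bifurcation times underlying Fig. 4 can be sketched as follows; the separation criterion used here (distance exceeding a multiple of the initial distance plus a small floor) is an illustrative assumption, not the exact criterion of the pipeline:

```python
import numpy as np

def bifurcation_passage(traj_a, traj_b, factor=3.0):
    """First passage at which two replicate trajectories separate.

    traj_a, traj_b: arrays of shape (n_passages, n_dims) giving the
    position of each replicate in the reduced sequence space at each
    passage. Returns the passage index of separation, or None if the
    trajectories never bifurcate within the sampled passages.
    """
    d = np.linalg.norm(np.asarray(traj_a) - np.asarray(traj_b), axis=1)
    threshold = factor * d[0] + 1e-2  # floor guards against d[0] == 0
    for t, dist in enumerate(d):
        if dist > threshold:
            return t
    return None
```

Extending this to clusters of replicates then amounts to running average linkage hierarchical clustering on the matrix of pairwise bifurcation times.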
While the above analysis gave information on the rate and direction of evolution, it did not identify which minority variant components of each population contributed to adaptation. The strongest principal component recovered the previously reported mutations in the structural proteins responsible for adaptation to receptor usage (Bordería et al. 2015). The remaining components, however, identified several other mutations at sites that were missed by using the commonly used cutoff of 1-2 per cent minority variant frequency reflecting the expected sequencing error rate, and that could explain subtler phenotypic differences between lineages and between replicates. For wildtype, for example, two additional amino acid changes in the VP1 and VP4 structural proteins contributed most to these lineages' departure from others (principal component 2). Finally, the lower components (4 and onward) revealed variants that explain each replicate's divergence from others, including many variants in nonstructural proteins such as the 2C (helicase) and 3C (protease) (Supplementary Fig. S3). Together, the results show that while low-frequency variants were identified at nearly every nucleotide site, the common biologically relevant signals arising during longer term evolution can be captured in relatively low dimension.

Visualization of evolution along an empirical fitness landscape
In RNA virus evolution, adaptation to new environments can often be attributed to single or few new mutations that become fixed in the population. Experimental evolution in the lab and convergent evolution in the field suggest that short term evolution may be of relatively low dimension, as supported by our findings. If so, then these initial movements in sequence space may be inherently predictable, provided a robust genotype-phenotype map could be generated. This connection between sequence space and fitness is most naturally illustrated as a fitness landscape, where fitness is shown as a function of location in sequence space. However, reconstructing such landscapes from empirical data has been challenging. To evaluate the ability of DISSEQT to correctly generate and visualize fitness landscapes, we first empirically measured the relative fitness of the wild type and SynSyn mutant swarms described above in a direct competition assay against a neutral, genetically marked competitor (Carrasco et al. 2007;Moratorio et al. 2017) (data available in Synapse). The visualization (Fig. 5, top panel) builds upon a 2d representation of sequence space, but using only the first two components from the SMSSVD representation is not sufficient since it ignores all other relevant signals in the data. Nonlinear dimension reduction by Isomap was used to distort sequence space such that the notion of closeness between mutant swarms is respected, taking all signals into account. Fitness was then added as the third dimension, interpolated by the Gaussian Kernel Smoother predictor (performance measured in Fig. 6, top panel). The figure shows the dynamics of the mutant swarms corresponding to each viral lineage evolving over time. The wild type lineage (black) occupied the centermost area of the landscape, surrounded by the other lineages. In general, wild type populations occupied high fitness regions of the landscape, with some variability. 
This observation confirmed that wild type virus is well adapted to the growth conditions used in these experiments, and should tolerate perturbations in the system, such as increases in mutational load. The green SynSyn lineage displayed the most dramatic fitness differences, reaching both very high and very low areas, whereas the blue SynSyn lineage showed a stable, plateau-like behavior without any significant drops in fitness. Finally, the red SynSyn lineage was stuck in an area of the fitness landscape without any fitness peaks. Indeed, the red lineage was shown to be attenuated in vivo, and unable to reach pathogenic outcomes available to wild type virus; while the blue lineage was shown to be more mutationally robust (Moratorio et al. 2017). Supplementary Fig. S4 shows the same fitness landscape, but with samples colored by mutagen. Importantly, the data show that 2d reconstruction of sequence space by nonlinear dimension reduction can adequately reconstruct a fitness landscape that captures the expected biological behavior of similar, yet different viral lineages.
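The fitness interpolation step described above (the Gaussian Kernel Smoother) is, in essence, a Nadaraya-Watson estimator: the predicted fitness at a query point is a distance-weighted average of the measured fitness values. A minimal sketch, with the bandwidth value and function name as our own assumptions:

```python
import numpy as np

def kernel_smoother_predict(X_train, y_train, X_query, bandwidth=1.0):
    """Nadaraya-Watson Gaussian kernel smoother.

    X_train: (n, d) positions of measured swarms in reduced sequence space.
    y_train: (n,) empirical fitness values.
    X_query: (m, d) positions at which to interpolate fitness.
    The bandwidth is a free parameter; in practice it would be chosen
    by cross-validation.
    """
    X_train = np.asarray(X_train, float)
    X_query = np.asarray(X_query, float)
    y_train = np.asarray(y_train, float)
    # Squared distances between every query point and every training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2 * bandwidth**2))
    return (w @ y_train) / w.sum(axis=1)
```

A small bandwidth makes the surface hug the measurements (near-nearest-neighbor behavior); a large one smooths the landscape toward the global mean fitness.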

Prediction of phenotype from genotype requires the input of minority variants

A prime goal in developing faithful representations of sequence space is the potential to assign phenotypes to known genotypes, and ultimately predict the phenotypes of new genotypes. For rapidly evolving populations, the presence of minority variants has been shown to contribute to phenotype, but this is not normally taken into account in genotype-phenotype mapping. Indeed, when the fitness landscape described above was reconstructed using only consensus sequence data, the landscape is considerably collapsed (Fig. 5, bottom panel).
We thus evaluated the relevance of our sequence space reconstructions (after noise reduction) in their ability to predict virus fitness, a quantitative parameter often used to describe phenotype. The performance of different fitness models was compared (Fig. 6). Predicting fitness is inherently difficult. Thus, to get a baseline for the optimal performance that could be achieved, we used group-based predictors that rely on sample conditions, rather than deep sequencing data. The fine-grained group predictor using Lineage, Mutagen, and Dosage accurately described the sample conditions (Fig. 6, turquoise bar). In other words, when these three groupings are known for a sample, the prediction is over 69 per cent accurate. When only lineage and dose were considered, prediction was 36 per cent accurate, and if only lineage and mutagen were known, accuracy dropped to 11 per cent. For the landscape predictor based on the 2d Isomap representation, accuracy was 44 per cent. The SMSSVD representation, on the other hand, which uses 13 dimensions, reached prediction accuracies of 62 and 61 per cent with the landscape and nearest neighbor predictors, respectively. The data revealed that while 2d Isomap performs well for visualization, prediction is best achieved when more components are incorporated. Importantly, when either Isomap or principal component analysis is performed solely on consensus sequences, prediction fails (2 and 14%, respectively). Furthermore, the performance of the SMSSVD predictors compared well to the predictor based on experimental conditions (Lineage, Mutagen, and Dose), the closest we have to a gold standard. In summary, the predictors based on our proposed sequence space representation vastly outperformed the consensus-based predictor. The data thus confirm that consensus sequencing of a viral population is not enough to accurately predict its phenotype.

Discussion
HTS is replacing more classic sequencing methods in microbiology, especially in studying RNA viruses, where every nucleotide can be easily covered with extreme depth. This has increased and renewed interest in better characterizing RNA virus populations to take into account their variability, particularly when trying to identify differences between clinical or experimental samples that have no significant differences in consensus sequence, yet present different phenotypes. Recent works show that indeed, most sites along a genome generate mutants at very low frequency. Following passage of poliovirus in cell culture, Acevedo, Brodsky, and Andino (2014) identified an average of 16,500 variants, the equivalent of ~74 per cent of all possible variant alleles in each passaged sample. Similarly, the previous analysis of the Coxsackie virus B3 wildtype populations described in more detail here identified variant alleles in 65-80 per cent of the sequenced regions (Bordería et al. 2015).
Despite the increasing accessibility of sequencing technology, we still lack the computational tools to use this data to its full potential. For instance, while an exhaustive list of variants can be generated per sample, to differentiate between similar, yet different, populations most studies have had to settle with using very basic mean measures such as Shannon entropy or mean variance. At best, these were followed up by a more targeted (and biased) focus on the few alleles suspected or known to be involved in the biological question being addressed.
A pre-existing obstacle to developing these tools was the uncertainty as to the size and dimensionality of sequence space actually occupied by evolving microbial populations. Mathematical sequence space is vast, even for the small genomes of RNA viruses. Theoretically, the high mutation rates of RNA viruses could allow them to reach a large portion of this space, calling into question whether the evolution of these microbes could be inherently predictable. However, it is clear that biological constraints prohibit this from occurring, as most mutations will affect form or function and will not accumulate under strong purifying selection. In vivo and in vitro experimental evolution studies performed in independent replicates reveal that under a constant environment, the same set of mutations tends to emerge. This suggests that the sequence space available to a virus is indeed more limited, determined by its current genome sequence, raising the possibility that evolutionary trajectories may be predicted at least in the very short term (the next one or few mutational steps). Fitness landscapes help us understand which neighboring regions represent distributions of genotypes of equal or increasing fitness and which regions define populations of lower fitness. Knowledge of fitness in the vicinity of a current population may help determine the most likely paths that will be taken during the evolution of the population. While this goal may seem lofty for large genomes, the small and highly constrained genomes of RNA viruses may be more amenable to such an exercise.
We have shown how the DISSEQT pipeline, using distribution-based modeling of complex, evolving viral populations, can uncover many different genotypic and phenotypic patterns without needing a priori hypotheses of which genetic alleles are to be studied. Importantly, the robust dimension reduction methods performed here have successfully separated biologically relevant signals from sequencing-related error and other noise, identifying key characteristics of the mutant swarm that drive evolution. This accentuates that global properties, like the 'shape' of the mutant swarm, are significant when trying to predict viral evolution. Sequencing error has long been an issue with characterizing microbial diversity and identifying true single nucleotide variants. Despite the presence of sequencing error, DISSEQT succeeded in finding structure in sequence space, made clear by the co-localization of populations subjected to similar environmental conditions and by accurate fitness predictions and fitness landscapes constructed on top of the sequence space representation.
Applied here, DISSEQT analysis has provided two key pieces of information regarding evolving RNA virus populations. First, that the biologically 'relevant' sequence space occupied by such populations is of intrinsically low dimension. In both data sets presented here, the SynSyn viruses that were manipulated to present discrete biological signals and the high-, low-, and wildtype-fidelity viruses evolving naturally to generate discrete differences in variant composition, the genetic signatures of biological interest were segregated and identified within an intrinsic space of very low dimension (10-20). Second, and most importantly, we show that reliable prediction of phenotype from genotype requires the input of minority variants, underscoring the importance of studying RNA viruses, and perhaps other microbial organisms, as a population rather than as a single reference sequence.
At the core of our model is the representation of a population as a measure over a suitable genetic space. Using traditional bulk experimental techniques, averaging is already performed in the sampling and measuring steps of the protocol, often resulting in relatively robust and normally (or log-normally) distributed data, well suited to well-established statistical and machine learning techniques for analysis and visualization.
Existing quasispecies reconstruction methods are fundamentally different from DISSEQT in that they produce a small set of haplotypes and ignore the complete codon distribution at each locus. Haplotype reconstruction based on short sequence reads is a mathematically ill-conditioned problem in the sense that a small change in input data can give a completely different result, and it is not feasible to reconstruct haplotypes that exist at low frequencies. The DISSEQT pipeline is designed to handle situations where the signal-to-noise ratio is modest and is able to use information on low-frequency variants to get a more complete representation of the mutant swarms.
We have shown that by directly modeling and representing the distribution at each genetic locus of all measurable minority variants, followed by model reduction, we obtain low-dimensional and robust models that capture the interaction between minority variants and, by coupling them with phenotypic measurements, make it possible to follow and predict trajectories in genotype–phenotype space. This opens up the possibility of extending the sequence space models presented in this work to situations where heterogeneity of populations can be hypothesized to be an important aspect that can be measured in a direct manner. In particular, data coming from single-cell sequencing have more variability, more artifacts, and often complex distributions (Bacher and Kendziorski 2016), and distribution-based modeling can be envisioned to provide a natural and biologically accurate representation of such data.
Cancer growth, fundamentally different in origin from viruses and bacteria, may still be usefully described in terms of similar evolutionary processes (Stratton, Campbell, and Futreal 2009). Larger genomes and frequent structural variation, such as chromosomal aberrations (Hanahan and Weinberg 2011) and fused genes (Lilljebjörn et al. 2016), do, however, make the situation more complex, and further work is needed to adapt the sequence space modeling for these circumstances. The challenges lie in incorporating structural variants into the underlying space in a way that preserves biological similarity and is feasible to infer from the data. A possible starting point for cancer data is to restrict the analysis to a chosen set of interesting genes that do not exhibit any structural variation, thus simplifying the collection of deep sequencing data and providing an easy fit to the sequence space models we propose.

Experimental setup
The experimental setup is described in detail in Moratorio et al. (2017) for the SynSyn data set and in Bordería et al. (2015) for the time series data on adaptation of Coxsackie virus to a new cell line.

Reproducible and traceable analysis
Traceability in the DISSEQT pipeline is provided by integration with the collaborative science platform Synapse. Every result produced by DISSEQT can be traced back all the way to the original data files using the Synapse provenance graph, which describes the actions taken for every analysis step and connects input to output data. Sharing settings in Synapse make it possible to open up the entire analysis to the public, while keeping sensitive data and unfinished analyses private if necessary. The analysis steps are self-contained in the sense that all data required to produce the output is downloaded from Synapse as needed. Hence, every analysis step can be reproduced locally by anyone executing the same actions. By changing parameters or making other changes, the impact of performing the analysis in a different manner can be investigated by others. Rerunning the entire analysis is also possible in this way. Furthermore, the analysis can be adapted to new data sets, such that the results can be reproduced from new biological data.

Iterative alignment
For each mutant swarm, the sequenced reads were aligned to the reference genome of the lineage (WT or one of the SynSyns) using BWA-MEM (Li 2013). The choice of alignment tool is not critical, but the same one should be used for all samples (mutant swarms) to get a consistent analysis. After alignment, the consensus sequence of the aligned sample is computed. If the consensus differs from the reference genome, the alignment starts over, now using the consensus as the new reference genome. This process is repeated until the consensus does not change. Iterative alignment combats an inherent problem that occurs when aligning to a reference genome: there is a bias, since reads that match the reference genome are easier to align, while reads that differ might be mapped incorrectly or cut off such that the variant is not included in the alignment. For changes at the consensus level, iterative alignment thus ensures that more reads are mapped correctly, allowing for a better frequency estimate. Even more important is that the ability to detect minority variants in the vicinity of consensus-level changes is greatly improved, as the number of differences between reads containing the minority variant and the consensus will tend to be lower.
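The iterate-until-fixed-point logic described above can be sketched as follows. This is a toy illustration only: the real pipeline runs BWA-MEM at each round, whereas here `majority_consensus` and the assumption of full-length, gapless reads are hypothetical stand-ins that keep the control flow visible.

```python
from collections import Counter

def majority_consensus(reads, length):
    """Majority base at each position (toy pileup: reads are full-length and gapless)."""
    cols = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read):
            cols[pos][base] += 1
    return "".join(c.most_common(1)[0][0] for c in cols)

def iterative_alignment(reads, reference, max_iter=10):
    """Recompute the consensus and realign until the consensus stops changing.

    The real pipeline re-runs BWA-MEM against each new consensus; this sketch
    skips the alignment step entirely to show only the fixed-point iteration.
    """
    ref = reference
    for _ in range(max_iter):
        consensus = majority_consensus(reads, len(ref))
        if consensus == ref:      # fixed point reached: consensus did not change
            return ref
        ref = consensus           # realign against the new consensus
    return ref

# Two of three reads carry A at position 3, so the consensus moves away
# from the starting reference and then stabilizes.
reads = ["ACGAACGT", "ACGAACGT", "ACGTACGT"]
result = iterative_alignment(reads, "ACGTACGT")
```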

Quality control
Generating deep sequencing data is a complex procedure with many steps, both for the experiment itself and to prepare the data for sequencing. The DISSEQT pipeline provides several ways to evaluate the data to make sure that it is of high quality. Before alignment, adapters and poor-quality bases are trimmed from the ends of reads using fastq-mcf (Aronesty 2011). At the end of the iterative alignment procedure, consensus sequences are automatically generated for all samples. It is expected that the consensus sequence will be more similar to the reference of the sample lineage than to the reference of any other lineage used in the same sequencing run. If this is not the case, the sample is flagged as being mislabeled. Indels are also reported. Graphs showing the read coverage as a function of genome position are created. All samples from the same lineage and sequencing run are put in one graph, making it possible to identify problems with low read coverage for certain samples or genomic regions at a glance. Samples with a low mean read coverage can be removed automatically from downstream analysis. What threshold to use depends on the experimental setup, but we recommend keeping only samples with a mean read coverage above 1,000 for deep sequencing data. There are also tools in DISSEQT to remove samples that are suspected of being contaminated by other samples, identified by having a mixture of reads that are likely to originate from different reference genomes.

The purpose of quality control is to validate that we are indeed studying what we set out to study. If a sample shows unexpected patterns, in particular during quality control, we recommend that the aligned reads, the consensus sequence, and any other measurements are inspected manually to ensure that conclusions are not drawn from faulty data.
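The recommended coverage filter can be sketched in a few lines; `filter_samples` and the dictionary layout are hypothetical names for illustration, not the pipeline's actual API.

```python
def mean_coverage(coverage_by_position):
    """Mean read coverage over all genome positions of one sample."""
    return sum(coverage_by_position) / len(coverage_by_position)

def filter_samples(samples, threshold=1000):
    """Keep only samples whose mean read coverage reaches the threshold."""
    return {name: cov for name, cov in samples.items()
            if mean_coverage(cov) >= threshold}

# Toy per-position coverage profiles for two mutant swarms.
samples = {
    "swarm_A": [1500, 2000, 1800],   # mean well above 1,000 -> kept
    "swarm_B": [900, 400, 100],      # mean below 1,000 -> dropped
}
kept = filter_samples(samples)
```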

Haplotypes
Recovering the haplotype mix from a collection of short reads is a difficult, often ill-conditioned, and computationally intensive problem, but several software tools are available (Zagordi et al. 2011; Prosperi and Salemi 2012; see Beerenwinkel et al. 2012 and McElroy, Thomas, and Luciani 2014 for overviews). The dominant haplotypes and their frequencies do not, however, completely characterize the viral population; another important aspect is how dispersed the individual viruses are around these central haplotypes. V-Phaser (Macalalad et al. 2012), V-Phaser 2 (Yang et al. 2013), and ShoRAH (Zagordi et al. 2011) find phased variants, pushing down the detection limit by assuming that real variants (at nearby loci) tend to co-vary, while errors do not. Unfortunately, V-Phaser and ShoRAH do not scale well for large data sets, and V-Phaser 2 requires paired-end reads. For the reasons above, we chose the simpler and more robust path of making maximum likelihood (ML) estimates of the variant frequencies at each position, based on base quality data.

Sequence space representation
The genomic composition of microbial populations can be represented by a positive measure over a suitable space. Let $\Sigma$ be an alphabet set, e.g. the set of nucleotides $\Sigma = \mathcal{N} := \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$, the set of codons $\Sigma = \mathcal{N} \times \mathcal{N} \times \mathcal{N}$, or the set of amino acids $\Sigma = \mathcal{A} = \{\mathrm{A}, \mathrm{R}, \mathrm{N}, \ldots\}$. For the rest of this article, the set of codons will be used as the alphabet set, since codons are closely connected to biological function and this choice does not impose any assumptions about the relative importance of synonymous versus nonsynonymous changes. The set of codons is the natural choice for coding regions; to analyze non-coding regions, the set of nucleotides could be used instead. Now define sequence space $\Sigma^n$ as the set of sequences of length $n$ over the alphabet $\Sigma$. Assuming that individual genomes in the population only differ by a finite number of point mutations (i.e. substitutions), the composition of the population is characterized by a positive measure over sequence space. The space of positive measures over sequence space will be denoted by $\mathcal{P}(\Sigma^n)$.
Inference of the population composition can be intractable from sequencing data due to short reads and/or high error rates. Let $P, Q \in \mathcal{P}(\Sigma^n)$ and define an equivalence relation such that $P \sim Q$ iff $P(C_i[x]) = a\,Q(C_i[x])$ for all $x \in \Sigma$ and $i \in \{1, 2, \ldots, n\}$, for some constant $a \in \mathbb{R}_+$, where $C_i[x] := \{s \in \Sigma^n : s_i = x\}$ are the basic cylinder sets of $\Sigma^n$. Hence, $P$ relates to $Q$ if they have the same allele frequencies at all positions. Inference of the equivalence class $[P]$ from sequence data is possible even when $P$ cannot be inferred, since allele frequencies at different positions can be estimated separately. The drawback is that minority variant linkage is lost, since we do not attempt single-virion haplotype reconstruction. Each equivalence class $[P]$ is naturally represented by the frequency matrix $p \in \mathbb{R}^{n \times |\Sigma|}$ with $p_{i,x} = P(C_i[x])/P(\Sigma^n)$. Finally, the frequencies are transformed by $p \mapsto \log_2(p + \alpha)$, where $\alpha$ denotes the limit of detection, to give minority variants higher impact in the model. The log transformation emphasizes relative differences in frequencies between variants instead of absolute differences.
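A minimal sketch of building the frequency matrix and applying the log transform with a limit of detection, assuming raw per-locus codon counts are already available; the three-symbol alphabet is a toy stand-in for the sixty-four codons, and the function names are hypothetical.

```python
import math

def frequency_matrix(counts):
    """Row-normalize per-locus codon counts into frequencies p[i][x]."""
    return [[c / sum(row) for c in row] for row in counts]

def log_transform(p, alpha):
    """p -> log2(p + alpha), with alpha the limit of detection. The log
    emphasizes relative rather than absolute frequency differences, giving
    minority variants higher impact in the downstream model."""
    return [[math.log2(v + alpha) for v in row] for row in p]

# Toy example: 2 loci, alphabet of 3 "codons" (real data uses all 64).
counts = [[90, 9, 1],
          [50, 50, 0]]
p = frequency_matrix(counts)
q = log_transform(p, alpha=1e-3)
```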

Sequence space inference
Maximum likelihood estimation was used to infer the codon frequencies at any given position, using all reads overlapping that position, based on a multinomial model with noisy observations. At a given locus, let $\theta = (\theta_1, \theta_2, \ldots, \theta_{64})$ be the frequencies in the population of the sixty-four different codons, with $\theta_i \geq 0$ for all $i$ and $\sum_i \theta_i = 1$. Consider a read and the fragment the read is sequenced from. Let $x$ be the observed codon in the read and $z$ the unknown codon in the original fragment; then the probability of observing $x$ is $P(x) = \sum_z P(x|z)\,\theta_z$. We model $P(x|z)$ using the quality scores of the bases in the codon. If $\epsilon_1, \epsilon_2, \epsilon_3$ are the probabilities of a read error at bases 1, 2, and 3 in the codon and $y_k$ is the base at position $k$ in a codon $y$, then
$$P(x|z) = \prod_{k=1}^{3}\left[(1-\epsilon_k)\,\delta_{x_k}^{z_k} + \frac{\epsilon_k}{3}\left(1-\delta_{x_k}^{z_k}\right)\right],$$
where $\delta_a^b$ is the Kronecker delta; the errors are thus assumed to be independent between bases in the codon, and read errors are assumed to be equally likely to result in any of the other three bases. Assuming independent reads, the likelihood of the observations is
$$L(\theta) = \prod_r \sum_z P\!\left(x^{(r)}|z\right)\theta_z,$$
which is maximized numerically. We noted that in our high read coverage data, bases with low-quality Phred scores tended to be biased toward certain nucleotide errors. Thus, we chose not to trust bases with a Phred score below 30. This was done by setting the $\epsilon_k$ of such nucleotides to $0.75$, giving them no influence. Reads were excluded from the analysis if they caused the ML optimization problem to be underdetermined (e.g. when observing two reads with codons AAA and xAT, respectively, where x means that the nucleotide is unknown, xAT is dropped since only the sum of the frequencies for AAT, CAT, GAT, and TAT can be determined).
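The likelihood above can be maximized numerically in several ways; the sketch below uses expectation-maximization (EM), which is one standard choice for this mixture form but is an assumption here, not necessarily the optimizer used in DISSEQT. The quality-score handling (Phred < 30, underdetermined reads) is omitted, and a four-symbol toy alphabet stands in for the sixty-four codons.

```python
def em_codon_frequencies(observed, error_model, n_symbols, n_iter=200):
    """EM for theta maximizing prod_r sum_z P(x_r | z) * theta_z.

    observed:    list of observed symbol indices x_r
    error_model: error_model[x][z] = P(observe x | true symbol z)
    """
    theta = [1.0 / n_symbols] * n_symbols
    for _ in range(n_iter):
        acc = [0.0] * n_symbols
        for x in observed:
            # E-step: responsibility of each true symbol z for observation x
            w = [error_model[x][z] * theta[z] for z in range(n_symbols)]
            total = sum(w)
            for z in range(n_symbols):
                acc[z] += w[z] / total
        # M-step: theta is the mean responsibility over all observations
        theta = [a / len(observed) for a in acc]
    return theta

# Toy: 4 symbols, 2% chance of misreading to each of the other symbols.
n = 4
eps = 0.02
error_model = [[1 - 3 * eps if x == z else eps for z in range(n)] for x in range(n)]
observed = [0] * 70 + [1] * 30
theta = em_codon_frequencies(observed, error_model, n)
```

With a noise-free (identity) error model, the EM estimate reduces to the empirical symbol frequencies, which is a useful sanity check.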

Limit of detection
True minority variants can be hard to separate from sequencing errors, and in both cases we expect the frequencies to differ depending on the nucleotide neighborhood and other factors (DePristo et al. 2011). A key difference, however, is that there are two sets of observations of the sequencing errors, since the reads originating from the forward and reverse strands have different nucleotide neighborhoods for any given codon site. Indeed, for each sample, the codon frequencies from the two strands are expected to be approximately equal for true minority variants, something which is much less likely for sequencing errors. The differences in sequencing error behavior depending on the context thus lead us to estimate the limit of detection $\alpha$ separately for each locus and codon. For a given locus, the samples are grouped by run and consensus codon, to get similar sequencing errors across the samples in each group. Fix a codon and let $f$ and $r$ be two vectors where $f_i$ and $r_i$ are the inferred codon frequencies using reads from only the forward and reverse strands, respectively, for sample $i$ in the group. To limit the impact of sequencing errors on the downstream analysis, the transformed frequencies should be approximately equal, i.e. give a low value of the norm
$$w(\alpha) = \left\|\log_2(f + \alpha\mathbf{1}) - \log_2(r + \alpha\mathbf{1})\right\|_{\mathrm{RMS}},$$
where $\log_2$ acts elementwise and $\mathbf{1}$ is a vector of all ones. Now define the limit of detection
$$\alpha := \inf\left\{t > 0 : w(t) \leq \log_2(1.5)\right\}.$$
The infimum exists since $w$ is continuous and $w(t) \to 0$ as $t \to \infty$. The threshold $\log_2(1.5)$ is chosen such that if we have a single sample with $f_1 = x$ and $r_1 = 0$, then $\alpha = 2x$. Furthermore, $w$ is a strictly decreasing function, and $\alpha$ can thus be found by the bisection method or other root-finding methods. Finally, we choose a conservative estimate of the limit of detection $\alpha_{c,x}$, for codon $c$ at locus $x$, by taking the highest limit of detection estimated from the different sample groups $1, 2, \ldots, G$:
$$\alpha_{c,x} := \max\left\{10^{-3}, \alpha_{c,x}^{(1)}, \alpha_{c,x}^{(2)}, \ldots, \alpha_{c,x}^{(G)}\right\},$$
with upper indices denoting the sample group and where $10^{-3}$ is a commonly accepted lower limit of detection for sequencing data (Goodwin, McPherson, and McCombie 2016).

Dimension estimation using Talus plots
The Talus Plot provides a visualization of how the contents of a data set spread out across different dimensions and is designed to make it as easy as possible to form a qualitative estimate of the number of dimensions needed to capture all signals that stand out above the background noise. In the Supplementary Material, we show how predictable aspects of the background noise can be used to discern signals from noise. In brief, when the Talus Plot has 'settled', with small variations around a low mean, the noise can be expected to be dominant.

SMSSVD
SMSSVD (Henningsson and Fontes 2019) is a parameter-free dimension reduction technique designed for the reconstruction of multiple overlaid low-rank signals from a data matrix corrupted by noise. It is ideal for exploratory analysis of complex data, where different signals are spread out over different (possibly overlapping) subsets of variables, because it limits the influence of noise from variables that do not contribute to a signal. One of the major benefits of SMSSVD is its ability to detect signals with a low signal-to-noise ratio. SMSSVD shares many relevant properties with SVD, in particular orthogonality between components and the ability to extract variable loadings. The DISSEQT pipeline uses SMSSVD for noise reduction of the sequence space representation, since the number of variables is very large and we are trying to recover all signals that can be detected, not only the strongest one. Before applying SMSSVD, the data matrix is centered.

Fitness landscapes

Fitness landscapes are constructed by fitting a surface over the low-dimensional sequence space representation. In practice, however, we cannot expect a perfect fit of the surface. Differences in fitness between sample points that are close in the low-dimensional representation will be difficult to capture. Furthermore, measurement noise will impact the reproducibility of the surface. To get a robust fitness landscape, we use a Gaussian Kernel Smoother (Hastie, Tibshirani, and Friedman 2009) and select the kernel width $\sigma$ by cross-validation (repeated random subsampling). That is, we numerically find the kernel width that minimizes the prediction error on the held-out samples.
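A sketch of a one-dimensional Gaussian Kernel Smoother with the kernel width selected by repeated random subsampling; the helper names, candidate grid, and holdout fraction are illustrative choices, not the pipeline's implementation.

```python
import math, random

def gks_predict(x, xs, ys, sigma):
    """Gaussian Kernel Smoother: weighted average of ys with Gaussian weights."""
    wts = [math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in xs]
    return sum(w * y for w, y in zip(wts, ys)) / sum(wts)

def cv_error(sigma, xs, ys, n_rep=50, holdout=0.3, seed=0):
    """Repeated random subsampling: fit on a subset, score on held-out points."""
    rng = random.Random(seed)
    n, err, cnt = len(xs), 0.0, 0
    for _ in range(n_rep):
        idx = list(range(n))
        rng.shuffle(idx)
        k = max(1, int(holdout * n))
        test, train = idx[:k], idx[k:]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for i in test:
            err += (gks_predict(xs[i], tx, ty, sigma) - ys[i]) ** 2
            cnt += 1
    return err / cnt

def select_sigma(xs, ys, candidates):
    """Pick the kernel width minimizing the cross-validation error."""
    return min(candidates, key=lambda s: cv_error(s, xs, ys))

# Noise-free samples of a smooth function: a moderate width should beat
# extreme oversmoothing.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [math.sin(x) for x in xs]
sigma = select_sigma(xs, ys, [0.05, 0.2, 0.5, 2.0, 10.0])
```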

Fitness evaluation
The Gaussian Kernel Smoother (fitness landscape) predictors are evaluated in comparison to other fitness predictors. Nearest Neighbor predictors use the fitness of the closest sample in sequence space as the prediction and can be used for different sequence space models. In case of ties, the prediction is taken as the average over the tied samples. Group-based predictors use a predetermined grouping of the samples, predicting fitness as the average fitness among samples in the same group, and do not use sequence data at all.
Model accuracy for a predictor $f$ is measured by the fraction of variance explained,
$$1 - \frac{\sum_i \left(f(x^{(i)}) - y^{(i)}\right)^2}{\sum_i \left(y^{(i)} - \bar{y}\right)^2},$$
where $x^{(i)}$ is the representation of sample $i$ used by the predictor, $y^{(i)}$ is the fitness of sample $i$, $\bar{y}$ the mean fitness over all samples, and the second term in the expression is the variance of the residuals divided by the total variance. The models are evaluated by leave-one-out cross-validation. The kernel widths for the Gaussian Kernel Smoother predictors are estimated separately for each problem instance to avoid influence from the left-out sample.
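The leave-one-out evaluation combined with a Nearest Neighbor predictor can be sketched as follows, using a one-dimensional toy representation; the function names are hypothetical.

```python
def nearest_neighbor_predict(x, xs, ys):
    """Predict fitness as that of the closest sample (average over ties)."""
    dmin = min(abs(x - xi) for xi in xs)
    tied = [y for xi, y in zip(xs, ys) if abs(x - xi) == dmin]
    return sum(tied) / len(tied)

def loo_variance_explained(xs, ys):
    """Leave-one-out fraction of variance explained:
    1 - residual variance / total variance."""
    n = len(xs)
    preds = []
    for i in range(n):
        tx = xs[:i] + xs[i + 1:]
        ty = ys[:i] + ys[i + 1:]
        preds.append(nearest_neighbor_predict(xs[i], tx, ty))
    ybar = sum(ys) / n
    ss_res = sum((p - y) ** 2 for p, y in zip(preds, ys))
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Two well-separated groups with distinct fitness values: the nearest
# neighbor of each left-out sample lies in its own group, so the
# predictor recovers every fitness value exactly.
xs = [0.0, 0.1, 5.0, 5.1]
ys = [1.0, 1.0, 3.0, 3.0]
ve = loo_variance_explained(xs, ys)
```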

Variable selection
We use SPC (Witten, Tibshirani, and Hastie 2009) for variable selection, after noise reduction by SMSSVD. SPC adds a variable-side $L_1$ (lasso) constraint to a formulation of SVD as an optimization problem, forcing sparsity by ensuring that many variables are 0 at the optimum. The optimization problem is then solved for one component at a time, using an iterative algorithm. However, since the optimization problem is not necessarily convex, the algorithm might converge to local optima. To reduce the impact of this problem, and to ensure that the singular values are declining, we suggest an extension of the algorithm. It can be shown that if a component has a larger singular value than a previous one, then this solution is guaranteed to be a better starting guess for the optimization problem of the previous component. By rolling back and restarting the optimization at the previous component, we get closer to the globally optimal solution and make sure that the singular values are declining.
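The core alternating update for a single sparse component, in the style of Witten, Tibshirani, and Hastie (2009), can be sketched as below. The rollback extension described above is omitted, and the penalty value is an arbitrary illustrative choice; this is not DISSEQT's exact implementation.

```python
import math

def soft_threshold(v, lam):
    """Elementwise soft-thresholding operator for the L1 (lasso) penalty."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sparse_rank1(X, lam, n_iter=100):
    """Alternating updates for a rank-1 SVD with an L1 penalty on v:
    u = Xv/|Xv|, then v = S(X^T u, lam)/|S(X^T u, lam)|."""
    n, p = len(X), len(X[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        u = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        u = [x / norm(u) for x in u]
        v = soft_threshold(
            [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)], lam)
        nv = norm(v)
        if nv == 0:
            raise ValueError("lambda too large: all loadings shrunk to zero")
        v = [x / nv for x in v]
    sigma = sum(u[i] * X[i][j] * v[j] for i in range(n) for j in range(p))
    return sigma, u, v

# Rank-1 data whose signal lives only in the first two of four variables:
# the sparse loading vector v should zero out the remaining variables.
X = [[2.0, 2.0, 0.0, 0.0],
     [4.0, 4.0, 0.0, 0.0],
     [6.0, 6.0, 0.0, 0.0]]
sigma, u, v = sparse_rank1(X, lam=0.1)
```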

Nonlinear dimension reduction
By dimension reduction, we aim to identify the parts of sequence space that are explored by the samples. Linear dimension reduction techniques, like SMSSVD, are useful because they make very few assumptions about the structure of the data. Although there is no reason to believe that the underlying manifold is linear (the complexity necessary for biological systems is indeed often caused by nonlinearities), linear methods can still capture nonlinear patterns if the dimension is sufficiently high (Nash 1956). However, to get an informative visualization in just two or three dimensions, nonlinear dimension reduction is needed for complex data sets. We apply Isomap (Tenenbaum, De Silva, and Langford 2000) to the data set after the noise reduction by SMSSVD (and the optional variable selection). Nonmetric MDS using Kruskal's stress criterion (Kruskal 1964) was used rather than classical MDS in the final step of the Isomap algorithm. This distorts the underlying space by expanding local structure that would otherwise be too small to notice, giving some importance to weaker signals in the data.
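The geodesic-distance stage of Isomap (a neighborhood graph followed by shortest paths) can be sketched as below; the final embedding step (nonmetric MDS) is omitted, the quarter-circle data is a toy manifold, and the function names are illustrative.

```python
import math

def geodesic_distances(points, k=2):
    """Isomap stages 1-2: build a k-nearest-neighbor graph, then use
    all-pairs shortest paths (Floyd-Warshall) to approximate geodesic
    distances along the manifold."""
    n = len(points)
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        # index 0 of the sort is the point itself (distance 0); skip it
        nbrs = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))[1:k + 1]
        for j in nbrs:
            dij = math.dist(points[i], points[j])
            d[i][j] = min(d[i][j], dij)
            d[j][i] = min(d[j][i], dij)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Points on a quarter circle: the graph distance between the endpoints
# follows the arc and is therefore longer than the straight-line distance.
pts = [(math.cos(t), math.sin(t)) for t in [i * math.pi / 16 for i in range(9)]]
g = geodesic_distances(pts, k=2)
```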

Time series
The evolution of a population over time is described by a curve $p(t)$ in sequence space. In practice, we can only measure the values of $p(t)$ at discrete time points, and the measurements are subject to noise. As the first step of noise reduction, a 3-point median filter over time is applied to the sequence space representation, to robustly reduce the impact of noise spikes. Following the noise reduction, the curve $p(t)$ is reconstructed in the $d$-dimensional representation of sequence space as a piecewise linear curve connecting the data points. Each curve is then reparameterized by arc length $s$, starting at $s = 0$ for $t = 0$, since differences in mutation rates can cause the populations to move at different speeds through sequence space.
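The two preprocessing steps, median filtering and arc-length reparameterization, can be sketched as follows. The median filter is shown on a single coordinate (in practice it would be applied per dimension), endpoint handling is an assumption, and the function names are hypothetical.

```python
import math

def median3_filter(seq):
    """3-point median filter over time (endpoints kept as-is), robustly
    suppressing isolated noise spikes in one coordinate."""
    if len(seq) < 3:
        return list(seq)
    out = [seq[0]]
    for i in range(1, len(seq) - 1):
        out.append(sorted(seq[i - 1:i + 2])[1])
    out.append(seq[-1])
    return out

def arc_length_parameters(points):
    """Cumulative arc length s along a piecewise linear curve, s = 0 at the start."""
    s = [0.0]
    for p, q in zip(points, points[1:]):
        s.append(s[-1] + math.dist(p, q))
    return s

noisy = [0.0, 0.1, 9.0, 0.2, 0.3]        # isolated spike at the third time point
smooth = median3_filter(noisy)            # spike removed, trend preserved
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
s = arc_length_parameters(path)           # [0.0, 1.0, 2.0]
```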
The sequence space representation in terms of variables (variants) is time-invariant, but it is nevertheless important to see how different parts of sequence space are explored as the replicates move. Let $\sigma$ be the first singular value, with corresponding left and right singular vectors $u$ and $v$, after dimension reduction of a matrix $X$ by SMSSVD, SVD, or SPC. Then $\sigma$ can be decomposed as a sum over samples and variables,
$$\sigma = u^{\mathrm{T}} X v = \sum_{i,j} \sigma_{ij}, \qquad \sigma_{ij} := u_i X_{ij} v_j,$$
where $\sigma_{ij}$ quantifies the importance of sample $i$ and variable $j$ for this component. By linear interpolation, this can be extended to $\sigma_j^{(r)}(s)$, for intermediate values of the curve parameter $s$ for replicate $r$. The contribution of variable $j$ at $s$ is measured by $\sigma_j(s) := \sum_r \sigma_j^{(r)}(s)$, and $\sigma(s) := \sum_j \sigma_j(s)$ describes the importance of the first principal component at $s$. Plotting $\sigma(s)$ and $\sigma_j(s)$ along with the replicates thus aids understanding of the dynamics. The definitions naturally extend to multiple components.

Bifurcations
We define the time of bifurcation $b(p, q)$ between two curves $p(t)$ and $q(t)$ as the similarity measure
$$b(p, q) = \inf\left\{t : \|p(t) - q(t)\|_2 \geq Bm\right\},$$
that is, the first point in time at which the distance between $p(t)$ and $q(t)$ is above a threshold. Here, $B$ is a chosen threshold and $m$ a normalization constant chosen to make the expression scale-invariant. If $p^{(i)}(t)$ is defined for $t \in [0, T_i]$ and $T_{ij} := \min(T_i, T_j)$, then the mean distance over time between curves $i$ and $j$ is
$$m_{ij} := \frac{1}{T_{ij}} \int_0^{T_{ij}} \left\|p^{(i)}(t) - p^{(j)}(t)\right\|_2 \, dt,$$
and we let $m := \frac{1}{N(N-1)} \sum_{i \neq j} m_{ij}$, the mean over all pairs of the $N$ curves. Average linkage hierarchical clustering, based on the time of bifurcation similarity scores, naturally extends the concept to clusters of samples, giving recursive cluster splits and a cluster similarity score equal to the time of bifurcation at each split. For piecewise linear curves, $m$ and $b(p^{(i)}, p^{(j)})$ can be computed analytically.
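For illustration, the bifurcation time and the normalization constant can be approximated by dense sampling of the piecewise linear curves (the text notes that both can also be computed analytically); the function names and step counts are hypothetical choices.

```python
import math

def curve_point(ts, pts, t):
    """Evaluate a piecewise linear curve (breakpoints ts, values pts) at time t."""
    for (t0, p0), (t1, p1) in zip(zip(ts, pts), zip(ts[1:], pts[1:])):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)
            return tuple((1 - a) * x0 + a * x1 for x0, x1 in zip(p0, p1))
    raise ValueError("t outside curve domain")

def mean_distance(ts, p, q, n_steps=10_000):
    """Numerical approximation of (1/T) * integral of ||p(t) - q(t)|| dt."""
    T = ts[-1] - ts[0]
    total = 0.0
    for i in range(n_steps):
        t = ts[0] + T * (i + 0.5) / n_steps   # midpoint rule
        total += math.dist(curve_point(ts, p, t), curve_point(ts, q, t))
    return total / n_steps

def bifurcation_time(ts, p, q, B, m, n_steps=10_000):
    """First t with ||p(t) - q(t)|| >= B*m, found by dense sampling."""
    for i in range(n_steps + 1):
        t = ts[0] + (ts[-1] - ts[0]) * i / n_steps
        d = math.dist(curve_point(ts, p, t), curve_point(ts, q, t))
        if d >= B * m:
            return t
    return ts[-1]   # curves never diverge past the threshold

# Two straight lines diverging linearly: the distance is 2t on [0, 1],
# so m = 1 and, with B = 1, the bifurcation happens at t = 0.5.
ts = [0.0, 1.0]
p = [(0.0, 0.0), (1.0, 0.0)]
q = [(0.0, 0.0), (-1.0, 0.0)]
m = mean_distance(ts, p, q)
b = bifurcation_time(ts, p, q, B=1.0, m=m)
```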