Effective design and inference for cell sorting and sequencing based massively parallel reporter assays

Abstract

Motivation: The ability to measure the phenotype of millions of different genetic designs using Massively Parallel Reporter Assays (MPRAs) has revolutionized our understanding of genotype-to-phenotype relationships and opened avenues for data-centric approaches to biological design. However, our knowledge of how best to design these costly experiments, and of the effect that our choices have on the quality of the data produced, is lacking.

Results: In this article, we tackle the issues of data quality and experimental design by developing FORECAST, a Python package that supports the accurate simulation of cell-sorting and sequencing-based MPRAs and robust maximum likelihood-based inference of genetic design function from MPRA data. We use FORECAST's capabilities to reveal rules for MPRA experimental design that help ensure accurate genotype-to-phenotype links, and show how the simulation of MPRA experiments can help us better understand the limits of prediction accuracy when these data are used for training deep learning-based classifiers. As the scale and scope of MPRAs grow, tools like FORECAST will help ensure we make informed decisions during their development and get the most out of the data produced.

Availability and implementation: The FORECAST package is available at: https://gitlab.com/Pierre-Aurelien/forecast. Code for the deep learning analysis performed in this study is available at: https://gitlab.com/Pierre-Aurelien/rebeca.


Supplementary Note 1: Comparing experimental designs using Bayesian decision theory
Here we derive the criterion used to compare different Flow-seq experimental designs. Our goal is to estimate, for each genetic variant, the corresponding fluorescence parameters $\theta^* \in \mathbb{R}^2$ (either the Log-Normal or Gamma parameters). For this purpose, the Flow-seq protocol is used, which depends on a choice of experimental factors $e \in \mathbb{R}^3$: (1) the number of sequencing reads, (2) the number of cells sorted, and (3) the number of bins used for cell sorting.
The outcome of the Flow-seq experiment is sequencing data that acts as an observable $z \sim P(z \mid \theta^*; e)$.
This data is then processed by an estimator (either ML or MOM) to calculate the fluorescence parameters $\theta \sim P(\theta \mid z)$ for each genetic variant. $P(z \mid \theta^*; e)$ accounts for the randomness that occurs when growing the cells containing the library and during the sequencing process, while $P(\theta \mid z)$ describes the randomness occurring during the Nelder-Mead optimization step used by the ML estimator. However, we found this second source of randomness to be negligible in our experiments and therefore neglected its contribution.
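To make the generative model $P(z \mid \theta^*; e)$ concrete, the sketch below simulates the observable $z$ (per-bin read counts) for a single variant under a Log-Normal fluorescence model. This is a minimal illustration, not the FORECAST implementation: the function name `simulate_flowseq`, the fluorescence range, the bin edges, and the read-allocation scheme are all assumptions made for the example.

```python
# Minimal sketch (not the FORECAST API) of the generative model P(z | theta*; e):
# cells carrying one variant are sorted into logarithmically spaced fluorescence
# bins, then the sorted pools are sequenced with a fixed total read budget.
import numpy as np

rng = np.random.default_rng(0)

def simulate_flowseq(theta, n_cells, n_reads, n_bins, fluo_range=(1.0, 1e5)):
    """Return per-bin read counts z for one variant with Log-Normal parameters
    theta = (mu, sigma); n_cells, n_reads and n_bins are the experimental factors e."""
    mu, sigma = theta
    # Logarithmically spaced bin edges, mimicking the sorter's log-scale gates
    edges = np.logspace(np.log10(fluo_range[0]), np.log10(fluo_range[1]), n_bins + 1)
    # Sorting step: draw single-cell fluorescence values and count cells per bin
    fluorescence = rng.lognormal(mean=mu, sigma=sigma, size=n_cells)
    cells_per_bin, _ = np.histogram(fluorescence, bins=edges)
    # Sequencing step: reads are drawn multinomially, proportional to sorted cells
    # (a simplifying assumption; real protocols may fix a read budget per bin)
    p = cells_per_bin / cells_per_bin.sum()
    return rng.multinomial(n_reads, p)

# Example: one variant with theta* = (3.0, 0.6) and e = (10^6 reads, 10^5 cells, 8 bins)
z = simulate_flowseq(theta=(3.0, 0.6), n_cells=100_000, n_reads=1_000_000, n_bins=8)
```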
Depending on the values of the experimental factors $e$, the deviations of the estimates from the true value $\theta^*$ will be more or less important, and can be measured using an appropriate loss function. A popular choice for experimental design involves minimizing the total parameter variance of the estimates (i.e., A-optimal design [1]), although such a choice would ignore the bias of the estimates [2]. To better understand the influence of the experimental design choices $e$ on the fluorescence parameter estimates, we considered two different loss functions.

Firstly, a loss function comparing globally the inferred and ground-truth distributions, as measured by the 1-Wasserstein distance:

$$\mathcal{L}_{W_1}(\theta, \theta^*) = \int_{\mathbb{R}} \left| F_{\theta}(x) - F_{\theta^*}(x) \right| \, \mathrm{d}x,$$

with $F_{\theta}(x)$ and $F_{\theta^*}(x)$ being the respective cumulative distribution functions.

And secondly, a local loss function to quantify the differential error between the two parameters of the fluorescence distribution $\theta = (\theta_1, \theta_2)$:

$$\mathcal{L}_i(\theta, \theta^*) = \frac{\left| \theta_i - \theta_i^* \right|}{\left| \theta_i^* \right|}, \quad i \in \{1, 2\}.$$

The choice of the absolute relative error loss instead of the more common squared error loss was motivated by two reasons. Firstly, the logarithmic scaling used in flow cytometry binning inherently leads to a decrease in measurement quality for constructs displaying high fluorescence, which would inflate the loss if the squared error were used. Secondly, the absolute relative error loss is more interpretable and useful for biological engineers, who typically prefer to bound the relative magnitude of an error rather than its absolute value [3].
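As an illustration of how these two losses could be computed for Log-Normal fluorescence models, a minimal Python sketch is given below. The function names, the integration grid, and the use of SciPy are assumptions made for this example and are not part of FORECAST.

```python
# Sketch of the two loss functions described above, for Log-Normal models
# parameterized as theta = (mu, sigma).
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def wasserstein_loss(theta_hat, theta_star, grid=None):
    """1-Wasserstein distance between inferred and ground-truth Log-Normal models,
    computed numerically as the integral of |F_hat(x) - F_star(x)| over a grid."""
    if grid is None:
        grid = np.linspace(1e-3, 1e5, 200_000)
    F_hat = stats.lognorm.cdf(grid, s=theta_hat[1], scale=np.exp(theta_hat[0]))
    F_star = stats.lognorm.cdf(grid, s=theta_star[1], scale=np.exp(theta_star[0]))
    return trapezoid(np.abs(F_hat - F_star), grid)

def relative_error_loss(theta_hat, theta_star):
    """Per-parameter absolute relative error |theta_hat_i - theta*_i| / |theta*_i|."""
    theta_hat, theta_star = np.asarray(theta_hat, float), np.asarray(theta_star, float)
    return np.abs(theta_hat - theta_star) / np.abs(theta_star)

# Example: a small estimation error on (mu, sigma)
print(wasserstein_loss((3.05, 0.58), (3.0, 0.6)))     # global loss, in fluorescence units
print(relative_error_loss((3.05, 0.58), (3.0, 0.6)))  # ~1.7% error on mu, ~3.3% on sigma
```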
Such loss functions are often impossible to minimize uniformly with respect to the decision (here the experimental design $e$), as the parameter $\theta^*$ is unknown and the inferred parameter $\theta$ is a random variable.
A frequentist approach would imply averaging over all replicates of the Flow-seq experiment to calculate the frequentist risk [4]:

$$R(\theta^*, e) = \mathbb{E}_{z \sim P(z \mid \theta^*; e)}\left[ \mathcal{L}\big(\theta(z), \theta^*\big) \right].$$

However, the frequentist risk remains a function of the parameter $\theta^*$ to be estimated, which is highly variable in these types of experiments.
To resolve this issue, we can define the integrated risk $r(e)$, which considers the average over a prior $\pi(\theta^*)$ of the parameter $\theta^*$:

$$r(e) = \int R(\theta^*, e)\, \pi(\theta^*)\, \mathrm{d}\theta^* = \iint \mathcal{L}\big(\theta(z), \theta^*\big)\, P(z \mid \theta^*; e)\, \pi(\theta^*)\, \mathrm{d}z\, \mathrm{d}\theta^*.$$

The integrated risk induces a total ordering, allowing us to compare the loss function for different experimental designs and inference methods. We resorted to Monte Carlo simulations to compute the integrated risk, as it contains a double integral over the parameter and observation spaces that is challenging to compute analytically. In practice, this led us to average the inference results over a library of constructs (either from the Taniguchi [5] or Cambray [6] data sets), which constitutes our prior on the parameter space, and to conduct many Flow-seq replicates to average over the Flow-seq realisations.
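A minimal Monte Carlo sketch of this double average is shown below. It reuses the hypothetical `simulate_flowseq` and `wasserstein_loss` helpers from the earlier sketches and assumes an `infer_theta` callable standing in for the ML or MOM estimator; it only illustrates the structure of the calculation, not the actual FORECAST workflow.

```python
# Monte Carlo sketch of the integrated risk r(e): average the chosen loss over a
# library of ground-truth parameters (the empirical prior) and over simulated
# Flow-seq replicates.
import numpy as np

def integrated_risk(library_thetas, e, loss_fn, infer_theta, n_replicates=10):
    """library_thetas: ground-truth (mu, sigma) per construct (the prior sample);
    e: dict of experimental factors; loss_fn, infer_theta: assumed callables."""
    losses = []
    for theta_star in library_thetas:        # outer average: prior pi(theta*)
        for _ in range(n_replicates):        # inner average: P(z | theta*; e)
            z = simulate_flowseq(theta_star, **e)
            theta_hat = infer_theta(z, **e)  # estimate parameters from read counts
            losses.append(loss_fn(theta_hat, theta_star))
    return float(np.mean(losses))

# Example usage with a small synthetic "library" acting as the prior
library = [(2.5, 0.4), (3.0, 0.6), (4.2, 0.8)]
e = dict(n_cells=100_000, n_reads=1_000_000, n_bins=8)
# r = integrated_risk(library, e, loss_fn=wasserstein_loss, infer_theta=my_estimator)
```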