Soft windowing application to improve analysis of high-throughput phenotyping data

Abstract Motivation High-throughput phenomic projects generate complex data from small treatment and large control groups that increase the power of the analyses but introduce variation over time. A method is needed to utlize a set of temporally local controls that maximizes analytic power while minimizing noise from unspecified environmental factors. Results Here we introduce ‘soft windowing’, a methodological approach that selects a window of time that includes the most appropriate controls for analysis. Using phenotype data from the International Mouse Phenotyping Consortium (IMPC), adaptive windows were applied such that control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. We applied this method to IMPC data and compared the results with those obtained from a standard non-windowed approach. Validation was performed using a resampling approach in which we demonstrate a 10% reduction of false positives from 2.5 million analyses. We applied the method to our production analysis pipeline that establishes genotype–phenotype associations by comparing mutant versus control data. We report an increase of 30% in significant P-values, as well as linkage to 106 versus 99 disease models via phenotype overlap with the soft-windowed and non-windowed approaches, respectively, from a set of 2082 mutant mouse lines. Our method is generalizable and can benefit large-scale human phenomic projects such as the UK Biobank and the All of Us resources. Availability and implementation The method is freely available in the R package SmoothWin, available on CRAN http://CRAN.R-project.org/package=SmoothWin. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
High-throughput, large-scale phenotyping studies evaluate variables of an organism's biological systems to examine the contribution of genetic and environmental factors to phenotypes. Standardized phenotyping screens that cover a wide range of biological systems have made useful insights for identifying new genetic contributors to robust phenotypes when compared with more focussed studies that often target well-characterized genes with varying reproducibility (Begley and Ellis, 2012;Edwards et al., 2011;Freedman et al., 2015;Prinz et al., 2011;Stoeger et al., 2018). Leveraging economies of scale and using standardized procedures, high-throughput phenotyping screens addresses these challenges and have been applied in biological screening of chemical compound libraries, agricultural evaluation of crop plants, genome-wide CRISPR-based mutagenic cell line screens and multi-centre phenotypic screening of mutated model organisms (Al-Tamimi et al., 2016;Dickinson et al., 2016;Flood et al., 2016;Friggens et al., 2011;Malinowska et al., 2017;Sun et al., 2017;Vitak et al., 2017;Viti et al., 2015). The continuous generation of large volumes of data introduces new challenges affecting automated approaches to statistical analysis that have to scale with increasing data and address the underlying complexity inherent in large projects (Kurbatova et al., 2015;Meyers et al., 2017;Vaas et al., 2013Vaas et al., , 2012. The International Mouse Phenotyping Consortium (IMPC) is a G7 recognized global research infrastructure dedicated to generating and characterizing a knockout mouse line for every protein-coding gene (Bradley et al., 2012;Brown and Moore, 2012;Hrab e de Angelis et al., 2015). Currently, the IMPC has phenotyped over 148 000 knockouts and 43 000 control mice (data release 9.2, January 2019) across 12 research centres in 9 countries. These centres adhere to a set of standardized phenotype assays defined in the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS), and designed to measure over 200 parameters on each mouse. As part of these standardized operating procedures, critical factors that can impact data collection, such as reagent type or equipment, are reported as required metadata. Phenotype data are then centrally collected and quality controlled by trained professionals before being released for analysis. All phenotype data are processed by the statistical analysis package PhenStat-a freely available R package that provides a variety of statistical methods for the identification of genotype to phenotype associations by comparing mutant to control data that have the same critical attributes (Kurbatova et al., 2015). For quantitative data, linear mixed models are typically employed with several factors modelled in including genotype, sex, sexgenotype interaction, body weight and batch (i.e. phenotype measures collected on the same day). Mutant mouse lines found to have a significant deviation in phenotype measurements are assigned a phenotype term from the Mammalian Phenotype Ontology (Blake et al., 2017). These associations, as well as the raw data, are disseminated via the web portal (https://www.mousephenotype.org) using application programming interfaces and data downloads.
A challenge with high-throughput phenotyping efforts is the small sample size for the experimental group (i.e. the knockout mice) that is produced to maximize the use of finite resources, considering biological relevance and power analysis (Charan and Kantharia, 2013). All mice generated by the IMPC are on the inbred C57BL/6N strain. To reduce genetic drift, IMPC centers maintain wild-type C57BL/6N production colonies that are periodically rederived using commercial vendors (Dickinson et al., 2016;Kurbatova et al., 2015). Mutant F0 mice are bred with wild-type mice from the production colonies to reduce the confounding effects of any de novo, non-targeted mutations. In addition, the IMPC centres are encouraged to measure these knockout mice in two or more batches, as this improves the false discovery rate by modelling in the random effect of day-to-day variation (Karp et al., 2014). In contrast, large control sample sizes accumulate as they provide a strong internal control of the pipeline and typically generated with every experimental batch. Such large control groups represent a unique dataset that increase the power of the subsequent analyses and allow the construction of a robust baseline (Bradley et al., 2012). However, this can lead to the accumulation of heterogeneities including seasonal effects, changes in personnel and unknown time-dependent environmental factors (Karp et al., 2014).
A simple approach to cope with heterogeneity in the data is to set explicit time boundaries (e.g. 1 year) before and after experimental collection dates. This 'hard windowing' approach will capture different time-frames depending on how much time elapses between the first and the last batch of experimental data measured. This approach is unsatisfactory for IMPC data as some mutant lines had enough experimental mice to measure in one batch, while others needed multiple batches over 18 months due to breeding difficulties or other factors. This variation in time-frames can lead to a widely different number of controls being applied to an analysis, making it challenging to explore correlations between mutant lines. Thus, more tuneable approaches were needed.
In this study, we address the complexity of the data collected over time by proposing a novel windowing strategy that we call 'soft windowing'. This approach utilizes a weighting function to assign flexible weights, ranging from 0 to 1, to the control data points. Controls that are collected on or near the date of mutants are assigned the maximal weights, whereas controls at earlier or later dates are assigned less weight. In contrast to the hard windowing, the weighting function in the soft windowing allows for different shapes and bandwidths by alternating the tuning parameters. In addition, we demonstrate how to tune parameters and demonstrate the implementation of the soft windowing on the IMPC data.

System and methods
In high-throughput projects, such as the IMPC, the model parameters may not stay constant over time that can lead to misleading inferences.
Soft windowing application to improve the data analysis For example, Figure 1 illustrates changes to the control group trend and/or variation over time for the Forelimb grip strength normalized against body weight and Mean cell volume. One approach widely used in signal processing (Ford, 2003;Kervrann, 2011;Lima et al., 2009;Poularikas, 2018) is to define a windowing function that includes the appropriate number of data points to capture the effect of interest while minimizing the noise. This is defined by where setting f ðxÞ to a constant, e.g. f ðxÞ ¼ 1, leads to hard windowing, while setting it to a smooth function results in the soft windowing.
The same approach can be generalized to multiple signals (Huang et al., 2007;Li et al., 2007;Tang et al., 2009) or applied as a rolling window (Harel et al., 2008) in the presence of exogenous variables to account for time dependency in the regression coefficients (Brown et al., 2018). Alternatively, we propose a soft windowing approach for the regression methods by defining a weighting function that applies less weight to the residuals outside the window of interest. This leads to distinct advantages over the hard windowing. First, the entire dataset is included in the analysis in contrast to the limited data points in the hard windowing. Second, the windowing and the parameter estimation are coupled, which is a direct result of using the weighted least squares (WLS). Critically, by bounding the controls in a window, we freeze the analysis and abrogate the need for further analysis assuming no new experimental data are generated within the time window.

Algorithm
Our novel windowing strategy explicitly defines the weighting function and proposes a simple but effective set of criteria to estimate the minimal window for the noise-power trade off.

Weight generating function
Let t ¼ ðt 1 ; t 2 ; . . . ; t n Þ represent a set of n continuous time units, m ¼ ðm 1 ; m 2 ; . . . ; m p Þ the time units when the treatments are measured (peaks in the windows), l ¼ fðl 1L ; l 1R Þ; ðl 2L ; l 2R Þ; . . . ; ðl pL ; l pR Þg a set of p non-negative left and right bandwidths and k ¼ fðk 1L ; k 1R Þ; ðk 2L ; k 2R Þ; . . . ; ðk pL ; k pR Þg a set of p positive left and right shape parameters. We impose the continuity on the time to simplify the definition of a continuous function over the time units, e.g. by converting dates to UNIX timestamps. Furthermore, we introduce a peak generating function (PGF) of the form of . . . ; p where Fðx; l; rÞ ¼ Pr X ðX xjl; rÞ is selected from the family of cumulative distribution functions with location l and scale r. In this study, we select F from the family of continuous and symmetric distributions (such as the Logistic, Gaussian, Cauchy and Laplace distributions). Then, we propose a weight generating function (WGF) of the form of where c Ã i ¼ ci maxci denotes the normalized PGF. The first term on the right-hand side of Equation (2) produces the individual windows and the second term accounts for merging the intersections amongst the windows. Figure 2 shows the symmetric WGF (SWGF) that is l iR ¼ l iL and k iR ¼ k iL ; i ¼ 1; 2; . . . ; p, for the different values of k 2 ½0:2; 50 coloured from blue (k ¼ 50) to red (k ¼ 0:2) and for the different values of l ¼ 5; 10; 15. The vertical black dashed lines show the hard window corresponding to the value of l. From this plot, the function is capable of generating a range of windows from hard (blue) to soft (red). Furthermore, the weights lay in the ð0; 1 interval for all values of time; however, they may not cover the entire ð0; 1 spectrum in a bounded time domain. Then, the weights are normalized to be ranged in ð0; 1 before inserting into the WGF as shown by c Ã i in Equation (2). Figure 3 shows the merge capability of the SWGF for the logistic F with m ¼ 15; 35 and different values of k ¼ 0:5; 1:5; 3 and l ¼ 6; 8; 10; 12. From this figure, the function is capable of producing a range of flexible multimodal windows (top) as well as aggregated windows (bottom) if m 1 þ l j j> jm 2 À lj for all m 1 < m 2 ; l 2 R. In all cases, the weights lay in the ð0; 1 interval.

Windowing regression
Let y ¼ xb þ e denote a linear model, with y, x, b and e representing response, covariates, unknown parameters and independent random noise e $ N 0; r 2 < 1 À Á , respectively. Imposing the weights in Equation (2) on the residuals leads to the following WLS: where : j j j j 2 denotes the second norm of a vector. matrix of weights from WGF and ( 0 ) denotes the transpose of a matrix. Weighted linear regression (WLR), in the context of this study, is equivalent to imposing less weight on the off modal time points with respect to m. We illustrate this in Figure 4, where 60 observations are simulated from the following model: iid Nð0; 1Þ and I is the indicator function, In other words, the model is piecewise linear and only significant in the t 2 20; 40 ð Þinterval. Figure 4 (top) shows the global estimation of the linear regression from the entire data (dotted black line) and the WLR by WGFðt; 9; 5; 30Þ (dashed blue line) as well as weights from the WGF on the bottom. This plot shows that the non-WLR leads to a horizontal line, where no significant gradient is detected, whereas the WLR tends to model the significant section of the data that leads to fitting the true line. Figure 4 compares the effect of windowing versus considering the entire dataset, showing the different conclusions.

Selection of the tuning parameters
Selection of the tuning parameters k and l to define the soft window has a strong impact on the final estimations and consequently on the inferences that are made from the statistical results. Indeed, a wide or over-smooth window can lead to the inclusion of too much noise, whereas a small window can result in low power in the analysis. An additional challenge is the direct linear correlation between increasing the number of peaks, m, and to the total number of the parameters for the windows l; k ð Þ that results in significant growth in the computational complexity of the final fitting. This is due to tuning the window in the general form of WLS in Equation (3) requires 2p dimensions in space to search for the optimal l and k. To cope with this complexity, we propose to fix l and k so all windows are symmetric and have the same shape and bandwidth. We then select the tuning parameters by searching the space on the grid of ðl; kÞ values and look for the most significant change in mean and/or variation of the residuals/predictions. The grid is searched by generating a series of scores from applying t-test (to detect changes in mean) and F-test (to detect change in variation) to the consecutive residuals/predictions at each step of expanding (l ! l þ k; k > 0Þ and/or reshaping (k ! k þ a; a > 0) the windows. This technique is based on the assumption that the mean and the variation of the residuals/predictions should remain unchanged in different time periods (St. Laurent, 1994).
To gain the necessary power in the analysis, we apply the statistical tests to the values of l that correspond to a minimum T observations in the windows. Then one can define the quantity of TðlÞ that is the total number of observations that is included in the hard window corresponding to l. We should stress that the definition of TðlÞ in the soft windowing can be challenging because the WGF assigns weights to the entire dataset in the final fitting. To address this complexity, we propose the Sum of Weights Score by SWSðk; lÞ ¼ P n i¼1 WGFðt i ; k; l; mÞ, that is the summation of weights from WGF for specific l and k. Note that SWSðl; kÞ ! TðlÞ with the equality for sufficiently large k. Because l is generally unknown, a value of TðlÞ ¼ T independent of l needs to be decided before the analysis. Our experiments, inspired by the z-test minimal sample size ðn > 30Þ, show that setting SWS ! T with T % maxð35; ffiffiffiffiffiffiffi ffi np 2 p Þ Single peak 35p Multiple peaks ( provides sufficient statistical power and precision for the analysis of each sex-parameter in IMPC.
Once the bandwidth, l, is selected, the shape parameter, k, can be optimized on a grid of values similar to l.
This algorithm is implemented for a broad range of models in the R package SmoothWin that is available from https://cran.r-pro ject.org/package¼SmoothWin. The main function of the package, SmoothWinð. . .Þ, allows an initial model for the input and, given a range of values for the bandwidth and shape, it performs soft windowing on the input model. Furthermore, it allows plotting of the results for diagnostics and further inspections. One also can generate the weights from SWGF using the expWeighð. . .Þ function.

Sensitivity analysis
The sensitivity of the soft windowing to the tuning parameters in particular, the minimum observation required in the window (T), is tested on the two IMPC examples introduced in Figure 1 for Mean cell volume and Forelimb grip strength normalized against body weight. To this end, the tuning parameters l, k and T are set to l The total range of the experiment time divided into 500 logarithmic distanced values; k the values in ½0:5; 10 interval divided into 50 logarithmic distanced values; T the values from 14 to the n divided into 25 logarithmic distanced values where n is the total observation in the dataset. We should stress that l and k are selected to cover the entire experiment range and avoid bias by selecting the incomplete ranges. Then we only study the effect of T on the final fittings. Figure 5 shows the sensitivity of the P-values to the change in the minimum observation required for the soft windowing, T. The left plots show the change in the P-value corresponded to the genotype effect in the linear mixed model (with genotype, sex, genotype-sex interaction and body weight in the fixed effect term and the batch in the random effect) for different values of T. The dashed blue vertical lines show the maximum toleration of T before a step-change in the P-values being observed. The right-hand side plots show the final fitting of the windowed model. The controls (triangles) weight are colour coded on a spectrum of green-purple, inside the window (green), on the border (grey) and outside the window (purple). Figure 5 shows the sensitive of soft windowing to the T, for instance, selection of a high value for T could lead to including too much noise in the final fitting.

Simulation study
To assess the performance of the soft windowing method, we implemented a resampling approach to construct a sample of artificial mutants from the IMPC control data by relabelling some controls as mutant. We then examined the difference in the number of false positives that were detected by the standard (non-windowed) analysis versus the soft-windowed approach. Since the resampling is only performed on the controls, we expect less false positives from the soft-windowed results.
Mutant data in the IMPC have a special structure, resulting from mice being born in the same litters and being phenotyped closely together in time (batch effect), which must be replicated in the resampling approach. We address this by utilizing structured resampling that replaces the mutants with the closest random controls in time. We create artificial mutant groups by randomly sliding the true mutant structure over the time domain of controls, collecting as many controls as there were mutants in the original set and repeating this procedure five times per dataset ( Supplementary Fig. S1 shows an illustration of three iterations of the structured resampling on the Bone Mineral Content parameter).
For non-windowed and soft-windowed analyses, the same statistical model is fitted. That is the linear mixed model implemented in the R package PhenStat with genotype, sex, genotype-sex interactions and body weight for the fixed effect terms and the batch in the random effect. This setup implies that the difference in the results is a direct consequence of the control selection strategy by soft windowing. The outcome of the simulation study consists of 18 IMPC procedures across 11 centres and over 2:5 million analyses and P-values. Comparing the results from the IMPC standard and softwindowed analyses on resampled data, we detect an overall of 14 201 and 12 716 false positives (FP), respectively, at the signficance level used by the IMPC, 0.0001: This constitutes more than a 10% relative improvement in FPs when the soft-windowed method is applied. Table 1 shows the top 10 IMPC procedures with the significant changes in the FPs. From this table, the procedures Body Composition, Open Field, Urinalysis, Heart Weight, Acoustic Startle and Pre-pulse Inhibition account for the highest relative reduction of 68% in FPs, whereas the Clinical Blood Chemistry, X-Ray, Insulin Blood Levels, Electrocardiogram and Eye Morphology account for the maximum increase of 32% in FPs. Supplementary Figure S2 shows parameters from the Body Composition and Clinical Blood Chemistry procudures that showed the biggest loss and gain in false positives for assocaited data parameters, respectively. This plot shows an improvement in decreasing FPs in all IMPC_DXA parameters, which contrasts with an increase in the FPs for IMPC_CBC parameters. We further examined the top two IMPC_CBC parameters, Alanine aminotransferase (IMPC_CBC_013) and Aspartate aminotransferase (IMPC_CBC_012) in Supplementary Figure S3, and noted a high level of randomly deviated points from the mean of controls that can bias the outcome of the structured resampling.

Soft windowing as part of the IMPC statistics pipeline
We next show the performance of the soft windowing approach on IMPC data by integrating it into the standard IMPC statistics pipeline in PhenStat (Kurbatova et al., 2016). To this end, each dataset is processed by the PhenStat for the initial estimation of a fully saturated linear mixed model including genotype, sex, genotype-sex interaction and body weight in the fixed effect term and the batch in the random effect. The resulting fit is then passed into the soft windowing algorithm in the R package SmoothWin for the determination of the optimal windowing weights. After determining the optimal weights, the final model is fitted using a weighted linear mixed model and utilizing a backward elimination approach to optimize the final model.
Using data release 9.2 (January 2019), we re-analysed 14 millionþ data points from which 10 millionþ are mutant animals across the range of IMPC phenotyping procedures. The original IMPC standard analysis that did not apply the soft windowing approach to select the control data encompassed 403 000þanalyses and P-values. This analysis led to 12 728 significant P-values (<0:0001Þ, compared with 16 415 significant P-values when the soft windowing was applied, an increase of 30% in total significant P-values. The IMPC assigns mouse lines with phenotype terms from the Mouse Phenotype Ontology (MPO) when a significant deviation from the control data is detected for a given data parameter (Meehan et al., 2017). Our windowing approach led to 17 391 MPO associations gained and 15 996 associations lost. To explore these differences further, we created an online tool that displays the entire control dataset for a given mouse line-parameter assay with the statistical summaries for both the non-windowed methodology and the soft-windowed approach. Users may filter on a number of attributes, arrange filter order, zoom in on data visualization or navigate directly to the results (https://wwwdev.ebi.ac.uk/mi/impc/ dev/phenotype-archive/media/images/windowing/). Figure 6 shows the corresponding visualization on the IMPC website for the complete dataset (including males and females) previously shown for males only in Figure 1 (top) for the Forelimb grip strength normalized against body weight parameter from the IMPC Grip Strength procedure. The soft window is indicated, as well as shows the response over time as well as the fitted soft windows. The tables underneath show the comparison between the descriptive statistics obtained from the standard (non-windowed) analysis on the left and the soft-windowed approach on the right. The P-values correspond to the genotype effect after applying the statistical analyses taking the corresponding controls based on the non-window and soft-windowed approaches, respectively changes in the total number of controls (here 1; 572 fewer after soft windowing-counting soft windowing weights >10 À7 ). Furthermore, the P-value corresponding to the genotype effect shows a significant change in magnitude, from 2:05 Â 10 À4 to 6:75 Â 10 À18 after applying the soft windowing. We then tested if our soft-windowed analysis changed our human disease model discovery rate. We have previously described the IMPC Phenodigm translational pipeline that automatically detects phenotypic similarities between the IMPC strains and over 7 000 rare diseases described in the Online Mendelian Inheritance in Man (OMIM), Orphanet and the Deciphering Developmental Disorders (DDD) databases (Meehan et al., 2017). This pipeline generates qualitative scores on how well a mouse line's associated phenotypes overlap with the phenotypes of the human rare disease populations (Akawi et al., 2015;Firth et al., 2009;Meehan et al., 2017;Mungall et al., 2015;OMIM Browser, 2017;Rath et al., 2012). By comparing the disease model resulting from our soft-windowed analysis versus non-windowed analysis for IMPC data release 9:2, we find a slight increase in the number of disease models (106 versus 99 models using a threshold of 50% phenotype overlap from a set of 2082 mouse lines that contain mutations-Supplementary File SI).

Discussion
High-throughput phenomics is a powerful tool for the discovery of new genotype-phenotype associations and there is an increasing need for innovative analyses that make effective use of the voluminous data being generated. Batch effects are inevitable when a large amount of data is collected at different times and/or sites and, therefore, need to be accounted for in the statistical analysis. In this study, we developed a novel 'soft windowing' method that selects a window of time to include controls that are locally selected with respect to experimental animals, thus reducing the noise level in the data collected over long periods of time (years). Soft windowing has notable advantages over a more traditional hard windowing approach. In contrast to the limited data points included in the hard windowing method, the entire dataset is considered for the analysis. To this end, we engineered a weighting function to produce weights in the form of a window of time. Control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. This method has the capability of producing indivdual windows as well as merging intersected ones. Moreover, the method was implemented to automatically select window size and shape.
The performance of the method was shown on a simulated scenario that uses real control data collected by the IMPC highthroughput pipelines to assess detection of false positives. We also showed the enhancements to the IMPC statistical pipeline that establishes genotype-phenotype associations by comparing mutants versus control data using our soft-windowed approach.
There are two known conditions that affect the method: (i) the WGF can be slow when there are too many (>20) distinct windows, however, we have optimized the algorithm to be fast enough for the typical IMPC number of peaks (%3s for 1500 samples and 16 peaks under k ¼ 1 and l ¼ 30); and (ii) our resampling scenario indiciated that our soft windowing approach is sensitive to the data that have a high level of outliers or random deviation from the mean. This may result from a bias in the design of the resampling but may also indicate that using all available controls may be appropriate for the cases with extreme variability.
Our soft windowing approach addresses the scaling issues associated with analysing an ever-increasing set of control data in longterm projects by eliminating controls with weights sufficiently close to zero from future analysis. In the case of the IMPC, once a window of control data is determined for a dataset, there would be no further requirement to re-analyse the dataset with each subsequent data release. This will reduce the computational resources needed with the resulting gene-phenotype associations remaining stable, greatly facilitating data exchange with research groups trying to functionally validate genes and their disease variants. Our findings also have important implications for such efforts as the UK BioBank and the All of Us initiatives where large cohort sizes coupled with mobile medical sensors are generating phenotype data at an unprecedented rate (Sankar and Parker, 2017;Sudlow et al., 2015). Researchers performing restrospective analysis to analyse exposures for a defined outcome group (e.g. metabolic disease) are challenged by the variability and longitudinal characteristics associated with these datasets. The methods described here can be used with these human health resources to maximize analytical power and help researchers find the genetic and environmental contributers to human diseases.