DOSE-L1000: unveiling the intricate landscape of compound-induced transcriptional changes

Abstract Motivation The LINCS L1000 project has collected gene expression profiles for thousands of compounds across a wide array of concentrations, cell lines, and time points. However, conventional analysis methods often fall short in capturing the rich information encapsulated within the L1000 transcriptional dose–response data. Results We present DOSE-L1000, a database that unravels the potency and efficacy of compound-gene pairs and the intricate landscape of compound-induced transcriptional changes. Our study uses the fitting of over 140 million generalized additive models and robust linear models, spanning the complete spectrum of compounds and landmark genes within the LINCS L1000 database. This systematic approach provides quantitative insights into differential gene expression and the potency and efficacy of compound-gene pairs across diverse cellular contexts. Through examples, we showcase the application of DOSE-L1000 in tasks such as cell line and compound comparisons, along with clustering analyses and predictions of drug–target interactions. DOSE-L1000 fosters applications in drug discovery, accelerating the transition to omics-driven drug development. Availability and implementation DOSE-L1000 is publicly available at https://doi.org/10.5281/zenodo.8286375.

Steps of the statistical tests for identifying differentially expressed genes Let (, ) represent the mean response of a gene at concentration of  and time of .For models assuming the form of Eq. (3) and Eq. ( 4) in the main text, the statistical test entails the following steps: 1. State null and alternative hypotheses:  : ( * ,  * ) = (0,  * )  : ( * ,  * ) ≠ (0,  * ) 2. Set the significance level:  = 0.05.Let () represent the mean response of a gene at concentration of .For models assuming the form of Eq. ( 5) in the main text, the statistical test entails the following steps:
observations and the number of parameter estimates, respectively.
Let (, ) represent the mean response of a gene at concentration of  and time of .For models assuming the form of Eq. ( 6) in the main text, the statistical test entails the following steps: 1. State null and alternative hypotheses: given by: Let  (i.e., log [ ]) denote the log10 concentration at which  achieves the midpoint between  and  , i.e.,  = .We can show that if  ( ) is non-zero, the standard error of the estimated  is given by: First, let's consider the case where  ( ) is non-zero.It is imperative that  ( ) is negative; otherwise,  would not be the minimal log10-transformed concentration where  achieves the midpoint between  and  .This prerequisite enables us to establish a positive  such that  is monotonically decreasing and hence invertible on the interval ( − ,  + ).
Taking the partial derivative of  with respect to  * (1 ≤  In cases where  ( ) equals zero, we resort to bootstrap resampling to calculate the standard error of  .Similar solutions can be derived for GAMs described by Eq. ( 3) and Eq. ( 4) in the main text.

Comparison between the delta method and bootstrapping
Bootstrapping is a common method to estimate standard errors when closed-form solutions do not exist.It typically involves at least 1000 resamples to ensure reliable results, making it a time-consuming process.Deriving the closed-form expression (as detailed in the previous section) enables the delta method to efficiently calculate the standard error of potency.To understand the advantage of the delta method in speed, we compared its performance against bootstrapping using simulated data.Data (i.e., ground truth) were simulated based on the following equation:  = 5 + + , where  and  denote the log10-transformed compound concentration and log2-transformed gene expression level, respectively. ranges from -3 to 3 in increments of 0.5. is the normally distributed error term with mean equal to 0 and standard deviation equal to 0.1, i.e., ~(0, 0.1).We applied the delta method and bootstrapping with 1000 resamples separately to estimate the standard errors of log [ ] and log [𝐸𝐶 ].The delta method ran 813 times faster than bootstrapping.Our findings can be generalized to real-world settings, as the data we simulated here have the same structure as the gene expression data we fitted the models to.Given the need to fit nearly 20 million GAMs to the data, we determined that the delta method enabled by the closed-form expression of the standard error was the key to efficiently accomplishing this task.

Generalized additive mixed models and robust linear mixed models
In cases where the number of time points equals 1, generalized additive mixed models (GAMMs) treating the plate variable as a random effect takes the form: Here,  represents the log2-transformed gene expression level in the  th sample on the  th plate,  is the log10-transformed compound concentration,  is the plate effect, and  is the normally distributed error term.The smoothing spline function  is defined the same as in Eq. ( 2) in the main text.Model fitting is implemented using the gam() function within the mgcv R package (Wood 2017).
Under the same circumstances, robust linear mixed models (RLMMs) treating the plate variable as a random effect takes the form: Here,  denotes the mean of the  th concentration (including  = 0, representing the zero concentration),  denotes the plate effect,  indexes replicates, and  is assumed to follow a normal distribution.Model fitting is implemented using the rlmer() function within the robustlmm R package (Koller 2016).
For cases where the number of time points exceeds 1, GAMMs and RLMMs can be derived similarly as explained in Sections 2.1 and 2.2 in the main text.
To identify known HDAC inhibitors, we obtained the complete compound-target interaction dataset from Drug Target Commons (Tang et al. 2018).A compound was considered as an HDAC inhibitor if its gene symbol and activity type matched the keywords "HDAC" and "INHIBITION", respectively, in more than five instances.The efficacy profiles of all DOSE-L1000 compound-gene pairs available within the MCF-7, PC-3, and A549 cells at 6 hours were utilized for subsequent analysis.HDAC inhibitors were designated as benchmark ligands, while the remaining compounds constituted the background drugs.
Bridge Adjusted Expression Similarity (BAES) scores, as previously described, were computed (Wang et al. 2013).For enhanced clarity, we outline the detailed steps of this calculation here.Genes were ranked by log2 fold change, with the most upregulated genes at the top and the most downregulated genes at the bottom.We employed the gene set enrichment analysis (GSEA) algorithm to compare the efficacy profile of one compound (the query set) to another (the reference set) (Subramanian et al. 2005).For any given compound, all 978 landmark genes were included in the query set.Subsequently, a different compound from the list of HDAC inhibitors was chosen as the reference.All differentially expressed genes of the reference compound, defined as those with adjusted p-values less than 0.05, were included in the reference set.The GSEA algorithm generated an enrichment score, reflecting the degree of similarity between the queried compound and the reference compound (Subramanian et al. 2005).A pair of drugs could yield two enrichment scores by interchanging the roles of the query set and reference set.The BAES score was computed as the average of these two enrichment scores.The GSEA algorithm was implemented using the fgsea() function in the fgsea R package (Korotkevich et al. 2016).
To identify known gene-target pairs, we calculated the likelihood of interaction (LOI) as the average BAES score to the benchmark ligands and developed a simple binary classifier: compounds with LOI exceeding the threshold were classified as HDAC inhibitors, while those falling below the LOI threshold were not (Wang et al. 2013).The optimal LOI threshold was determined by selecting the threshold that maximized balanced accuracy.The performance of the classifier was evaluated using leave-one-out cross-validation (LOOCV).Each DOSE-L1000 compound, either a benchmark ligand or a background drug, took turns as the validation set.The remaining compounds constituted the training set.The class of the compound in the validation set was predicted based on the threshold determined from the training set.The sensitivity and specificity of the classifier were evaluated based on all validation sets combined.