Sstack: an R package for stacking with applications to scenarios involving sequential addition of samples and features

Matlock, Kevin; Rahman, Raziur; Ghosh, Souparno; Pal, Ranadip

doi:10.1093/bioinformatics/btz010

Abstract

Summary

Biological processes are characterized by a variety of different genomic feature sets. However, when building models, portions of these features are often missing for a subset of the dataset. We provide a modeling framework that integrates this type of heterogeneous data to improve prediction accuracy. To test our methodology, we stacked data from the Cancer Cell Line Encyclopedia to increase the accuracy of drug sensitivity prediction. The package addresses the dynamic regime of information integration involving sequential addition of features and samples.

Availability and implementation

The framework has been implemented as an R package, Sstack, which can be downloaded from https://cran.r-project.org/web/packages/Sstack/index.html, where further explanation of the package is available.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Traditional methods for predictive modeling of drug sensitivity have relied on a single model, such as a random forest (RF), built using only one database (Rahman et al., 2017). However, the full characterization of a disease pathway requires a vast array of biological features that may not be available from a single source. When multiple datasets are integrated, some features and samples are often not present in every database. To overcome this limitation, models built on subsets of features and samples can be combined to improve prediction accuracy, a technique known as stacking (Matlock et al., 2018). In this article, we provide an R package that can generate integrated stacking models from a general staircase structure of samples and features. Using the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012), we illustrate the advantages of horizontal stacking compared with other forms of stacking. This package should benefit practitioners designing predictive models from databases that contain samples with mixed sets of features.

2 Methods and results

The problem is formulated using the template shown in Figure 1. In this figure, we have a set of data from up to $n$ separate databases, where each database represents a single 'layer' in the template. Only a subset of features is shared between any two databases, and an even smaller subset is shared between all databases. We denote this type of data as a heterogeneous dataset and consider three methods of stacking it. The first method, horizontal (H) stacking, involves building individual models using all samples available for each independent set of features. Vertical (V) stacking builds individual models using samples with common sets of features. Finally, L-shaped stacking attempts to combine the top models from the previous two methods: the first set consists of the $n / 2$ models from H stacking that contain the largest amount of sample data, and the remaining $n / 2$ models are the V models that contain the largest number of features. The individual models are all RFs with 100 trees, and the remaining parameters are the defaults found in most implementations: a leaf size of 5 and a node-splitting criterion of one-third of the number of features.
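The grouping behind H and V stacking can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sstack implementation: each sample is described by the set of feature blocks it possesses, H stacking is read as one model per feature block trained on every sample that has that block, and V stacking is read as one model per distinct feature set trained on every sample that has at least that set. The function names `horizontal_groups` and `vertical_groups` and the toy data are hypothetical.

```python
from collections import defaultdict


def horizontal_groups(sample_blocks):
    """Assumed H reading: one model per feature block, using every
    sample that has that block."""
    groups = defaultdict(list)
    for sample, blocks in sample_blocks.items():
        for block in blocks:
            groups[block].append(sample)
    return {block: sorted(samples) for block, samples in groups.items()}


def vertical_groups(sample_blocks):
    """Assumed V reading: one model per distinct feature set, using
    every sample whose blocks include that whole set."""
    distinct_sets = {frozenset(blocks) for blocks in sample_blocks.values()}
    return {
        feature_set: sorted(
            sample
            for sample, blocks in sample_blocks.items()
            if feature_set <= blocks
        )
        for feature_set in distinct_sets
    }


# Hypothetical three-layer staircase: block "A" is shared by all samples,
# "B" by fewer samples and "C" by the fewest.
toy_staircase = {
    "s1": {"A", "B", "C"},
    "s2": {"A", "B", "C"},
    "s3": {"A", "B"},
    "s4": {"A"},
    "s5": {"A"},
}

h_models = horizontal_groups(toy_staircase)
v_models = vertical_groups(toy_staircase)
```

Under this reading, the H model for block "C" and the V model for the set {"A", "B", "C"} see the same two samples but differ in the features they use, which is the kind of trade-off that L-shaped stacking selects between.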

Fig. 1. Illustration of the problem of stacking heterogeneous datasets

2.1 Bootstrap stacking

Our stacked models utilize a set of linear weights with an offset, $w$, to combine the output of individual models into the final prediction. Given a set of $1 \times n$ predictions, $y$, and a set of $(n + 1) \times 1$ weights, the final prediction $y_{f}$ is given by Equation (1):

$$y_{f} = w_{0} + w_{1} y_{1} + w_{2} y_{2} + \cdots + w_{n} y_{n}. \qquad (1)$$
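As a small worked example of Equation (1), the following Python sketch combines the predictions of the individual models with an offset and one weight per model. The function name `stacked_prediction` and the numbers are hypothetical and chosen only for illustration.

```python
def stacked_prediction(weights, predictions):
    """Combine n model predictions with n + 1 weights as in Equation (1).

    weights[0] is the offset w_0; weights[1:] pair with the predictions.
    """
    if len(weights) != len(predictions) + 1:
        raise ValueError("expected one offset plus one weight per prediction")
    return weights[0] + sum(w * y for w, y in zip(weights[1:], predictions))


# Hypothetical values: offset 0.5, two individual models.
example_weights = [0.5, 0.25, 0.75]
example_predictions = [2.0, 4.0]
y_f = stacked_prediction(example_weights, example_predictions)
# 0.5 + 0.25 * 2.0 + 0.75 * 4.0 = 4.0
```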

Our objective is to estimate the optimal $w$ using the set of samples that are common among all the individual models. However, estimating $w$ with the same samples used for building the individual models would lead to overfitting. To prevent this, we perform bootstrap sampling to select a subset of the common samples. Only the selected samples are used for building the individual models. The samples that are not selected are used for estimating the weights, $\tilde{w}^{l}$. This process is repeated with a new set of bootstrapped samples for $l = 1, 2, \ldots, N_{bs}$ iterations. The final set of weights is the average of all the estimated weights:

$$w_{i} = \frac{1}{N_{bs}} \sum_{l = 1}^{N_{bs}} \tilde{w}_{i}^{l}. \qquad (2)$$
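The bootstrap procedure and the averaging in Equation (2) can be outlined as follows. This is a minimal sketch rather than the package's code: the callback `estimate_weights` is a hypothetical stand-in for building the individual models on the in-bag samples and fitting the weights on the left-out samples, and skipping a draw that leaves no samples out is an assumption made here for robustness.

```python
import random


def bootstrap_split(common_samples, rng):
    """Draw a bootstrap sample (with replacement) from the common samples.

    Returns the drawn samples and the samples that were never drawn.
    """
    drawn = [rng.choice(common_samples) for _ in common_samples]
    chosen = set(drawn)
    left_out = [s for s in common_samples if s not in chosen]
    return drawn, left_out


def averaged_weights(common_samples, estimate_weights, n_bs=25, seed=0):
    """Average the per-iteration weight estimates as in Equation (2).

    estimate_weights(in_bag, left_out) is a hypothetical callback that
    builds the individual models on in_bag and returns the weight list
    fitted on left_out.
    """
    rng = random.Random(seed)
    totals = None
    used = 0
    for _ in range(n_bs):
        in_bag, left_out = bootstrap_split(common_samples, rng)
        if not left_out:
            # Assumption: a draw with nothing left out is skipped, so the
            # average is taken over the iterations actually used.
            continue
        w = estimate_weights(in_bag, left_out)
        totals = list(w) if totals is None else [t + x for t, x in zip(totals, w)]
        used += 1
    if used == 0:
        raise ValueError("no bootstrap iteration left any sample out")
    return [t / used for t in totals]
```

Because the weights are fitted only on samples that the individual models did not see in that iteration, each estimate $\tilde{w}^{l}$ reflects out-of-sample behavior, which is the property the text relies on to avoid overfitting.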

3 Performance

To illustrate the benefits of heterogeneous data stacking, we have taken from CCLE (Barretina et al., 2012) the gene expression values for more than 18 000 genes and the area under the dose-response curve for the drug 17-AAG for 490 cancer cell lines. The feature selection algorithm RELIEFF is used to pick the top 300 gene expression features. We divide the chosen features and samples into four separate datasets using the biased methodology detailed in Supplementary Section 1.3. The goal of our models is to predict the area under the dose-response curve for a given cancer cell line by stacking these heterogeneous datasets. The number of bootstrap iterations, $N_{bs}$, is set heuristically to 25 (see Supplementary Material). The accuracy is estimated by randomly selecting $N_{s}$ samples, using 80% of them for training and leaving out the remaining 20% for testing. After the error is estimated, this process is repeated 500 times with a new training/testing set in each iteration; $N_{s}$ is then increased and the experiment is repeated. For a baseline comparison, we also build a single model using the K-nearest neighbor impute method (Liew et al., 2011) to estimate the missing features. The mean-square error estimates of the stacked model predictions for an increasing number of samples are shown in Figure 2. These models are built on a 4-layer ($n = 4$) heterogeneous dataset. In addition, we have calculated the 90% bootstrap confidence interval from all 500 iterations, which is also displayed in Figure 2. We note that horizontal stacking outperforms the other stacking methods even for small sample sizes, and that it also outperforms imputation at all but the smallest sample sizes. Similar graphs for different numbers of layers are shown in the Supplementary Material, which also contains the results of stacking CCLE data with the MD Anderson Cell Lines Project to improve drug sensitivity prediction (Li et al., 2017).
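The repeated 80/20 evaluation described above can be sketched as follows. This is an illustrative outline only: `fit_predict` is a hypothetical callback standing in for training any of the compared models on the training samples and predicting the test samples, and the toy targets are not CCLE data.

```python
import random


def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def holdout_split(samples, rng, train_fraction=0.8):
    """Shuffle the samples and split them into training and testing parts."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def repeated_holdout_mse(samples, targets, fit_predict, n_repeats=500, seed=0):
    """Return one test-set MSE per random 80/20 split.

    fit_predict(train, test) is a hypothetical callback that trains on the
    train samples and returns predictions for the test samples, in order.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = holdout_split(samples, rng)
        predicted = fit_predict(train, test)
        actual = [targets[s] for s in test]
        errors.append(mean_squared_error(actual, predicted))
    return errors


# Toy usage with a predictor that always returns the training mean.
toy_targets = {i: float(i) for i in range(10)}


def mean_predictor(train, test):
    mean = sum(toy_targets[s] for s in train) / len(train)
    return [mean for _ in test]


toy_errors = repeated_holdout_mse(list(toy_targets), toy_targets,
                                  mean_predictor, n_repeats=50)
```

The list of per-split errors is what a summary such as a mean and an interval would be computed from; how the 90% interval in Figure 2 is derived from the 500 iterations is not specified here, so it is left out of the sketch.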

Fig. 2. Mean-square error analysis of horizontal ($H_s$), vertical ($V_s$) and L ($L_s$) stacked models built on a 4-layer heterogeneous dataset. Models are compared with a single model that uses K-nearest neighbor impute (Knn) to estimate the missing features. Error bars show the 90% bootstrap confidence interval calculated using 500 iterations

4 Conclusions

This framework has been implemented as an R package named Sstack. The presented computational framework provides (i) automated formulation of the supplied datasets into the staircase format that enables the presented forms of stacking and (ii) optimized stacking for RF-based predictive modeling that integrates sample sets with different sets of features. In addition, we have shown that horizontal stacking outperforms the other forms of stacking.

Funding

This work was supported by the National Institutes of Health grant [R01GM122084].

Conflict of Interest: none declared.

References

Barretina,J. et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607.

Li,J. et al. (2017) Characterization of human cancer cell lines by reverse-phase protein arrays. Cancer Cell, 31, 225–239.

Liew,A.W.C. et al. (2011) Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief. Bioinform., 12, 498–513.

Matlock,K. et al. (2018) Investigation of model stacking for drug sensitivity prediction. BMC Bioinformatics, 19, 71.

Rahman,R. et al. (2017) Heterogeneity aware random forest for drug sensitivity prediction. Sci. Rep., 7, 11347.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
