ShinyLearner: A containerized benchmarking tool for machine-learning classification of tabular data

Abstract

Background: Classification algorithms assign observations to groups based on patterns in data. The machine-learning community has developed myriad classification algorithms, which are used in diverse life-science research domains. Algorithm choice can affect classification accuracy dramatically, so it is crucial that researchers choose which algorithm(s) to apply in a given research domain on the basis of empirical evidence. In benchmark studies, multiple algorithms are applied to multiple datasets, and the researcher examines overall trends. In addition, the researcher may evaluate multiple hyperparameter combinations for each algorithm and use feature selection to reduce data dimensionality. Although software implementations of classification algorithms are widely available, robust benchmark comparisons are difficult to perform when the algorithms of interest span multiple software packages: programming interfaces, data formats, and evaluation procedures differ across packages, and dependency conflicts may arise during installation.

Findings: To address these challenges, we created ShinyLearner, an open-source project for integrating machine-learning packages into software containers. ShinyLearner provides a uniform interface for performing classification, irrespective of the library that implements each algorithm, thus facilitating benchmark comparisons. In addition, ShinyLearner enables researchers to optimize hyperparameters and select features via nested cross-validation; it tracks all nested operations and generates output files that make these steps transparent. ShinyLearner includes a Web interface to help users more easily construct the commands necessary to perform benchmark comparisons. ShinyLearner is freely available at https://github.com/srp33/ShinyLearner.

Conclusions: This software is a resource for researchers who wish to benchmark multiple classification or feature-selection algorithms on a given dataset. We hope it will serve as an example of combining the benefits of software containerization with a user-friendly approach.


Background
Classification falls under the category of supervised learning, a branch of machine learning. When performing classification, researchers seek to assign observations to distinct groups. For example, medical researchers use classification algorithms to identify patterns that predict whether patients have a particular disease, will respond positively to a particular treatment, or will survive a relatively long period of time after diagnosis [1-11]. Applications in molecular biology include annotating DNA sequence elements, identifying gene structures, and predicting protein secondary structures [12].

Typically, a classification algorithm is "trained" on a dataset that contains samples (observations) from two or more groups, and the algorithm identifies patterns that differ among the groups. If these patterns are reliable indicators of group membership, the algorithm will be able to accurately assign new samples to these groups and thus may be suitable for broader application. Different research applications require different levels of accuracy before a classification algorithm is suitable for broader use; however, even small improvements in accuracy can provide large benefits. For example, if an algorithm predicts drug-treatment responses for 1000 patients and attains an accuracy 2% higher than a baseline method, it would provide correct predictions for 20 additional patients. Accordingly, a key focus of classification research in the life sciences is to identify generalizable ways to optimize prediction accuracy.

The machine-learning community has developed hundreds of classification algorithms and has incorporated many of them into open-source software packages [13-18]. Each algorithm has different properties, which affect its suitability for particular applications. In addition, most algorithms support hyperparameters, which alter an algorithm's behavior and can affect its accuracy dramatically. Furthermore, feature-selection (or feature-ranking) algorithms can be used alongside classification algorithms, helping to identify combinations of variables that are most predictive of group membership and aiding in data interpretation [19,20]. With this abundance of options to consider, researchers face the challenge of identifying which algorithm(s), hyperparameter combinations, and features are optimal for a particular dataset.

To improve the odds of making successful predictions, researchers should choose algorithms, hyperparameters, and features based on empirical evidence rather than hearsay or anecdotal experience. Prior studies can provide insight into algorithm performance, but few studies evaluate algorithms comprehensively, and performance may vary widely for different types of data. One way to select these options empirically is via nested cross-validation [21]. With this approach, a researcher divides a single dataset into training and validation sets. Within each training set, the researcher divides the data further into training and validation subsets and then evaluates various options using these subsets. The top-performing option(s) are then used when making predictions on the outer validation set. Alternatively, a researcher might perform a benchmark study, applying (non-nested) cross validation to multiple datasets from a given research domain. After testing multiple algorithms, hyperparameters, and/or feature subsets, the researcher can examine overall trends and identify options that tend to perform well [22,23].

ShinyLearner is designed to be friendly to non-computational scientists; no programming is required. We provide a Web-based tool (http://bioapps.byu.edu/shinylearner) to guide users through the process of creating the Docker commands necessary to execute the software. ShinyLearner supports a variety of input formats and produces output files in "tidy data" format [25], thus making it easy to import results into external tools.
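As a conceptual illustration of such a benchmark comparison, the following sketch applies several classification algorithms to a single dataset using repeated cross-validation in scikit-learn; the dataset and algorithm choices are arbitrary examples and are not part of ShinyLearner.

# Illustrative sketch (not ShinyLearner code): comparing several classification
# algorithms on one dataset with repeated cross-validation, as in a benchmark study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
algorithms = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in algorithms.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUROC = {scores.mean():.3f} (sd = {scores.std():.3f})")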

Even though other machine-learning packages support nested cross validation, these evaluations may occur in a "black box." ShinyLearner tracks all nested operations and generates output files that make this process transparent.

Below we describe ShinyLearner in more detail and illustrate its use via benchmark evaluations. We evaluate 10 classification algorithms and 10 feature-selection algorithms on 10 biomedical datasets. In addition, we assess the effects of hyperparameter optimization on predictive performance, provide insights on model interpretability, and consider practical elements of performing benchmark comparisons.

Figure S1 shows an example ShinyLearner command that a user might execute. For convenience, and to help users who have limited experience with Docker or the command line, we created a Web-based user interface where users can specify local data paths, choose algorithms from a list, and select other settings (https://bioapps.byu.edu/shinylearner). After the user has made these selections, the Web interface generates a Docker command, which the user can copy and paste; commands are generated for the Windows Command Line, Mac Terminal, and Linux Terminal. We used the R Shiny framework to build this web application [34].

The algorithms available in ShinyLearner span methodological categories, including linear models, kernel-based techniques, tree-based approaches, Bayesian models, distance-based methods, ensemble approaches, and neural networks. In selecting algorithms to include, we focused primarily on implementations that can handle discrete and continuous data values, support multiple classes, and produce probabilistic predictions. For each algorithm, we reviewed documentation for the third-party software and identified a representative variety of hyperparameter options. Admittedly, these selections are somewhat arbitrary and not exhaustive; however, they can be extended with additional options. We excluded some algorithm implementations and hyperparameter combinations because errors occurred when we attempted to execute them or because they failed to achieve reasonable levels of classification accuracy on simulated data. Users who add such extensions can request that these changes be included in ShinyLearner via a GitHub pull request.
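As illustrative examples only, the following sketch lists scikit-learn classifiers that represent the methodological categories mentioned above; this is one of the underlying libraries, but the list is not ShinyLearner's exact set of algorithms.

# Illustrative examples (not ShinyLearner's exact algorithm list): scikit-learn
# classifiers representing the methodological categories described above.
from sklearn.linear_model import LogisticRegression        # linear model
from sklearn.svm import SVC                                 # kernel-based
from sklearn.tree import DecisionTreeClassifier             # tree-based
from sklearn.naive_bayes import GaussianNB                  # Bayesian
from sklearn.neighbors import KNeighborsClassifier          # distance-based
from sklearn.ensemble import RandomForestClassifier         # ensemble
from sklearn.neural_network import MLPClassifier            # neural network

# Each of these supports multiple classes and produces probabilistic
# predictions via predict_proba(), properties that guided algorithm selection.
classifiers = [
    LogisticRegression(max_iter=5000),
    SVC(probability=True),
    DecisionTreeClassifier(),
    GaussianNB(),
    KNeighborsClassifier(),
    RandomForestClassifier(),
    MLPClassifier(max_iter=1000),
]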

ShinyLearner supports the following input-data formats: tab-separated value (.tsv), comma-separated value (.csv), and attribute-relation file format (.arff). When tab-separated or comma-separated files are used, column names and row names must be specified; by default, rows must represent samples (observations) and columns must represent features (variables). However, transposed versions of these formats can be used (features as rows and samples as columns); in these cases, the user should use ".ttsv" or ".tcsv" as the file extension. ShinyLearner also accepts files that have been compressed with the gzip algorithm (using ".gz" as the file extension). Users may specify more than one data file as input, in which case ShinyLearner will identify sample identifiers that overlap among the files and merge on those identifiers. If the user requests it, ShinyLearner will scale numeric values, one-hot encode categorical variables [35], and impute missing values.
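The following sketch illustrates these input-handling and preprocessing steps using pandas and scikit-learn; the file names and column contents are hypothetical, and this is not ShinyLearner's implementation.

# Illustrative sketch of the preprocessing steps described above (hypothetical
# file names; not ShinyLearner's implementation).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Samples as rows, features as columns; gzip compression is inferred from ".gz".
clinical = pd.read_csv("clinical.tsv.gz", sep="\t", index_col=0)
# A transposed file (features as rows, samples as columns) is read and transposed.
expression = pd.read_csv("expression_transposed.tsv.gz", sep="\t", index_col=0).T

# Keep only sample identifiers that overlap between the two files.
data = clinical.join(expression, how="inner")

# One-hot encode categorical variables.
data = pd.get_dummies(data, columns=list(data.select_dtypes("object").columns))

# Impute missing values and scale numeric values.
imputed = SimpleImputer(strategy="median").fit_transform(data)
scaled = StandardScaler().fit_transform(imputed)
data = pd.DataFrame(scaled, index=data.index, columns=data.columns)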

ShinyLearner supports Monte Carlo cross validation, in which samples are randomly assigned to training and validation sets; the algorithm is trained on the training set, predictions are made for the validation set, and those predictions are scored using classification metrics. Typically, this process is repeated many times to derive confidence intervals for the accuracy metrics.

In k-fold cross validation, the process is similar, except that the data are partitioned into evenly sized groups and each group serves as the validation set in one round of training and testing. When multiple algorithms or hyperparameter combinations are employed, ShinyLearner evaluates nested training and validation sets, with the goal of identifying the optimal combination for each algorithm. It then uses these selections when making predictions on the outer validation set. Nested cross validation is also used for feature selection: a feature-selection algorithm ranks the features within each nested training set, and different quantities of top-ranked features are used to train the classification algorithm. The feature subsets that perform best are used when making the outer validation-set predictions. Hyperparameter optimization and feature selection may be combined; however, such analyses are highly computationally intensive for large benchmarks.
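The following sketch illustrates the nested cross-validation concept using scikit-learn (not ShinyLearner's code): candidate hyperparameter values and feature-subset sizes are evaluated within inner folds, and the selected combination is scored on the outer validation folds. The dataset, algorithm, and candidate values are arbitrary examples.

# Illustrative sketch of nested cross-validation: hyperparameters and the number
# of top-ranked features are chosen in inner folds, then scored on outer folds.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("classify", SVC()),
])
param_grid = {
    "select__k": [5, 10, 20],              # candidate feature-subset sizes
    "classify__C": [0.1, 1, 10],           # candidate hyperparameter values
    "classify__kernel": ["linear", "rbf"],
}

# Inner loop chooses the best combination; outer loop estimates performance.
inner = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(f"Nested cross-validation AUROC: {outer_scores.mean():.3f}")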

All outputs are stored in tab-delimited files, thus enabling users to import results directly into external analysis tools. ShinyLearner produces output files that contain the following information for each combination of algorithm, hyperparameters, and cross-validation iteration: 1) predictions for each sample, 2) classification metrics, 3) execution times, and 4) standard output, including a log that indicates the arguments that were used, thus supporting reproducibility. When nested cross-validation is performed, ShinyLearner produces output for every hyperparameter combination that was tested in the nested folds and indicates which combination performed best for each algorithm. The classification metrics include, among others, accuracy (the proportion of samples whose discrete prediction was correct) and the area under the receiver operating characteristic curve (AUROC).

As an initial test, we generated a "null" dataset using numpy [66]. We used this dataset to verify that the algorithms performed no better than would be expected by random chance when no true signal was present.
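The following sketch shows how such a null dataset might be generated with numpy and evaluated; the dimensions and classifier shown are hypothetical examples.

# Illustrative sketch (hypothetical dimensions): a "null" dataset in which the
# features carry no information about class labels, so cross-validated AUROC
# should hover near 0.5 (chance level).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # random features
y = rng.integers(0, 2, size=200)      # random class labels, independent of X

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y,
                         cv=10, scoring="roc_auc")
print(f"Mean AUROC on null data: {scores.mean():.3f}")  # expected to be ~0.5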

Initially, we applied 10 classification algorithms to 10 biomedical datasets using default hyperparameters. Most algorithms made near-perfect predictions for the Thyroid, Dermatology, and Iris datasets, whereas predictions were less accurate overall for the remaining datasets (Figure 1). The weka/HoeffdingTree and sklearn/decision_tree algorithms often underperformed relative to the other algorithms (Figure 2).

Indeed, for half of the datasets, weka/HoeffdingTree performed as poorly as or worse than would be expected by random chance. The remaining 8 classification algorithms performed relatively well, but predictive performance varied considerably across the datasets (Figure S3). For example, on the AIDS dataset, the AUROC for mlr/mlp and sklearn/logistic_regression was 0.07 higher than the median, whereas the AUROC for sklearn/svm was 0.14 lower than the median. Several algorithms are implemented in more than one of the underlying libraries; Figures S6 and S7 compare results between these pairs of implementations.

With the exception of sklearn/decision_tree, all classification algorithms produced sample-wise, probabilistic predictions. We examined these predictions for the Diabetes dataset and found that the range and shape of these predictions differed widely across the algorithms (Figure 3). Although many classification metrics, including AUROC, can cope with distributional differences, these differences must be considered in multiple-classifier systems [77].
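The following sketch illustrates, with scikit-learn, how sample-wise probabilistic predictions can be collected from two classifiers and summarized; it is a conceptual example rather than the analysis shown in Figure 3.

# Illustrative sketch: extracting sample-wise probabilistic predictions from two
# classifiers and comparing their distributions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=0)):
    # Probability assigned to the positive class for each held-out sample.
    probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    print(type(model).__name__,
          f"min={probs.min():.2f}, median={np.median(probs):.2f}, max={probs.max():.2f}")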

Classification analysis with hyperparameter optimization

In the second analysis, we applied the same classification algorithms to the same datasets but allowed ShinyLearner to perform hyperparameter optimization via nested cross validation. As few as 2 (mlr/xgboost) and as many as 95 (sklearn/decision_tree and weka/MultilayerPerceptron) hyperparameter combinations were available for each algorithm. In nearly every example, classification performance improved after hyperparameter optimization (Figure 4), sometimes dramatically.

ShinyLearner supports 53 hyperparameter combinations for the keras/dnn algorithm. Each of these combinations altered the algorithm's performance at least to a small degree on every dataset (Figure S8). Performance on the Thyroid dataset varied least across the hyperparameter combinations, perhaps because the number of instances (n = 7200) was nearly 10 times larger than for any other dataset. Generally, this algorithm performed better with a wider architecture containing only two layers. A wider structure greatly increases the parameter space of the network and allows it to learn more complex relationships among features, while limiting the network to only two layers helps prevent overfitting, a common problem when applying neural networks to datasets with a limited number of instances. Adding dropout and L2 regularization also helps to prevent the network from overfitting. In tuning these hyperparameters, we found that a smaller dropout rate, more training epochs, and a smaller regularization rate resulted in higher AUROC values (Figure S9). Figure S10 illustrates, for the Diabetes dataset, that diagnosis predictions can differ considerably depending on which hyperparameter combination is used.
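The following sketch shows, in Keras, the general form of such an architecture (a wide, two-layer network with dropout and L2 regularization); the layer widths, rates, and other settings are hypothetical and do not correspond to ShinyLearner's keras/dnn implementation.

# Illustrative sketch (hypothetical layer sizes and rates): a wide, two-layer
# network with dropout and L2 regularization, the type of architecture that
# tended to perform well in these analyses.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(n_features, n_classes, width=512, dropout_rate=0.2, l2_rate=1e-4):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(width, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_rate)),
        layers.Dropout(dropout_rate),
        layers.Dense(width, activation="relu",
                     kernel_regularizer=regularizers.l2(l2_rate)),
        layers.Dropout(dropout_rate),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage:
# model = build_model(n_features=21, n_classes=3)
# model.fit(X_train, y_train, epochs=50, validation_split=0.2)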

Classification analysis with feature selection

In any dataset, some features are likely to be more informative than other features. We used ShinyLearner to rank the features in each dataset using the 10 feature-selection algorithms. For the Dermatology dataset, these feature ranks were highly consistent across the feature-selection algorithms (Figure S11). The goal of this classification problem was to predict a patient's type of erythemato-squamous disease. Elongation and clubbing of the rete ridges, as well as thinning of the suprapapillary epidermis, were most highly informative of disease type, whereas features such as the patient's age were less informative.
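The following sketch illustrates, with scikit-learn, how feature rankings from two different feature-selection approaches can be compared for consistency; the algorithms and dataset shown are illustrative and differ from those evaluated in the paper.

# Illustrative sketch: ranking features with two feature-selection approaches
# and comparing how consistent the rankings are.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Rank features by mutual information with the class labels.
mi_ranks = np.argsort(np.argsort(-mutual_info_classif(X, y, random_state=0)))

# Rank features by random-forest importance.
rf = RandomForestClassifier(random_state=0).fit(X, y)
rf_ranks = np.argsort(np.argsort(-rf.feature_importances_))

# Spearman correlation indicates how consistent the two rankings are.
rho, _ = spearmanr(mi_ranks, rf_ranks)
print(f"Rank agreement (Spearman's rho): {rho:.2f}")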

However, the number of algorithms implemented in these frameworks is still relatively small. The current release of ShinyLearner supports diverse classification algorithms and hyperparameter options. Also importantly, algorithm performance is likely to differ according to data characteristics.

Algorithms that perform well on "wide" datasets (many features, few samples) may not perform as well on "tall" datasets (many samples, relatively few features). Algorithms that perform well with numeric data may not perform as well on categorical or mixed data. These differences highlight the importance of domain-specific benchmark comparisons.