dsMTL - a computational framework for privacy-preserving, distributed multi-task machine learning

Multitask learning allows the simultaneous learning of multiple ‘communicating’ algorithms. It is increasingly adopted for biomedical applications, such as the modeling of disease progression. As data protection regulations limit data sharing for such analyses, an implementation of multitask learning on geographically distributed data sources would be highly desirable. Here, we describe the development of dsMTL, a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised algorithms and one unsupervised algorithm. dsMTL is implemented as a library for the R programming language and builds on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. We provide a comparative evaluation of dsMTL for the identification of biological signatures in distributed datasets using two case studies, and evaluate the computational performance of the supervised and unsupervised algorithms. dsMTL provides an easy-to-use framework for privacy-preserving, federated analysis of geographically distributed datasets, and has several application areas, including comorbidity modeling and translational research focused on the simultaneous prediction of different outcomes across datasets. dsMTL is available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package).


Introduction
The biology of many human illnesses is encoded in a vast number of genetic, epigenetic, molecular, and cellular parameters. The ability of Machine Learning (ML) to jointly analyze such parameters and derive algorithms with potential clinical utility has fueled a massive interest in biomedical ML applications. One of the fundamental requirements for such ML algorithms to perform well is the availability of data at a large scale, a challenge of steadily declining importance due to the ever-increasing availability of biological data [1-3]. As data can often not be freely exchanged across institutions due to the need to protect individual privacy, the utility of 'bringing the algorithm to the data' is becoming apparent. Technological solutions for this task have thus risen in popularity and exist in various forms. One of the most straightforward approaches is so-called federated ML, where algorithms are simultaneously learned at different institutions and optimized through a privacy-preserving exchange of parameters. Other approaches for this task include the training of ML algorithms on temporarily combined data stored in working memory [4] or the more recently introduced 'swarm-learning' approach [5]. One commonality of most ML algorithms, federated or not, is the assumption that all investigated observations (e.g. illness-affected individuals) represent the same underlying population. However, in biomedicine, this is rarely the case, as biological and technological factors frequently induce cohort-specific effects that limit the ability to identify reproducible biological findings. Multitask Learning (MTL) can address this issue through the simultaneous learning of outcome-associated (e.g. diagnosis-associated) patterns across datasets with dataset-specific, as well as shared, effects.
Multi-task learning has numerous exciting application areas, such as comorbidity modeling, and has already been applied successfully, e.g. for disease progression analysis [6].

Here, we describe the development of dsMTL ('Federated Multi-Task Learning for DataSHIELD'), a package for the statistical software R, for Federated Multi-Task Learning (FeMTL) analysis (Figure 1). dsMTL was developed for DataSHIELD [7], a platform supporting the federated analysis of sensitive individual-level data that remains stored behind the data owner's firewall throughout analysis [8]. dsMTL includes three supervised and one unsupervised federated multi-task learning algorithms that extend algorithms previously developed for non-federated analysis (for R implementations, see [9,10]). Specifically, the dsMTL_L21 approach allows for cross-task regularization, building on the popular LASSO method, in order to identify outcome-associated signatures with a reduced number of features shared across tasks. The non-federated version of this approach has previously been applied to simultaneously predict multiple oncological outcomes using gene expression data [11]. The dsMTL_trace approach constrains the coefficient vectors in a low-dimensional space during the training procedure to penalize the complexity of task relationships, resulting in improved generalizability of the models. In a non-federated implementation, this method has previously been used to predict the response to different drugs, and the identified models showed a high degree of interpretability in the context of the represented drug mechanism [12]. dsMTL_net incorporates task relationships that can be described as a graph, in order to improve biological interpretability. In a non-federated version, this technique has previously been used for the integrative analysis of heterogeneous cohorts [13] and for the prediction of disease progression [14].
The dsMTL_iNMF approach is an unsupervised, integrative non-negative matrix factorization method that aims at factorizing the cohorts' data matrices into shared and dataset-specific components. Such modeling has been applied to explore dependencies in multi-omics data for biomarker identification [10,15]. In addition to the FeMTL methods, we also implemented a federated version of the conventional Lasso (dsLasso) [16] in the dsMTL package due to its wide usage in biomedicine and as a benchmark for testing the performance of the federated MTL algorithms.

To explore the utility of the dsMTL algorithms, we used a network comprising three servers. These servers hosted simulated data with variable degrees of cross-dataset heterogeneity, in order to test the ability of the MTL algorithms to suitably characterize shared and specific biological signatures. In addition, we analyzed actual RNA sequencing and microarray data across the three-server network, to show that dsMTL can perform accurate analyses in acceptable runtime under real network latency.
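
As an illustration of how an L2,1-type cross-task penalty, such as the one dsMTL_L21 builds on, can yield features that are selected jointly across tasks, the following Python sketch applies the row-wise shrinkage step (the proximal operator of the L2,1 norm) to a small coefficient matrix. This is a generic textbook operation and is not taken from the dsMTL code; the function name `prox_l21`, the threshold, and the example values are assumptions made for illustration.

```python
import math


def prox_l21(W, threshold):
    """Row-wise shrinkage: proximal operator of threshold * ||W||_{2,1}.

    W is a list of rows; row j holds the coefficients of feature j
    across all tasks (cohorts). Hypothetical helper, not a dsMTL function.
    """
    shrunk = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        # A feature whose joint norm is below the threshold is removed
        # from every task at once; otherwise the whole row is scaled down.
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        shrunk.append([scale * v for v in row])
    return shrunk


# Three features (rows) x three tasks (columns); the values are made up.
W = [
    [3.0, -4.0, 0.0],  # strong feature with opposite signs across tasks
    [0.3, 0.4, 0.0],   # weak feature
    [0.0, 0.0, 0.0],   # inactive feature
]
W_shrunk = prox_l21(W, threshold=1.0)
```

In this toy example the weak feature is set to zero for all tasks, while the strong feature is kept and retains its task-specific signs. This is the property that allows signatures with shared features but differing directions of effect.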

Results
Here we show the results for two case studies. The first case study aims at demonstrating the utility of the supervised dsMTL_L21 algorithm to identify 'heterogeneous' target signatures across the data network. With 'heterogeneous' we describe signatures that involve the same features (e.g. genes) but with potentially differing signs (indicating differential directions of influence) across datasets. In contrast, 'homogeneous' signatures relate to the same features and signs across datasets. The second case study focuses on the unsupervised dsMTL_iNMF method and explores the utility of the federated implementation, compared to the aggregation of local NMF models, to disentangle shared and cohort-specific components across datasets. For all case studies, we evaluated the signature identification accuracy as the major metric. For predictions of clinical outcomes, the prediction accuracy was also reported.
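
The distinction between 'homogeneous' and 'heterogeneous' signatures can be made concrete with a small Python sketch that compares the support and the sign pattern of per-cohort coefficient vectors. The function `classify_signature` and the third label ("different features") are hypothetical and only serve to illustrate the definitions given above.

```python
def sign(value):
    """Return -1, 0 or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def classify_signature(coefs_by_cohort):
    """Classify coefficient vectors (one per cohort, equal length).

    'homogeneous'        : same selected features and same signs
    'heterogeneous'      : same selected features, signs differ
    'different features' : selected features differ (label assumed here)
    """
    supports = {
        frozenset(j for j, c in enumerate(w) if c != 0)
        for w in coefs_by_cohort
    }
    if len(supports) != 1:
        return "different features"
    sign_patterns = {tuple(sign(c) for c in w) for w in coefs_by_cohort}
    return "homogeneous" if len(sign_patterns) == 1 else "heterogeneous"


# Two cohorts, three genes; the coefficient values are made up.
same_direction = classify_signature([[1.0, -0.5, 0.0], [0.8, -0.7, 0.0]])
flipped_sign = classify_signature([[1.0, -0.5, 0.0], [-0.8, -0.7, 0.0]])
```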

Case study 1 - distributed MTL for identification of heterogeneous target signatures
With the aim of identifying 'heterogeneous' signatures, we compared the performance of dsMTL_L21, dsLasso and the bagging of glmnet models. As part of this, we explored the sensitivity of these methods to different sample sizes (n) relative to the gene number (p). Figure 2 shows the resulting prediction performance and gene selection accuracy, each averaged over 100 repetitions. dsLasso showed the worst prediction performance in this heterogeneous setting, and dsMTL_L21 slightly outperformed the aggregation of local models (glmnet). Similarly, the gene selection accuracy of dsLasso was inferior to that of dsMTL_L21 and glmnet-bagging, which showed similar performance when the sample size was sufficiently large, e.g. when the number of subjects was approximately equal to the number of genes (n/p ~1). However, with a decreasing n/p ratio, dsMTL_L21 showed an increasing superiority over the other methods, especially for n/p=0.15, where the gene selection accuracy of dsMTL_L21 was over 2.8 times higher than that of the bagging technique.

Case study 2
Figure 3 shows the performance of the distributed and the aggregated local NMF methods for disentangling shared and cohort-specific signatures from multi-cohort data, given different 'severities' of the signature heterogeneity. For both types of signatures, dsMTL_iNMF outperformed the ensemble of local NMF models for any heterogeneity severity setting. Notably, even with increasing heterogeneity, the accuracy of dsMTL_iNMF in capturing shared genes remained stable at approximately 100%, illustrating the robustness of dsMTL_iNMF against the severity of the heterogeneity (Figure 3c).
In contrast, for the ensemble of local NMF models, the gene selection accuracy for the shared signature continuously decreased to approximately 50% (when 20% of outcome-associated genes were shared among cohorts), while the gene selection accuracy for cohort-specific signatures continuously increased to 75% (when 20% of outcome-associated genes were shared among cohorts), as shown in Figures 3a and 3b.

Efficiency of supervised dsMTL
We aimed at determining the efficiency of supervised dsMTL using real molecular data and the actual latency of a distributed network. Using a three-server scenario (see Table 2, Supplementary Results; two servers at the Central Institute of Mental Health, Mannheim; one server at BioQuant, Heidelberg University), we analyzed four case-control gene expression datasets of patients with schizophrenia and controls (median n=80; 8013 genes). Supplementary Table 3 shows the comparison between dsLasso and mean-regularized dsMTL_net, which were trained (cross-validation + training) and tested in approximately 8 min and 10 min, respectively, with the time difference being due to the increased network access of dsMTL. The prediction accuracy of dsMTL was slightly higher than that of dsLasso, consistent with our previous study [13]. Regarding model interpretability, dsLasso captured a signature comprising 38 genes but could not distinguish shared and cohort-specific effects. Mean-regularized dsMTL identified a signature with 10 genes shared among all cohorts, with 163 genes shared by two cohorts, as well as three cohort-specific signatures comprising 1532 genes.

Efficiency of unsupervised dsMTL
Cohort and server information is shown in Supplementary Table 4. Training a dsMTL_iNMF model with 5 random initializations took 34.9 minutes (1,003 network accesses; ~7 min per initialization). The factorization rank k=4 was selected as the optimal parameter. In Supplementary Figure 1, the objective curve illustrates that the training time was sufficient for model convergence. In this analysis, a shared signature comprising 473 genes between schizophrenia (SCZ) and bipolar disorder (BIP) was identified, along with two disease-specific signatures containing 37 genes for SCZ and 152 genes for BIP, respectively.

Discussion
Here we present dsMTL, a secure, federated multi-task learning package for the programming language R, building on DataSHIELD as an ecosystem for privacy-preserving and distributed analysis. Multi-task learning allows the investigation of research questions that are difficult to address using conventional ML, such as the identification of heterogeneous, albeit related, signatures across datasets. The implementation of a privacy-preserving framework for the distributed application of MTL is an essential requirement for the large-scale adoption of MTL. Using such a distributed server setup, we demonstrate the applicability and utility of dsMTL to identify biomarker signatures in different settings. For applications where the target biomarker signatures are different, but relate to an overlapping set of features (explored here as the 'heterogeneous' case), conventional machine learning would not be a meaningful algorithm choice. We show that MTL is able to identify the target signatures with high confidence and may thus be a reasonable choice for a diverse set of interesting analyses.
As mentioned above, a particularly noteworthy application is comorbidity modeling, where the target signatures index the shared (although potentially heterogeneously manifested) biology of multiple, clinically comorbid conditions. Such analyses could potentially be a powerful, machine learning-based extension of comorbidity modeling approaches based on univariate statistics that have already been very useful for characterizing the shared biology of comorbid illness [17]. We show that unsupervised MTL can disentangle the shared from cohort-specific effects, demonstrating its potential utility for comorbidity analysis. Other applications for this method include the analysis of biological patterns shared across clinical symptom domains, between clinical and demographic characteristics, or with digital measures, such as ecological momentary assessments.

The use of dsMTL follows the concept of the so-called "freely composing script" in the DataSHIELD ecosystem. It organizes a given dsMTL workflow as a free composition of dsMTL, DataSHIELD, and local R commands (e.g. R base functions, user-defined functions and CRAN packages) into a script, such that the geo-distribution of datasets and the federated computation are transparent to users. This concept is similar to that of the "freely composing apps" used in a recently presented federated ML application [18], which allows flexible scheduling of functions in the form of apps and improves the flexibility of federated data analysis for users. In addition to dsMTL, other packages in the DataSHIELD ecosystem exist for e.g. "big data" storage and management [19], various statistical tests [7,19] and deep learning [19,20].

Interesting future developments of the dsMTL approach could include the implementation of asynchronous communication, which provides a probabilistically approximate solution but faster convergence [21,22].
Furthermore, integration of other popular systems for ML, such as TensorFlow [23], for which interfaces with the R language already exist, would provide valuable additions to the DataSHIELD system. Finally, a noteworthy consideration is the architecture underlying the distributed data infrastructure. DataSHIELD builds on a centralized ("client-server") architecture and each data provider needs to install a well-configured data warehouse. Such infrastructure is suitable for long-term collaboration scenarios and large consortia projects that conduct a broad spectrum of complex analyses requiring high flexibility. However, in other scenarios that require more temporary and easy-to-compute collaboration setups, a server-free or decentralized architecture [24] might be more suitable, because the cost of participation for data providers is low.

In conclusion, the dsMTL library for the programming language R provides an easy-to-use framework for privacy-preserving, federated analysis of geographically distributed datasets. Due to its ability to disentangle shared and cohort-specific effects across these datasets, dsMTL has numerous interesting application areas, including comorbidity modeling and translational research focused on the simultaneous prediction of different outcomes across datasets.

Methods
In dsMTL, two approaches for sharing information across cohorts are included: 1) shared parameters and 2) cross-task regularization, which lead to slightly different distributed computations. The shared parameters are estimated using all cohorts. For cross-task regularization, the cohort-specific parameters are estimated using only the local data, and then tuned by considering parameters from other cohorts.
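
To make the shared-parameter approach more tangible, the following Python sketch shows one common way such an estimate can be computed without pooling data: each server returns only the gradient of its local loss, and the client sums the gradients and applies a Lasso-type soft-thresholding step. This is a generic federated proximal-gradient scheme under simplifying assumptions (squared loss, fixed step size, servers simulated as in-memory lists) and is not the dsMTL implementation; all function names and data values are hypothetical.

```python
def local_gradient(X, y, w):
    """Server side: gradient of 0.5 * sum((Xw - y)^2) on the local cohort.

    Only this summary (and the sample count) is returned to the client.
    """
    grad = [0.0] * len(w)
    for row, target in zip(X, y):
        residual = sum(x * c for x, c in zip(row, w)) - target
        for j, x in enumerate(row):
            grad[j] += residual * x
    return grad, len(y)


def soft_threshold(value, t):
    """Proximal operator of t * |value| (the Lasso shrinkage step)."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def federated_lasso(servers, n_features, lam, step, n_iter=500):
    """Client side: aggregate local gradients, then shrink the shared w."""
    w = [0.0] * n_features
    for _ in range(n_iter):
        # One round of network communication per iteration.
        replies = [local_gradient(X, y, w) for X, y in servers]
        n_total = sum(n for _, n in replies)
        grad = [sum(g[j] for g, _ in replies) / n_total
                for j in range(n_features)]
        w = [soft_threshold(w[j] - step * grad[j], step * lam)
             for j in range(n_features)]
    return w


# Two simulated servers; the outcome depends only on the first feature.
servers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 0.0, 2.0]),
    ([[2.0, 1.0], [1.0, 2.0]], [4.0, 2.0]),
]
w_hat = federated_lasso(servers, n_features=2, lam=0.1, step=0.4)
```

In this toy run the second coefficient is shrunk to zero and the first is slightly below its true value of 2 because of the penalty. For cross-task regularization, the analogous scheme would keep one coefficient vector per cohort and apply a joint shrinkage step across cohorts instead.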

Efficiency
Most dsMTL methods aim at training an entire regularization tree. The determination of the λ sequence controls the tree's growth and is essential for computational speed. The λ sequence should be accurately scaled to both capture the highest posterior and avoid overwhelming computations. Inspired by a previous study [25], we estimate the largest and smallest λ from the data by characterizing the optima of the objective using the first-order optimality condition and then interpolate the entire λ sequence on a log scale (see Supplementary Methods for more details). In addition, several options are provided to improve the speed of the algorithms by decreasing the precision of the results: 1) the number of digits of the parameters for transmission can be specified to reduce the network latency; 2) several termination rules are provided, some of which are relaxed; 3) the depth of the regularization tree can be shortened. More details can be found in the Supplementary Methods.

Besides the efficiency of the federated ML/MTL methodology, the import/export of "big data" cohorts is also crucial for computational efficiency: e.g. uncompressed GWAS data requires tens of gigabytes, leading to time-consuming data import. dsMTL was designed to support a wide variety of data types. For this, the architecture package resourcer [19], developed by the DataSHIELD community, was incorporated to facilitate the efficient import and export of large-scale datasets in compressed formats. For example, in DataSHIELD, GWAS data in the PLINK file formats can be read and processed using the software PLINK [26] as the backend [19].

Security
dsMTL was developed based on DataSHIELD [8], which provides comprehensive security mechanisms not specific to machine learning applications.
For example, 1) DataSHIELD requires the data analysis to occur only behind the firewall; 2) each server is only allowed to communicate with a set of clients with fixed IP addresses; 3) the network communication is protected by the SSL protocol; 4) an R parser [8] implemented on the server rejects calls to unwanted functions; and 5) the so-called 'disclosure control' [8] on the server ensures that the returned response does not contain any disclosive information. In addition, several permissions can be set by the data providers to fully control the usage of their data. These permissions describe the degree of accessibility of data and functions on the server, i.e. "which users can perform what actions on what data". In an extremely secure example, a user could be permitted to check the summary of a given dataset but be unable to perform any other actions because no functions were granted. With these settings, DataSHIELD allows customizing the security protection strategies according to the specific requirements of the applications. For statistical and machine learning analyses, DataSHIELD assumes that summary statistics are safe to share.

dsMTL inherits all these security mechanisms. In addition, we considered potential ML-specific privacy leaks, such as membership inference attacks [27] and model inversion attacks [28]. Inversion attacks aim at extracting individual observation-level information from the models. Membership inference attempts to decide, using the model, whether an individual was included in a given training set. All these techniques require a complete model for inference. Since multi-task learning returns multiple matrices, returning an incomplete model could be one strategy against these attacks. For example, dsMTL_iNMF only returns the homogeneous matrix (H), whereas the cohort-specific component matrices never leave the server.
For example, in a two-server scenario, one (H) out of five output matrices is transmitted between the client and the servers. With such an incomplete model, inverse construction of the raw data matrix becomes difficult, and the risk of an inversion attack and membership inference is reduced. For most biomedical analyses, the H matrix is sufficient for subsequent studies. In addition, if the analyst was authorized to access the raw data of the server, the so-called "data key mechanism" (see supplement) would allow the analyst to retrieve all component matrices. For supervised multi-task learning methods in dsMTL, all models have to be aggregated within the clients, and thus we suggest that data providers enable the option on the server that rejects a returned coefficient vector whose number of parameters exceeds the number of subjects. In this way, the model is not saturated and is more robust to an inversion attack.
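
The suggested server-side rule for supervised models can be pictured as a simple release check. The Python sketch below is a hypothetical illustration and not the actual DataSHIELD or dsMTL option: the function name `release_coefficients` is invented, and counting only non-zero coefficients as "parameters" is an assumption made here.

```python
def release_coefficients(coefficients, n_subjects):
    """Return the coefficient vector only if the model is not saturated.

    Hypothetical server-side check. Assumption: the number of parameters
    is taken to be the number of non-zero coefficients.
    """
    n_params = sum(1 for c in coefficients if c != 0)
    if n_params > n_subjects:
        raise PermissionError(
            f"model has {n_params} parameters but only "
            f"{n_subjects} subjects; coefficients are not returned"
        )
    return list(coefficients)


# A sparse model trained on three subjects passes the check.
released = release_coefficients([0.5, 0.0, 0.0, -1.2], n_subjects=3)
```

A model with more non-zero coefficients than subjects would raise an error on the server instead of being sent to the client.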

Proof of concept with simulation and actual data
Two case studies and speed tests were conducted to demonstrate the suitability of dsMTL methods for analyzing heterogeneous cohorts, compared to federated ML methods and ensembles of local models, regarding prediction performance, interpretability and computational speed. An overview of methodological aspects related to the case studies is detailed below. For an extensive methodological description, please see the Supplementary Methods.

Case study 1. In this case study, the heterogeneous cohorts were generated with the same set of outcome-associated genes. These, however, showed different directionality of their respective associations with the outcome. A three-server scenario was simulated. 150 out of 500 features with random signs across cohorts were simulated. Seven tests were created for simulating different n/p (sample size / gene number) ratios. The n/p ratio was {1.2, 1, 0.9, 0.6, 0.5, 0.3, 0.15} with the number of subjects {600, 500, 450, 300, 250, 150, 75} for each test. 500 genes were created for each server. The test sample consisted of 200 subjects for each server. Data were generated as follows:

Given gene number $p = 500$, the models of the three cohorts were $\{w^{(1)}, w^{(2)}, w^{(3)}\}$, where $w^{(\cdot)} \in \mathbb{R}^{p \times 1}$. A shared signature comprising 150 genes was generated for each $w^{(\cdot)}$ but with random signs:

$$w^{(\cdot)}_j = \begin{cases} 2 \times (\alpha - 0.5) \times \mathcal{N}(1, 0.1) & 1 \le j \le 150 \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha \sim \mathrm{Bernoulli}\left(\tfrac{1}{2}\right).$$

The expression values of each subject across cohorts were generated as $x \in \mathbb{R}^{1 \times p}$, where $x_j \sim \mathcal{N}(0, 1)$. The numeric outcome (e.g. symptom severity) $y = x w^{(i)}$ in cohort $i$ was standardized to follow a normal distribution $\mathcal{N}(0, 1)$; then model-irrelevant noise with 50% of the variance of the true signal was added: $y = y + \mathcal{N}(0, 0.5)$.

dsMTL_L21 and dsLasso were trained as the federated learning system, and the hyper-parameter was selected using 10-fold in-cohort cross-validation.
For glmnet, the ensemble technique was applied only to the gene selection, due to the consistent gene set of the signatures. The mean squared error (MSE) was used as the measure of prediction performance. To account for the sampling variance, we repeated each analysis 100 times.
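
The data-generating procedure of case study 1 can be sketched in a few lines of Python using only the standard library. This is an illustrative reading of the description above, not the authors' simulation script: the second argument of the normal distributions is interpreted as a variance, the Bernoulli(1/2) sign variable is called `alpha`, and the function name, the seed and the choice of the smallest setting (n = 75) are assumptions.

```python
import random
import statistics


def simulate_cohort(n_subjects, p=500, n_signal=150, rng=random):
    """Simulate one cohort: expression matrix X, outcome y, true model w."""
    # First n_signal genes: magnitude ~ N(1, variance 0.1), random sign.
    w = [0.0] * p
    for j in range(n_signal):
        alpha = 1 if rng.random() < 0.5 else 0  # Bernoulli(1/2)
        w[j] = 2 * (alpha - 0.5) * rng.gauss(1.0, 0.1 ** 0.5)
    # Expression values: independent standard normal entries.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n_subjects)]
    signal = [sum(x * c for x, c in zip(row, w)) for row in X]
    # Standardize the signal, then add noise with variance 0.5.
    mean = statistics.mean(signal)
    sd = statistics.pstdev(signal)
    y = [(s - mean) / sd + rng.gauss(0.0, 0.5 ** 0.5) for s in signal]
    return X, y, w


# Three servers, smallest n/p setting (75 subjects, 500 genes).
rng = random.Random(2024)
cohorts = [simulate_cohort(75, rng=rng) for _ in range(3)]
```

Because each cohort draws its own signs, the three coefficient vectors share the same 150 genes but differ in the direction of effect, which corresponds to the 'heterogeneous' setting.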