Penalized regression with multiple sources of prior effects

Abstract Motivation In many high-dimensional prediction or classification tasks, complementary data on the features are available, e.g. prior biological knowledge on (epi)genetic markers. Here we consider tasks with numerical prior information that provide an insight into the importance (weight) and the direction (sign) of the feature effects, e.g. regression coefficients from previous studies. Results We propose an approach for integrating multiple sources of such prior information into penalized regression. If suitable co-data are available, this improves the predictive performance, as shown by simulation and application. Availability and implementation The proposed method is implemented in the R package transreg (https://github.com/lcsb-bds/transreg, https://cran.r-project.org/package=transreg).


Background
For many biomedical prediction or classification studies, there is a previous study with a similar target and a similar high-dimensional feature space, e.g. hundreds of microRNAs (miRNAs), thousands of genes, or millions of single-nucleotide polymorphisms (SNPs). Given a trained model from a previous study, we could use it to obtain predicted values or predicted probabilities for the study of interest, but these predictions are only reliable if the two studies have the same target, the same features, and the same population. However, we expect the feature-target effects from two studies to be strongly correlated in more situations: slightly different targets (e.g. disease status versus disease stage), slightly different features (e.g. imperfectly overlapping feature space, different measurement technique), slightly different populations (e.g. hospitalized versus non-hospitalized patients), or even different modelling approaches (e.g. simple regression versus multiple regression). As it is challenging to estimate feature-target effects in high-dimensional settings, it might be advantageous to use results from previous studies as prior information for the study of interest.
Consider two prediction or classification problems, each one with a target vector and a feature matrix (samples in the rows, features in the columns). Suppose that both feature matrices cover the same features (each column in the first matrix corresponds to a column in the second matrix). In two special cases, the two problems reduce to a single problem: (i) if both problems have the same target and concern samples from the same population, they are in essence one 'single-target' problem (combine target vectors and feature matrices by rows, respectively), potentially with batch effects; and (ii) if both problems concern the same samples, they are in essence one 'multi-target' problem (combine target vectors by columns, feature matrices are the same). In other cases, however, we might be in a transfer learning setting (Table 1).
In such settings, with two or more regression problems with related targets and matched features, it might be possible to transfer information from one problem to another. Transfer learning, in contrast to multi-target regression and seemingly unrelated regressions, addresses related regression problems not for the same but for different samples. If the regression problems are sufficiently related to each other, we expect their regression coefficients to be correlated (positively or negatively). When fitting the regression model of interest, we could therefore account for the estimated regression coefficients from the other model. Transferring information on the importance and the direction of the feature effects, we could potentially increase the predictive performance. Jiang et al. (2016) proposed the prior lasso to account for prior information in high-dimensional predictive modelling. Their method involves a preprocessing step and a weighting step. In the preprocessing step, the prior information is used to predict the target from the features. They present a solution for one set of prior effects from a closely related study (multiplying the feature matrix by the prior effects), but extensions to multiple sets of prior effects or loosely related studies may be feasible. Let y represent the target and let ŷ_prior represent the fitted values based on the prior information. In the weighting step, they maximize the penalized combined likelihood L(x, y; β) + η L(x, ŷ_prior; β) − ρ(λ, β) with respect to the coefficients β, where η ≥ 0 (balance) and λ ≥ 0 (regularization). If the balancing hyperparameter η is larger than zero, the prior predictions ŷ_prior influence the estimation of the parameters β. Dhruba (2021) proposed a transfer learning method based on distribution mapping. Even if features or targets follow different distributions in two datasets, it is possible to build a predictive model using the first dataset and make predictions for the second dataset. Requiring matched features
and targets in the source dataset and unmatched features and targets in the target dataset, their method transfers (i) features from the target to the source domain and (ii) predictions from the source to the target domain. In contrast, we consider transfer learning settings with matched features and targets in the target dataset. Tian and Feng (2022) proposed and implemented transfer learning for ridge and lasso regression. Their transfer learning algorithm involves two steps: (i) estimating common coefficients (w) for the target dataset and the transferable source datasets and (ii) estimating the deviations (δ) from the common coefficients to the target coefficients. Both steps together lead to the estimated target coefficients (β̂ = ŵ + δ̂). Before applying their transfer learning algorithm, Tian and Feng (2022) apply a transferable source detection algorithm to exclude source datasets that are too different from the target dataset. This prevents non-transferable sources from rendering the common coefficients misleading for the target dataset ('negative transfer'). In the case of lasso regularization in the two steps, there is sparsity in the common estimates as well as in the deviations from the common estimates to the target estimates (and thereby also in the target estimates).
The method from Tian and Feng (2022) requires not only the target dataset but also the source dataset(s). However, data protection regulations or restrictive data sharing policies might prevent researchers from accessing a source dataset, or the available storage or processing capacity might be insufficient for analysing massive source datasets. Although federated transfer learning (Liu et al. 2020) enables the joint analysis of related datasets without sharing them, it is often not practically feasible for biomedical researchers to set up corresponding collaborative analyses with multiple data holders. There is therefore a need for transfer learning methods that require neither direct nor indirect access to the source data but only to the complementary data (co-data) derived from the source data. Such methods allow us to exploit publicly available summary statistics from external studies, e.g. P-values and effect sizes from a genome-wide association study (GWAS), to increase the predictive performance in the study of interest.
We propose a two-step transfer learning method, modelling with and without co-data in the first step and combining different models in the second step. Unless the source and target datasets are very similar, the coefficients from the source dataset(s) will not fit well to the target dataset. We therefore propose to calibrate these coefficients, preserving their signs and their order, so that they can be transferred from the source dataset(s) to the target dataset. Additionally, we also estimate the coefficients directly from the target dataset, ignoring the co-data. The calibrated coefficients from the source dataset(s) as well as the estimated coefficients from the target dataset allow us to predict the outcome from the features. Finally, we combine the linear predictors from the models with and without co-data and calculate either predicted values (linear regression) or predicted probabilities (logistic regression). Kawaguchi et al. (2022) proposed a related transfer learning method. This method penalizes (i) the differences between the coefficients and weighted sums of the prior effects (one difference for each feature) and (ii) the weights for these weighted sums (one weight for each set of prior effects), possibly with two different penalties (e.g. ridge and lasso). It not only allows for high-dimensional data (i.e. more features than samples) but also for high-dimensional co-data (i.e. more sets of prior effects than samples). When the unknown feature effects are nonlinearly related to an important source of prior effects, however, this method might leave room for improvement.
In a related transfer learning setting, prior information is only available on the importance but not on the direction of the feature effects, i.e. with complementary data consisting of prior weights rather than prior effects. In the generalized linear model framework, the weighted lasso (Bergersen et al. 2011), the feature-weighted elastic net (Tay et al. 2023), and penalized regression with differential shrinkage (Zeng et al. 2021) account for prior weights in the penalty function, through feature-specific penalty factors or feature-specific regularization parameters. Adaptive group-regularized ridge regression (van de Wiel et al. 2016) is not only applicable to categorical co-data but also to numerical co-data (prior weights), by means of creating groups of features from numerical co-data and forcing the group-penalties to be monotonically decreasing.

Table 1. Single-target learning (left): same targets, same features*, different samples (from one population). Multi-target learning (centre): different targets, same features*, same samples. Transfer learning (right): same or different targets, matched features*, different samples (from one or two populations). *Partially overlapping feature spaces are also possible. See, for example, Rauschenberger and Glaab (2021) for multi-target learning or Section 2.8 "Extension" for transfer learning.

An extension from van Nee et al.
(2021) makes this approach even more suitable for numerical co-data. For single sources of co-data, it might be possible to extend these methods to prior information on the importance as well as the direction of feature effects by imposing sign constraints on the coefficients. Prior weights have also been exploited beyond regression analysis: co-data moderated random forests (te Beest et al. 2017), for example, adapt the sampling probabilities of the features to the prior weights.

Model
Suppose one target and p features are available for n samples, with many more features than samples (p ≫ n). We index the samples by i in {1, …, n} and the features by j in {1, …, p}. Our aim is to estimate the generalized linear model h(E[y_i]) = β_0 + β_1 x_i1 + … + β_p x_ip. For any sample i, the model expresses the expected value of its target (y_i) as a function of its features (x_i1, …, x_ip). The link function h(·) depends on the family of distributions for the target (Gaussian: identity, binomial: logit, Poisson: log).
In the linear predictor, β_0 represents the unknown intercept, and β_j represents the unknown slope of feature j (i.e. the effect of the feature on the linear predictor of the target). Given the estimated intercept β̂_0 and the estimated slopes {β̂_1, …, β̂_p}, we could predict the target of previously unseen samples: ŷ_i = h⁻¹(β̂_0 + β̂_1 x_i1 + … + β̂_p x_ip).
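The prediction step above can be sketched as follows for the three families named in the text; `predict_glm` and `INVERSE_LINK` are hypothetical names for illustration, not the transreg implementation:

```python
import numpy as np

# Inverse link functions h^{-1} for the three families named above.
INVERSE_LINK = {
    "gaussian": lambda eta: eta,                         # identity link
    "binomial": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # logit link
    "poisson": np.exp,                                   # log link
}

def predict_glm(X, beta0, beta, family="gaussian"):
    """Predicted values (Gaussian), probabilities (binomial),
    or rates (Poisson) for previously unseen samples."""
    return INVERSE_LINK[family](beta0 + X @ beta)

X_new = np.array([[1.0, 0.0], [0.0, 2.0]])
p_hat = predict_glm(X_new, beta0=0.0, beta=np.array([1.0, -1.0]), family="binomial")
```

For the binomial family, the returned values are predicted probabilities between 0 and 1.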

Co-data
Suppose m sources of co-data are available, indexed by k in {1, …, m}, with many fewer sources than samples (m ≪ n).
Let z_jk indicate the prior effect from source k for feature j.
Our method is designed for quantitative co-data that provide an insight into the importance (absolute value) and the direction (sign) of the feature effects. Each set of prior effects (−1 ≤ z_k ≤ 1) is assumed to be positively correlated with the true coefficients (cor(z_k, β) > 0). For any source of co-data, the prior effects may be re-scaled (not re-centred), for example to the interval from −1 to +1. In other words, the proposed method is invariant under multiplication of the prior effects by a positive scalar (z → c × z, where c > 0). We explain in the next section why this is important.
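A minimal sketch of such re-scaling (the function name is ours): dividing by the largest absolute value maps the prior effects into the interval from −1 to +1 without re-centring them.

```python
import numpy as np

def rescale_prior_effects(z):
    """Map prior effects into [-1, +1] by dividing by the largest
    absolute value (re-scaling without re-centring)."""
    scale = np.max(np.abs(z))
    return z / scale if scale > 0 else z

z = np.array([-4.0, 1.0, 2.0])
z_rescaled = rescale_prior_effects(z)  # [-1.0, 0.25, 0.5]
```

Because the method is invariant under multiplication by a positive scalar, `rescale_prior_effects(c * z)` equals `rescale_prior_effects(z)` for any c > 0.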
It might seem trivial to also allow for co-data that only provide an insight into the importance but not the direction of the feature effects (i.e. prior weights instead of prior effects). Each set of prior weights (0 ≤ z_k ≤ 1) is assumed to be positively correlated with the true absolute coefficients (cor(z_k, |β|) > 0). To obtain prior effects, one might want to assign the signs of the Spearman correlation coefficients between the target and the features to the prior weights. However, marginal effects and conditional effects can have opposite signs. If we wanted to extend our approach to prior weights, we would have to discover the signs inside the calibration procedure (see below), which would be related to high-dimensional regression with binary coefficients (Gamarnik and Zadik 2017).
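The sign-assignment heuristic mentioned above can be sketched as follows (the helper name is ours); the signs of the marginal Spearman correlations are attached to the prior weights, with the caveat, noted above, that marginal and conditional signs can disagree:

```python
import numpy as np
from scipy.stats import spearmanr

def weights_to_effects(X, y, w):
    """Turn prior weights (importance only) into signed prior effects
    by borrowing the sign of the marginal Spearman correlation
    between each feature and the target."""
    signs = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        signs[j] = np.sign(rho)
    return signs * w

rng = np.random.default_rng(1)
x0 = rng.standard_normal(100)
x1 = rng.standard_normal(100)
y = 2.0 * x0 - 3.0 * x1 + 0.1 * rng.standard_normal(100)
effects = weights_to_effects(np.column_stack([x0, x1]), y, np.array([0.5, 0.8]))
```

Here the second feature has a negative marginal association, so its prior weight 0.8 becomes the prior effect −0.8.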

Base-learners with co-data
Suppose we are in a transfer learning setting with two prediction or classification problems. For simplicity, we assume that the features do not differ in scale between the two problems. For illustration, we consider two artificial situations (where we would not use transfer learning in practice): (i) if both problems concern the same target on the same scale and the samples come from the same population, we could use the estimated regression coefficients from one problem to make predictions for the other problem; and (ii) if the two problems concern the same target on different scales, we could also recycle the estimated regression coefficients, but we would have to adjust for the different scales.
When transferring estimated regression coefficients from one problem to another problem, it might not only be necessary to change their scale but it might also be beneficial to change their shape. For example, it might be that for one problem weak and strong effects matter, while for the other problem only strong effects matter. We should therefore also be able to make differences between small coefficients more important or less important than those between large coefficients. We propose two calibration methods, namely exponential and isotonic calibration, to adapt the prior information to the data. For each source of co-data k, both calibration methods estimate the model h(E[y_i]) = α_k + Σ_{j=1}^p γ_jk x_ij, where the calibrated prior effects {γ̂_1k, …, γ̂_pk} depend on the initial prior effects {z_1k, …, z_pk}. The difference between exponential and isotonic calibration is how the former depend on the latter.
• Exponential calibration: Let γ_jk = θ_k sign(z_jk)|z_jk|^{τ_k}, for j in {1, …, p}, where the factor θ_k and the exponent τ_k are non-negative real numbers (θ_k ≥ 0 and τ_k ≥ 0). We first fit one simple non-negative regression for different values of τ_k (i.e. estimate α_k and θ_k given τ_k), and then optimize τ_k. Once θ_k and τ_k have been estimated, the initial prior effects z_jk determine the final prior effects γ̂_jk = θ̂_k sign(z_jk)|z_jk|^{τ̂_k}, for all j in {1, …, p}. The estimated factor θ̂_k and the estimated exponent τ̂_k allow the model to change the scale and the shape of the prior effects. For example, θ̂_k = 0 sets them to zero, |θ̂_k| < 1 makes them smaller, |θ̂_k| > 1 makes them larger, τ̂_k = 0 sets them to the same value, τ̂_k < 1 makes (absolutely) large ones more similar, and τ̂_k > 1 makes (absolutely) small ones more similar. If one or more sets of prior effects might be negatively associated with the true coefficients, we could remove the non-negativity constraints from the simple regressions (allowing θ̂_k < 0 to invert the signs of the prior effects).
• Isotonic calibration: We estimate {γ_1k, …, γ_pk} under the constraint that the signs of the initial prior effects z_jk determine the signs of the final prior effects γ̂_jk (i.e. γ̂_jk = 0 if z_jk = 0, γ̂_jk ≥ 0 if z_jk > 0, γ̂_jk ≤ 0 if z_jk < 0) and under the constraint that the order of the initial prior effects determines the order of the final prior effects (i.e. γ̂_jk ≥ γ̂_lk if z_jk ≥ z_lk, and γ̂_jk ≤ γ̂_lk if z_jk ≤ z_lk), for all j and l in {1, …, p}. If one or more sets of prior effects might be negatively associated with the true coefficients, we could fit each model with these constraints and the inverted constraints, and then select the better fit.
• To make optimization more efficient, we rewrite the sign- and order-constrained problem as a sign-constrained problem (see Table 2). For each source of co-data, we order the columns of the feature matrix by increasing values of the prior effects. Suppose the first q columns correspond to negative prior effects and the last p − q columns correspond to non-negative prior effects. We take the cumulative sum of the feature columns from left to right for the former (columns 1 to q) and from right to left for the latter (columns p to q + 1). We then estimate the coefficients on the left under non-positivity constraints, and those on the right under non-negativity constraints. Formally, the model equals h(E[y_i]) = α_k + Σ_{j=1}^p δ_jk w_ij, where w_ij = Σ_{l=1}^j x_i(l) and δ_jk ≤ 0 for j in {1, …, q}, and w_ij = Σ_{l=j}^p x_i(l) and δ_jk ≥ 0 for j in {q + 1, …, p}, with the subscript within brackets indicating the order of the prior effects. The linear predictor of the sign-constrained model, i.e. α_k + Σ_{j=1}^p δ_jk w_ij, is equivalent to the linear predictor of the order-constrained model, i.e. α_k + Σ_{j=1}^p γ_(j)k x_i(j), because each ordered feature x_i(j) appears in exactly the cumulative columns between its own position and the sign change, so that its total coefficient equals Σ_{l=j}^q δ_lk (if j ≤ q) or Σ_{l=q+1}^j δ_lk (if j > q). After estimating the coefficients of the sign-constrained model by maximum likelihood, we therefore estimate those of the order-constrained model by γ̂_(j)k = Σ_{l=j}^q δ̂_lk for j in {1, …, q} and γ̂_(j)k = Σ_{l=q+1}^j δ̂_lk for j in {q + 1, …, p}. While exponential calibration involves three unknown parameters, namely the intercept α_k, the factor θ_k, and the exponent τ_k, isotonic calibration involves 1 + p unknown parameters, namely the intercept α_k and the slopes γ_k = {γ_1k, …, γ_pk}, for each set of co-data. Figure 1 shows the difference between exponential and isotonic calibration in several empirically assessed scenarios.

Table 2. The aim is to (i) estimate the effects of the features under sign and order constraints determined by q negative (top) and p − q non-negative (bottom) prior effects, i.e. estimate γ_1k, …, γ_pk for x_1, …, x_p under γ̂_jk = 0 if z_jk = 0, γ̂_jk ≥ 0 if z_jk > 0, γ̂_jk ≤ 0 if z_jk < 0, γ̂_jk ≥ γ̂_lk if z_jk ≥ z_lk, and γ̂_jk ≤ γ̂_lk if z_jk ≤ z_lk. This can be solved by (ii) estimating the effects of the features ordered by the co-data under sign and order constraints, i.e. estimate γ_(1)k, …, γ_(p)k for x_(1), …, x_(p) under γ̂_(j)k ≤ 0 if j ≤ q, γ̂_(j)k ≥ 0 if j > q, and γ̂_(1)k ≤ … ≤ γ̂_(p)k. This in turn can be solved by (iii) estimating the effects of the combined features under sign constraints, i.e. estimate δ_1, …, δ_p for w_1, …, w_p under δ̂_j ≤ 0 if j ≤ q and δ̂_j ≥ 0 if j > q. Our algorithm receives the original features and the prior effects (i), orders the features by the prior effects (ii), combines the features (iii), estimates the effects of the combined features (iii)*, calculates the estimated effects of the ordered features (ii)*, and returns the estimated effects of the original features (i)*. *A row with an asterisk contains the estimates for the features in the row with the same number but without an asterisk.
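Both calibration steps can be sketched for the Gaussian family as follows. This is a simplified illustration, not the transreg implementation: exponential calibration uses a coarse grid for the exponent, isotonic calibration uses the cumulative-sum reformulation from Table 2 with non-negative least squares, the intercept is handled by centring, and exact zeros in the prior effects are not treated specially. All function names are ours:

```python
import numpy as np
from scipy.optimize import nnls

def exponential_calibration(X, y, z, taus=np.linspace(0.1, 3.0, 30)):
    """gamma_j = theta * sign(z_j) * |z_j|**tau with theta >= 0:
    fit theta by simple non-negative regression for each tau on a
    grid, keep the tau with the smallest residual sum of squares."""
    best = (np.inf, 0.0, 1.0)
    yc = y - y.mean()
    for tau in taus:
        s = X @ (np.sign(z) * np.abs(z) ** tau)  # prior linear predictor
        sc = s - s.mean()
        theta = max(0.0, (sc @ yc) / (sc @ sc))  # non-negative slope
        rss = np.sum((yc - theta * sc) ** 2)
        if rss < best[0]:
            best = (rss, theta, tau)
    _, theta, tau = best
    return theta * np.sign(z) * np.abs(z) ** tau

def isotonic_calibration(X, y, z):
    """Sign- and order-constrained calibration via the cumulative-sum
    reformulation: order features by z, cumulate columns towards the
    sign change, solve a sign-constrained least-squares problem, and
    recover the ordered coefficients by telescoping sums."""
    p = X.shape[1]
    order = np.argsort(z)                  # ascending prior effects
    Xo = X[:, order]
    q = int(np.sum(z < 0))                 # number of negative prior effects
    W = np.empty_like(Xo)
    W[:, :q] = np.cumsum(Xo[:, :q], axis=1)                    # left to right
    W[:, q:] = np.cumsum(Xo[:, q:][:, ::-1], axis=1)[:, ::-1]  # right to left
    signs = np.where(np.arange(p) < q, -1.0, 1.0)  # delta <= 0 left, >= 0 right
    Ws = W * signs                         # flip columns so nnls applies
    delta_pos, _ = nnls(Ws - Ws.mean(axis=0), y - y.mean())
    delta = signs * delta_pos
    gamma_ord = np.empty(p)
    gamma_ord[:q] = np.cumsum(delta[:q][::-1])[::-1]  # telescoping sums (left)
    gamma_ord[q:] = np.cumsum(delta[q:])              # telescoping sums (right)
    gamma = np.empty(p)
    gamma[order] = gamma_ord               # back to original feature order
    return gamma

rng = np.random.default_rng(0)
n, p = 100, 8
z = np.linspace(-1.0, 1.0, p)              # even count, no exact zero
X = rng.standard_normal((n, p))
y = X @ (2.0 * z) + 0.5 * rng.standard_normal(n)
g_exp = exponential_calibration(X, y, z)
g_iso = isotonic_calibration(X, y, z)
```

By construction, the isotonic estimates satisfy the sign and order constraints, and the exponential estimates share the signs of the initial prior effects.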
After calibration, we pre-assess the utility of each set of co-data. To do this, we calculate the residuals (depending on the family of distributions) between the fitted and the observed targets. We suggest retaining a set of co-data only if the residuals are significantly smaller than those of the intercept-only model (one-sided Wilcoxon signed-rank test) at the nominal 5% level (P-value ≤ 0.05).
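For the Gaussian family, this filter can be sketched as follows (the helper name is ours): the paired one-sided Wilcoxon signed-rank test compares the absolute residuals of the calibrated model against those of the intercept-only model.

```python
import numpy as np
from scipy.stats import wilcoxon

def retain_codata(y, y_hat, alpha=0.05):
    """Keep a source of co-data only if its absolute residuals are
    significantly smaller than those of the intercept-only model
    (one-sided Wilcoxon signed-rank test at level alpha)."""
    res_model = np.abs(y - y_hat)
    res_null = np.abs(y - np.mean(y))      # intercept-only residuals
    _, p_value = wilcoxon(res_model, res_null, alternative="less")
    return p_value <= alpha

rng = np.random.default_rng(2)
y = 3.0 * rng.standard_normal(50)
informative = retain_codata(y, y + 0.1 * rng.standard_normal(50))
```

With near-perfect fitted values, as in this toy example, the set of co-data is retained.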

Base-learners without co-data
We also fit the model without any co-data. We estimate the coefficients by maximizing the penalized likelihood β̂ = argmax_β {L(x; β) − ρ(λ, β)}, where L(x; β) is the likelihood and ρ(λ, β) is the penalty. The likelihood depends on the family of distributions (Gaussian, binomial, Poisson), and the penalty can be the ridge (L2) or the lasso (L1) penalty. The penalty shrinks the squared (ridge) or absolute (lasso) slopes {β_1, …, β_p} towards zero (without penalizing the intercept β_0). We denote the estimated intercept by β̂_0 and the estimated slopes by {β̂_1, …, β̂_p}.

Cross-validation
We split the samples into 10 folds to perform 10-fold internal cross-validation.In each iteration, we fit the models to nine included folds and predict the target for the excluded fold.
Let the n × m matrix Ĥ^(0,cv) represent the feature-dependent part of the cross-validated linear predictors from the models with co-data. Specifically, the entry in row i (sample) and column k (source of co-data) equals ĥ_ik^(0,cv) = Σ_{j=1}^p γ̂_jk^(−f(i)) x_ij, where the superscript −f(i) indicates that the (ignored) intercept α_k and the slopes γ_jk for j in {1, …, p} are estimated without using the fold of sample i, as in Rauschenberger and Glaab (2021).
The models without any co-data do not only have 1 + p unknown parameters, namely the intercept β_0 and the slopes {β_1, …, β_p}, but also the unknown hyperparameter λ. In each iteration, we fit this model for a decreasing sequence of 100 values for the regularization parameter λ, indexed by l in {1, …, 100}.

Figure 1. Final prior effects γ (y-axis) against initial prior effects z (x-axis), under exponential calibration (red points on continuous curve) and isotonic calibration (blue points on discontinuous curve). The thin black line corresponds to perfect calibration (γ = β). We simulated the feature matrix X from a standard Gaussian distribution (n = 200, p = 500) and the initial prior effects z from a trimmed standard Gaussian distribution (trimmed below the 1% and above the 99% quantile). We simulated the response vector y from Gaussian distributions with the means η and the variance Var(η), where η = Xβ. While exponential calibration performs slightly better in the first three scenarios (top), isotonic calibration performs much better in the last three scenarios (bottom).
Accordingly, let the n × 100 matrix Ĥ^(1,cv) represent the cross-validated linear predictors from the model without co-data. Specifically, the entry in row i (sample) and column l (regularization parameter) equals ĥ_il^(1,cv) = β̂_0^(−f(i),l) + Σ_{j=1}^p β̂_j^(−f(i),l) x_ij, where the superscripts −f(i) and l indicate that the intercept β_0 and the slopes {β_1, …, β_p} are estimated without using the fold of sample i and given the regularization parameter λ_l.
To optimize the predictive performance of the co-data independent model, we would select the λ that minimizes the cross-validated loss (λ_min). As we base our predictions not only on the co-data independent model but also on the co-data dependent model(s), λ_min might be too small. The reason is that the co-data might be informative to the extent that the co-data independent model requires more penalization. We could let the meta-learner select the optimal λ from the whole sequence, but this might render the inclusion and exclusion of co-data dependent models unstable. Our ad hoc solution is to include the optimal regularization parameter for the co-data independent model (λ_min) and a slightly larger one (λ_1se). The latter is given by the one-standard-error rule, which increases λ until the cross-validated loss equals its minimum plus one standard error.
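The selection of both regularization parameters can be sketched as follows (the helper name is ours; the rule matches glmnet's lambda.min and lambda.1se):

```python
import numpy as np

def lambda_min_1se(lambdas, cv_mean, cv_se):
    """Return lambda.min (minimal cross-validated loss) and lambda.1se
    (largest lambda whose mean loss stays within one standard error
    of the minimum)."""
    lambdas, cv_mean, cv_se = map(np.asarray, (lambdas, cv_mean, cv_se))
    i_min = int(np.argmin(cv_mean))
    threshold = cv_mean[i_min] + cv_se[i_min]
    ok = np.flatnonzero(cv_mean <= threshold)
    i_1se = ok[np.argmax(lambdas[ok])]     # largest qualifying lambda
    return lambdas[i_min], lambdas[i_1se]

lam_min, lam_1se = lambda_min_1se(
    lambdas=[1.0, 0.5, 0.25, 0.1],
    cv_mean=[3.0, 2.0, 1.5, 1.6],
    cv_se=[0.2, 0.2, 0.6, 0.2],
)
```

In this toy example, the minimum loss 1.5 is attained at λ = 0.25, and 2.0 is still within one standard error (1.5 + 0.6) of the minimum, so λ_1se = 0.5.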
We concatenate Ĥ^(0,cv) with the columns of Ĥ^(1,cv) that correspond to λ_min and λ_1se to obtain the n × (m + 2) matrix Ĥ^(cv). The first m columns correspond to the models with co-data, and the last two columns correspond to the model without co-data.
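Assembling such a matrix of cross-validated linear predictors can be sketched for the Gaussian family as follows. This is a simplified stand-in, with names of our own choosing: the co-data columns are calibrated by a single non-negative scaling factor instead of exponential or isotonic calibration, and two fixed lasso penalties stand for λ_min and λ_1se:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

def cv_linear_predictors(X, y, Z, lambdas=(0.05, 0.1), n_splits=10, seed=1):
    """n x (m + 2) matrix of cross-validated linear predictors:
    one column per co-data source (columns of Z), two columns for the
    co-data-free lasso (stand-ins for lambda.min and lambda.1se)."""
    n, m = X.shape[0], Z.shape[1]
    H = np.zeros((n, m + len(lambdas)))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        yc = y[train] - y[train].mean()
        for k in range(m):                  # co-data columns
            s = X[train] @ Z[:, k]
            sc = s - s.mean()
            theta = max(0.0, (sc @ yc) / (sc @ sc))
            H[test, k] = X[test] @ (theta * Z[:, k])
        for l, lam in enumerate(lambdas):   # co-data-free columns
            fit = Lasso(alpha=lam).fit(X[train], y[train])
            H[test, m + l] = X[test] @ fit.coef_  # feature-dependent part
    return H

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 15))
z = rng.standard_normal(15)
y = X @ z + rng.standard_normal(60)
H_cv = cv_linear_predictors(X, y, Z=z[:, None])
```

Every sample's entries are predicted from folds that exclude it, which is what makes these columns usable as inputs for the meta-learner.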

Meta-learner
We combine the base-learners with and without co-data by stacked generalization (Wolpert 1992), on the level of the linear predictors (Rauschenberger et al. 2021). In the meta-layer, we regress the target on the cross-validated linear predictors from the base-layer: h(E[y_i]) = ω_0 + Σ_{k=1}^{m+2} ω_k ĥ_ik^(cv). Leaving the intercept unrestricted (−∞ < ω_0 < +∞) but imposing the lower bound zero on the slopes (ω_1 ≥ 0, …, ω_{m+2} ≥ 0), we estimate these coefficients under lasso regularization. Due to the feature selection property of the lasso, a source of co-data can be excluded (ω̂_k = 0) or included (ω̂_k > 0), where k is in {1, …, m}. Similarly, the models without co-data can be excluded (ω̂_k = 0) or included (ω̂_k > 0), where k = m + 1 for the model with λ_min and k = m + 2 for the model with λ_1se. The estimated slopes function as weights for the co-data dependent models (ω̂_1, …, ω̂_m) and for the co-data independent models (ω̂_{m+1}, ω̂_{m+2}). Thus, we do not only select sources but also weight them according to their relevance.
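For the Gaussian family, the meta-learner can be sketched with scikit-learn's non-negative lasso (`positive=True`); the penalty level is fixed here rather than cross-validated, and the names are ours:

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_meta_learner(H_cv, y, lam=0.01):
    """Stacked generalization: regress the target on the cross-validated
    linear predictors with a free intercept, non-negative slopes
    (omega_k >= 0), and lasso regularization."""
    meta = Lasso(alpha=lam, positive=True).fit(H_cv, y)
    return meta.intercept_, meta.coef_

rng = np.random.default_rng(4)
h_good = rng.standard_normal(200)          # informative base-learner
h_bad = rng.standard_normal(200)           # uninformative base-learner
y = 2.0 * h_good + 0.3 * rng.standard_normal(200)
omega0, omega = fit_meta_learner(np.column_stack([h_good, h_bad]), y)
```

A base-learner is excluded exactly when its weight is shrunk to zero, as for the uninformative column in this toy example.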

Interpretation
The coefficients β̂_min and β̂_1se give insight into the feature-target effects estimated without co-data, with β̂_min,j and β̂_1se,j representing the effect of feature j, where j is in {1, …, p}.
The coefficients ω̂ give insight into the importance of the sources of co-data, with ω̂_k representing the importance of source k, where k is in {1, …, m}. For a previously unseen sample i, the predicted value is ŷ_i = h⁻¹(β̂*_0 + Σ_{j=1}^p β̂*_j x_ij), with β̂*_j = Σ_{k=1}^m ω̂_k γ̂_jk + ω̂_{m+1} β̂_min,j + ω̂_{m+2} β̂_1se,j. Thus, the estimated effect for a feature (β̂*_j) is a weighted sum of the estimated coefficients with co-data (γ̂_j1, …, γ̂_jm) and the estimated coefficients without co-data (β̂_min,j, β̂_1se,j).
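The weighted sum of coefficients can be computed with a short helper (the name is ours):

```python
import numpy as np

def combined_coefficients(Gamma, beta_min, beta_1se, omega):
    """Combined effect per feature: a weighted sum of the m calibrated
    prior-effect vectors (columns of the p x m matrix Gamma) and the
    two co-data-free coefficient vectors, with stacking weights omega
    of length m + 2."""
    m = Gamma.shape[1]
    return Gamma @ omega[:m] + omega[m] * beta_min + omega[m + 1] * beta_1se

Gamma = np.array([[1.0, 0.0], [0.0, 1.0]])
beta_star = combined_coefficients(
    Gamma,
    beta_min=np.array([1.0, 1.0]),
    beta_1se=np.array([0.0, 2.0]),
    omega=np.array([0.5, 0.0, 0.25, 0.25]),
)
```

In this toy example, the second source of co-data receives weight zero and therefore does not contribute to the combined effects.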
Sparse models (few non-zero coefficients) are often considered to be more interpretable than dense models (many non-zero coefficients). While the original coefficients are dense (Σ_{j=1}^p 1[β̂_j ≠ 0] = p) or sparse (Σ_{j=1}^p 1[β̂_j ≠ 0] ≪ p) depending on the choice between ridge and lasso regularization, the weights may contain some zeros due to significance filtering or lasso regularization (Σ_{k=1}^{m+2} 1[ω̂_k ≠ 0] ≤ m + 2). As soon as one set of dense prior effects is selected, however, the combined coefficients also become dense (Σ_{j=1}^p 1[β̂*_j ≠ 0] ≲ p). This means that the feature selection property of the lasso is not maintained. We should therefore choose between ridge and lasso regularization (i) to make the model without co-data more predictive or interpretable, or (ii) to make the model with co-data more predictive, but not (iii) to make the model with co-data more interpretable.

Extension
In some applications, prior information might be available and reliable for some features but missing or unreliable for other features. Note that a missing prior effect can be interpreted as an unreliable prior effect of zero. Although the base-learners with co-data might still be predictive, the meta-learner (weighted average of the base-learners with and without co-data) might not be more predictive than the base-learner without co-data. The reason is that the meta-learner assigns the same weight to all prior effects from one source, rather than more weight to available and reliable prior effects and less weight to missing or unreliable prior effects. We therefore propose an alternative approach for applications with partially informative sources of co-data.
In the following, we use the term 'meta-features' for the cross-validated linear predictors from the base-learners with co-data. Each meta-feature, one column of the n × m matrix Ĥ^(0,cv), corresponds to one source of co-data. In the meta-layer, we regress the target on the meta-features and the base-features, h(E[y_i]) = β_0 + Σ_{k=1}^m ω_k ĥ_ik^(0,cv) + Σ_{j=1}^p β_j x_ij, with non-negativity constraints for the weights for the meta-features (ω_1 ≥ 0, …, ω_m ≥ 0) but without constraints for the intercept (β_0) and the slopes for the base-features (β_1, …, β_p).
We estimate the weights for the meta-features and the slopes for the base-features using penalized maximum likelihood: {ω̂, β̂} = argmax_{ω,β} {L(x; ω, β) − ρ(λ, β)}, where L(x; ω, β) is the likelihood and ρ(λ, β) is the penalty. We do not penalize the weights for the meta-features (m ≪ n) but only the slopes for the base-features (p ≫ n). The more sources of co-data are available, the more it becomes necessary to penalize their weights. But then the weights ω and the slopes β might need differential penalization, for example a lasso penalty for the meta-features (selection of sources) and a ridge penalty for the base-features (many small effects). To make this computationally efficient, we would need a fast cross-validation procedure for multiple penalties (cf. van de Wiel et al. 2021) with non-negativity constraints (meta-features) and mixed lasso and ridge penalization (meta-features versus base-features). This extension is therefore only applicable in settings with few sources of co-data.
The predicted value for a previously unseen sample i is ŷ_i = h⁻¹(β̂_0 + Σ_{j=1}^p β̂*_j x_ij), where β̂*_j = Σ_{k=1}^m ω̂_k γ̂_jk + β̂_j. As the coefficients β̂ are shrunk towards zero but the coefficients ω̂ are not penalized, the combined coefficients β̂* are shrunk towards a weighted sum of the calibrated prior effects (γ̂ω̂). When the regularization parameter tends to infinity (λ → ∞), the estimated deviations from this weighted sum approach zero (β̂ → 0) and the combined estimates approach this weighted sum (β̂* → γ̂ω̂). Lasso regularization ensures sparsity in the deviations from the calibrated prior effects (Σ_{j=1}^p 1[β̂_j ≠ 0] ≪ p), in contrast to ridge regularization, but not in the combined coefficients (Σ_{j=1}^p 1[β̂*_j ≠ 0] ≲ p). As the combined coefficients may deviate more from unreliable than from reliable calibrated prior effects, this extension is suitable for partially informative co-data. In contrast to 'standard stacking', we refer to this extension as 'simultaneous stacking'. Algorithm 1 includes the pseudo-code for both approaches.
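For the Gaussian family, a ridge-penalty variant of simultaneous stacking can be sketched as a bounded least-squares problem: the ridge penalty on the base-feature slopes is encoded by augmented rows, while the meta-feature weights and the intercept stay unpenalized (the lasso deviations discussed above would need a dedicated solver; all names are ours):

```python
import numpy as np
from scipy.optimize import lsq_linear

def simultaneous_stacking(H, X, y, lam=1.0):
    """Minimize ||y - beta0 - H @ omega - X @ beta||^2 + lam * ||beta||^2
    subject to omega >= 0, with beta0 and omega unpenalized."""
    n, m = H.shape
    p = X.shape[1]
    A = np.hstack([np.ones((n, 1)), H, X])
    # augmented rows: sqrt(lam) * I penalizes only the p base-feature slopes
    pen = np.hstack([np.zeros((p, 1 + m)), np.sqrt(lam) * np.eye(p)])
    sol = lsq_linear(
        np.vstack([A, pen]),
        np.concatenate([y, np.zeros(p)]),
        bounds=(
            np.concatenate([[-np.inf], np.zeros(m), np.full(p, -np.inf)]),
            np.full(1 + m + p, np.inf),
        ),
    )
    return sol.x[0], sol.x[1:1 + m], sol.x[1 + m:]

rng = np.random.default_rng(5)
H = rng.standard_normal((80, 2))           # meta-features (cv linear predictors)
X = rng.standard_normal((80, 10))          # base-features
y = 1.5 * H[:, 0] + 0.2 * (X @ rng.standard_normal(10)) + rng.standard_normal(80)
beta0, omega, beta = simultaneous_stacking(H, X, y)
```

Because only the base-feature slopes appear in the augmented rows, increasing `lam` shrinks the deviations towards zero while leaving the meta-feature weights free, mirroring the limiting behaviour described above.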

Simulation
We performed two simulation studies to compare the predictive performance between our transfer learning method and the one from Tian and Feng (2022). In contrast to the method from Tian and Feng (2022), which requires the feature-target effects in the target and the source dataset(s) to be positively correlated and on the same scale (i.e. β_target ≈ β_source), our method also allows for negatively correlated effects and for effects on different scales (i.e. β_target ≈ c × β_source). Although it is possible to overcome this restriction by inverting the target (Gaussian: y_source → −y_source, binomial: y_source → 1 − y_source), by re-scaling the target (Gaussian: y_source → (1/c) × y_source), or by inverting or re-scaling the features (X_source → c × X_source), we believe it is more user-friendly to directly allow for negative correlations and different scales. To ensure a fair comparison between the two methods, we simulate positively correlated effects on the same scale. Furthermore, although the method from Tian and Feng (2022) is in theory also suitable for mixed response types, the current version of the related R package glmtrans requires the source dataset(s) and the target dataset to have the same response type (Gaussian, binomial, Poisson). We therefore always simulate the same response type in the source and target domains. In addition, we also compared our method with the one from Kawaguchi et al. (2022, xrnet).

External simulation
We use the simulation approach from Tian and Feng (2022).
In each iteration, we call the function glmtrans::models with the arguments (1) family of distributions: family = "gaussian" (default) or family = "binomial", (2) source or target datasets: type = "all" (default), (3) difference between source and target coefficients: h = 5 (default) or h = 250, (4) number of source datasets: K = 5 (default), (5) sample size for the target dataset: n.target = 100 (default), (6) sample size for each source dataset: n.source = 150 (default), (7) number of non-zero coefficients: s = 15 (default) or s = 50, (8) number of features: p = 1000 (default), and (9) the number of transferable source datasets. The simulation from Tian and Feng (2022) involves the following steps:
• Features: The correlation between features i and j is set to Σ_ij = 0.5^|i−j|, where i and j are in {1, …, p}. Let Σ represent the correlation matrix and let Σ = Uᵀ U represent its Cholesky decomposition, where U is an upper triangular matrix. For the target dataset (n_0 = 100) and each source dataset (n_1 = ⋯ = n_5 = 150), the n_0 × p matrix X_0 = E_0 U and the n_k × p matrices X_k = E_k U for k in {1, …, 5} represent the features, where the n_0 × p matrix E_0 and the n_k × p matrices E_k contain Gaussian noise.
• Coefficients: Let β_j represent the effect of feature j, for j in {1, …, p}, and denote the p-dimensional coefficient vectors by β_0 for the target dataset and {β_1, …, β_5} for the source datasets. For the target dataset, the first s elements are set to β_j = 0.5 (causal) and the last p − s elements are set to β_j = 0 (non-causal). For transferable source datasets, the first s elements are set to β_j = 0.5 + (−1)^{z_j} h/p and the last p − s elements are set to β_j = (−1)^{z_j} h/p, where z_j is a realization of z_j ~ Bernoulli(0.5). For non-transferable source datasets, the first s elements are set to β_j = (−1)^{z_j} 2h/p (non-causal), the next s elements are set to β_j = 0.5 + (−1)^{z_j} 2h/p (causal), and the last p − 2s elements are a random sample of s causal and p − 3s non-causal elements generated in the same way. To obtain a non-transferable source, it would be sufficient to randomly select causal elements rather than inverting causal and non-causal elements (indices 1 to 2s).
• Targets: In the Gaussian case, the target vector is the n_0-dimensional vector y_0 = X_0 β_0 + ε_0 for the target dataset and the n_k-dimensional vector y_k = 0.5 + X_k β_k + ε_k for source dataset k, where the vectors {ε_0, …, ε_5} contain Gaussian noise. In the binomial case, let π_0 = 1/(1 + exp(−X_0 β_0)) for the target dataset and π_k = 1/(1 + exp(−0.5 − X_k β_k)) for the source datasets. The vectors {y_0, …, y_5} are the target vectors, with each element following a Bernoulli distribution with the probability given by {π_0, …, π_5}.
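As a rough numpy sketch of these steps (one transferable Gaussian source with default settings; this is our illustration, not the glmtrans::models code itself):

```python
import numpy as np

rng = np.random.default_rng(1)
p, s, h = 1000, 15, 5
n_target, n_source = 100, 150

# Features: correlation Sigma_ij = 0.5**|i-j|, imposed via the Cholesky
# factor U (upper triangular, Sigma = U.T @ U).
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
U = np.linalg.cholesky(Sigma).T
X_target = rng.normal(size=(n_target, p)) @ U
X_source = rng.normal(size=(n_source, p)) @ U

# Coefficients: s causal effects of 0.5 in the target; the transferable
# source perturbs every coefficient by +/- h/p.
beta_target = np.zeros(p)
beta_target[:s] = 0.5
z = rng.integers(0, 2, size=p)
beta_source = beta_target + (-1.0) ** z * h / p

# Gaussian targets (intercept 0.5 for the source, as in the binomial case).
y_target = X_target @ beta_target + rng.normal(size=n_target)
y_source = 0.5 + X_source @ beta_source + rng.normal(size=n_source)
```

Multiplying Gaussian noise by the upper-triangular Cholesky factor is the standard way to induce the desired correlation structure without sampling from a 1000-dimensional multivariate normal directly.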

Internal simulation
The simulation from Tian and Feng (2022) uses the same effect size for all causal features and a decreasing correlation structure with a fixed base. We therefore designed our own data-generating mechanism (i) to simulate different effect sizes for different causal features (β_j ∈ ℝ instead of β_j ∈ {0, 0.5}) and (ii) to vary the strength of correlation between features (σ_ij = ρ_x^|i−j| instead of σ_ij = 0.5^|i−j|). Our simulation involves the following steps:
• Features: Setting the mean of feature i to μ_i = 0, the variance of feature i to σ_ii = 1, and the covariance between features i and j to σ_ij = ρ_x^|i−j|, for all i and j in {1, …, p}, we simulate multiple feature matrices from the multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, namely the n_0 × p feature matrix X_0 for the target dataset and the n_k × p feature matrices {X_1, X_2, X_3} for the source datasets (n_0 = 100, n_1 = n_2 = n_3 = 150, p = 500).
• Coefficients: Setting the mean and the variance for dataset k to μ_k = 0 and σ_kk = 1, for all k in {0, 1, 2, 3}, and the covariance between datasets k and l to σ_kl = 0 if either k or l equals 1, or to σ_kl = ρ_β if both k and l are in {0, 2, 3}, we simulate two p × 4 matrices from the multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, namely B_1 and B_2. We define the coefficients as B = B_1 ∘ I[B_2 > Φ⁻¹(1 − π)] (element-wise product), where Φ is the Gaussian cumulative distribution function and π equals 0.2 (dense) or 0.05 (sparse), and denote the p-dimensional coefficient vectors by β_0 for the target dataset and by {β_1, β_2, β_3} for the source datasets. While one set of coefficients is non-transferable (β_1), we transform the transferable sets of coefficients with β_2 → sign(β_2)|β_2|² and β_3 → sign(β_3)√|β_3|.
• Targets: For the target dataset, we compute z_0 = X_0 β_0 and standardize z_0 to obtain z*_0. For the source datasets, we proceed similarly to obtain {z*_1, z*_2, z*_3}. The simulated targets equal y_k = h⁻¹(√w z*_k + √(1 − w) ε_k), where h(·) is a link function and ε_k follows a standard Gaussian distribution, for k in {0, …, 3}. While h(·) is the identity link in the Gaussian case, it is the logit link in the binomial case, where the simulated probabilities are rounded to simulated classes. We set the signal-to-noise ratio to 4:1 (w = 0.8).

Algorithm 1. Pseudo-code for the proposed transfer learning method with standard stacking (left) and simultaneous stacking (right). Sub-procedures (italicized) are explained below.
• Exp/Iso-Calibrate†: Performs exponential or isotonic calibration. Requires target vector, feature matrix, and prior effects. Ensures calibrated prior effects.
• Ridge/Lasso-CV†: Tunes the hyperparameter of ridge or lasso regression by k-fold cross-validation. Requires target vector, feature matrix, and fold identifiers. Ensures optimal regularization parameter.
• Ridge/Lasso-Fit†: Estimates the parameters of ridge or lasso regression. Requires target vector, feature matrix, and regularization parameter. Ensures estimated coefficients.
• Sta/Sim-Combine: Combines estimated parameters from standard stacking (Equation 1) or simultaneous stacking (Equation 2). Requires estimated parameters from base-learners and meta-learner. Ensures combined estimates.
† Note: The procedures Exp/Iso-Calibrate, Ridge/Lasso-CV, and Ridge/Lasso-Fit depend on the family of distributions (Gaussian versus binomial), but they also require choices (exponential versus isotonic, ridge versus lasso).
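A hedged numpy sketch of this internal data-generating mechanism for the Gaussian case (the decomposition √w z* + √(1−w) ε is our reading of the 4:1 signal-to-noise setting; ρ_x and ρ_β are set to example values):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
p, n0 = 500, 100
rho_x, rho_b, pi, w = 0.5, 0.5, 0.2, 0.8   # example values for rho_x, rho_b

# Features: multivariate Gaussian with covariance sigma_ij = rho_x**|i-j|.
idx = np.arange(p)
Sigma_x = rho_x ** np.abs(idx[:, None] - idx[None, :])
X0 = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n0)

# Coefficients: dataset 1 independent, datasets {0, 2, 3} correlated (rho_b).
Sigma_b = np.full((4, 4), rho_b)
np.fill_diagonal(Sigma_b, 1.0)
Sigma_b[1, :] = Sigma_b[:, 1] = 0.0
Sigma_b[1, 1] = 1.0
B1 = rng.multivariate_normal(np.zeros(4), Sigma_b, size=p)
B2 = rng.multivariate_normal(np.zeros(4), Sigma_b, size=p)
B = B1 * (B2 > NormalDist().inv_cdf(1 - pi))   # element-wise thresholding

# Non-linear distortion of the transferable sources (beta_2 and beta_3).
B[:, 2] = np.sign(B[:, 2]) * np.abs(B[:, 2]) ** 2
B[:, 3] = np.sign(B[:, 3]) * np.sqrt(np.abs(B[:, 3]))

# Gaussian targets: standardized linear predictor plus noise (SNR 4:1).
z0 = X0 @ B[:, 0]
z0_star = (z0 - z0.mean()) / z0.std()
y0 = np.sqrt(w) * z0_star + np.sqrt(1 - w) * rng.normal(size=n0)
```

With w = 0.8, the variance ratio of signal to noise is w/(1 − w) = 4, matching the stated 4:1 ratio.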

Simulation results
In addition to the target dataset, the method from Tian and Feng (2022, glmtrans) requires the source datasets, while our method (transreg) and the method from Kawaguchi et al. (2022, xrnet) require the prior effects derived from the source datasets. Since these methods have different requirements, we first simulate the source datasets (for glmtrans) and then derive the prior effects from the simulated source datasets (for xrnet and transreg). As prior effects, we use the estimated coefficients from penalized regression on the source datasets. We choose the type of regularization for all methods subject to the simulation setting, namely ridge regularization for dense settings (glmtrans: α = 0; xrnet and transreg: α_source = α_target = 0) and lasso or lasso-like elastic net regularization for sparse settings (glmtrans: α = 1; xrnet and transreg: α_source = 0.95, α_target = 1). The idea of the lasso-like elastic net regularization is to render the prior information more stable.
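The derivation of prior effects from a source dataset can be sketched with scikit-learn as a stand-in for the penalized regressions (note that glmnet's mixing parameter α corresponds to scikit-learn's l1_ratio; the data here are simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, ElasticNetCV

rng = np.random.default_rng(3)
n_src, p = 150, 50
beta_src = np.concatenate([np.full(5, 0.5), np.zeros(p - 5)])
X_src = rng.normal(size=(n_src, p))
y_src = X_src @ beta_src + rng.normal(size=n_src)

# Dense setting: ridge coefficients as prior effects (alpha_source = 0).
prior_dense = RidgeCV(alphas=np.logspace(-2, 2, 20)).fit(X_src, y_src).coef_

# Sparse setting: lasso-like elastic net (alpha_source = 0.95), which is
# meant to render the prior information more stable than a pure lasso.
prior_sparse = ElasticNetCV(l1_ratio=0.95, cv=5).fit(X_src, y_src).coef_
```

Either vector of estimated coefficients can then be passed to the target-dataset method as prior effects, without sharing the source data themselves.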
In each simulation setting, we simulate 100 training samples and 10 000 testing samples (hold-out) for the target dataset. Tables 3 and 4 show the predictive performance for the testing data in the external and internal simulation, under exponential and isotonic calibration. We observe that transfer learning (with glmtrans, xrnet, or transreg) often leads to a significant improvement with respect to standard penalized regression (with glmnet). Concerning the proposed method, this holds for the two different calibration approaches and the two different stacking approaches. These simulation studies do not show that the proposed transfer learning method outperforms other transfer learning methods.
The advantage of the proposed method as compared to the method from Tian and Feng (2022, glmtrans) is that it does not require the source data but only the prior effects derived from the source data, and the advantage as compared to the method from Kawaguchi et al. (2022, xrnet) is that it allows for non-linear relationships between the prior effects and the true effects. While the method from Kawaguchi et al. (2022, xrnet) shrinks the coefficients towards a linear function of the prior effects, the proposed method adapts the prior effects to the true effects through exponential or isotonic calibration.

External applications
First, we consider an adapted version of the application on cervical cancer from van de Wiel et al. (2016). The aim is to transfer information from a methylation study with biopsy samples to another methylation study with self-collected samples in order to better discriminate between low-grade and high-grade precursor lesions. Specifically, we transfer the signs of the effect sizes and the P-values from the source dataset to the target dataset (n = 44 samples, p = 9491 features). We then examine whether this prior information increases the predictive performance of ridge regression, which is more predictive than lasso regression in this application. Next to our transfer learning method (transreg) and the one from Kawaguchi et al. (2022, xrnet), we consider the co-data learning methods from Tay et al. (2023, fwelnet) and van Nee et al. (2021, ecpc). While the transfer learning methods exploit information on the importance and direction of the effects (co-data: −sign(coef) × log10(P-value)), the co-data learning methods only exploit information on their importance (co-data: −log10(P-value)). After 10 repetitions of 10-fold cross-validation, we observe that the proposed method (not with exponential but with isotonic calibration) often increases the predictive performance of ridge regression (transreg.exp.sta: 0/10, transreg.exp.sim: 4/10, transreg.iso.sta: 7/10, transreg.iso.sim: 10/10, fwelnet: 7/10, ecpc: 5/10, xrnet: 3/10). We also observe that exploiting information on the importance as well as the direction of the effects (transreg) can be more beneficial than exploiting information on the importance of the effects only (fwelnet, ecpc), as can be seen in the mean change in cross-validated logistic deviance (transreg.exp.sta: +6.56%, transreg.exp.sim: +0.43%, transreg.iso.sta: −2.50%, transreg.iso.sim: −9.25%, fwelnet: −0.12%, ecpc: −1.52%, xrnet: +2.59%). The relatively poor performance of the competing transfer learning method (xrnet) might be caused by a non-linear relationship between the prior effects (transformed P-values) and the true effects. Here, isotonic calibration outperforms exponential calibration, and simultaneous stacking outperforms standard stacking. A potential explanation for the large difference in performance between exponential and isotonic calibration is that positive effects might be more important than negative effects in this application, for a biological reason (methylation increases the probability of cancer) and a statistical reason (effects of overexpression are easier to detect than those of underexpression). While exponential calibration behaves symmetrically for negative and positive prior effects, isotonic calibration can shrink negative prior effects towards zero.
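The difference between the two kinds of co-data can be made concrete. Below, feature-wise two-sample t-tests stand in for the original differential-methylation analysis (our illustration on simulated data; transreg takes such prior effects directly as input):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 44, 200
X = rng.normal(size=(n, p))          # methylation-like features
y = rng.integers(0, 2, size=n)       # low-grade vs high-grade labels

# Feature-wise two-sample t-tests on the source data.
t, pval = stats.ttest_ind(X[y == 1], X[y == 0], axis=0)

# transreg/xrnet-style co-data: importance AND direction of the effects.
codata_signed = -np.sign(t) * np.log10(pval)

# fwelnet/ecpc-style co-data: importance only.
codata_unsigned = -np.log10(pval)
```

The signed version carries strictly more information: taking its absolute value recovers the unsigned version, but not vice versa.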
Second, we consider an adapted version of the application on pre-eclampsia from Tay et al. (2023). Measurements of p = 1125 plasma proteins are available for n = 166 patients at multiple time points. The aim is to transfer information from late time points (gestational age > 20 weeks) to early time points (gestational age ≤ 20 weeks). We repeatedly split the patients into one source dataset and one target dataset. Patients with only late time points are always in the source dataset, and other patients are randomly allocated to the source and the target dataset. (Note that this application is somewhat artificial, as it might be better to drop transfer learning in favour of using all earliest time points in the regression of interest.) Using the source dataset, we estimate two logistic regression models under ridge regularization, once using the early time points and once using all time points. For each patient, all time points are assigned to the same cross-validation fold, and the weight is split evenly among the time points. We then use the two sets of estimated regression coefficients as co-data for the target dataset. In the regression for the target dataset, we only include the earliest time point of each patient. Using 10-fold cross-validation, we estimate the predictive performance of ridge regression with and without transfer learning. After repeating source-target splitting and cross-validation 10 times, we observe that transfer learning tends to decrease the cross-validated logistic deviance (transreg.exp.sta: 8/10, transreg.exp.sim: 7/10, transreg.iso.sta: 7/10, transreg.iso.sim: 9/10, fwelnet: 5/10, ecpc: 6/10, xrnet: 10/10). It is more beneficial to share information not only on the importance but also on the direction of the effects, according to the mean change in cross-validated logistic deviance (transreg.exp.sta: −2.61%, transreg.exp.sim: −4.29%, transreg.iso.sta: −3.67%, transreg.iso.sim: −8.33%, fwelnet: +0.04%, ecpc: −6.87%, xrnet: −12.12%). Simultaneous stacking again outperforms standard stacking, but exponential and isotonic calibration show a similar performance. In this application, where the prior effects are probably close to the true effects, the competing transfer learning method (xrnet) outperforms the proposed one.
For both applications, Fig. 2 shows the change in predictive performance from modelling without to modelling with prior information.

Internal application
In this application, we transfer information from a meta-analysis of genome-wide association studies on Parkinson's disease (PD-GWAS, Nalls et al. 2019) to the Luxembourg Parkinson's study (LuxPARK, Hipp et al. 2018). The aim is to classify samples into Parkinson's disease (PD) patients and healthy controls based on SNPs.
At the time of our study, the LuxPARK dataset included genotyping and clinical data of 790 PD cases and 766 healthy controls (Blauwendraat et al. 2017). Quality control steps for the genotyping data were conducted according to the standard procedures reported previously (Pavelka et al. 2022). Missing genotyping data were imputed using the reference panel from the Haplotype Reference Consortium (release 1.1) on the Michigan Imputation Server (Das et al. 2016) (RRID: SCR_017579), with a filter for imputation quality (r² > 0.3).
As common SNPs exhibit weak effects on PD, the sample size is likely insufficient to train a highly predictive model. However, publicly available summary statistics from the largest PD-GWAS to date (with around 38 000 cases and 1 400 000 controls of European ancestry) (Nalls et al. 2019) might serve as prior information on the SNP effects. For each SNP, these summary statistics are the combined results from simple logistic regression of the PD status on the SNP, namely the estimated slope (logarithmic odds ratio), its standard error, and the associated P-value. Importantly, the LuxPARK cohort was not part of the PD-GWAS, meaning that the prior information comes from independent data. As the LuxPARK cohort and the PD-GWAS cohorts have a similar ethnic background, the prior information might allow us to increase the predictive performance.
The two lists of SNPs, from the LuxPARK genotyping data (target dataset) and the PD-GWAS summary statistics (source dataset), are partially overlapping. SNP data are high-dimensional and strongly correlated. From each block of SNPs in the target dataset (250 kb window), we retain the most significant one and those that are in weak pairwise linkage disequilibrium with it (r² < 0.1). Next, we only retain the SNPs that also appear in the source dataset. These two filtering steps together reduce the dimensionality of the target dataset from around 18 million SNPs to 196 018 SNPs. We code the SNP data for dominant effects, with 0 meaning no alternate allele (0/0) and 1 meaning one or two alternate alleles (0/1 or 1/1).
It seems that the results from the source dataset are informative, because 5.80% of the P-values are nominally significant at the 0.05 level (11 377 out of 196 018), 77 are significant at a false discovery rate of 5% (Benjamini-Hochberg), and 35 are significant at a family-wise error rate of 5% (Holm-Bonferroni). As SNPs with a low minor allele frequency might have large effect sizes but insignificant P-values, we base the prior effects not on the estimated coefficients (β̂) but on the signed logarithmic P-values (−sign(β̂) × log10(P)). For each SNP, we compared the reference and the alternate alleles between the two datasets: (i) if both datasets have the same reference allele and the same alternate allele, the signed logarithmic P-value from the source dataset becomes the prior effect for the target dataset; (ii) if the reference allele of each dataset is the alternate allele of the other dataset (swapped alleles), we invert the sign of the signed logarithmic P-value; and (iii) if the two datasets have two different sets of alleles (multi-allelic SNP), we set the prior effect to zero.
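The allele-matching rules (i)-(iii) can be expressed as a small helper. The dictionary fields and function name below are hypothetical, not transreg's interface:

```python
# Hypothetical helper illustrating the three allele-matching cases; the
# field names ("ref", "alt", "signed_logp") are assumptions for this sketch.
def prior_effect(source, target):
    """Map a source SNP's signed log10 P-value onto the target coding."""
    if (source["ref"], source["alt"]) == (target["ref"], target["alt"]):
        return source["signed_logp"]            # (i) same alleles
    if (source["ref"], source["alt"]) == (target["alt"], target["ref"]):
        return -source["signed_logp"]           # (ii) swapped alleles
    return 0.0                                  # (iii) multi-allelic mismatch

src = {"ref": "A", "alt": "G", "signed_logp": 2.3}
assert prior_effect(src, {"ref": "A", "alt": "G"}) == 2.3
assert prior_effect(src, {"ref": "G", "alt": "A"}) == -2.3
assert prior_effect(src, {"ref": "A", "alt": "T"}) == 0.0
```

Swapping reference and alternate alleles reverses the coding of the genotype, so inverting the sign of the prior effect keeps its direction consistent with the target dataset.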
Rather than using all 196 018 SNPs for predictive modelling in the target dataset, we also filter them based on their significance in the source dataset (which is already a type of transfer learning). For each cut-off in {5 × 10⁻², 5 × 10⁻³, …, 5 × 10⁻¹⁰}, we exclude all SNPs with P-values above the cut-off and include all SNPs with P-values below it. This means that for the target dataset, we retain a specific number of the most significant SNPs from the source dataset. For each significance cut-off, we compare three modelling approaches:
• Uninformed approach: We use logistic regression with ridge or lasso penalization to model the PD status based on the included SNPs. All included SNPs are treated equally, irrespective of their estimated effect in the source dataset.
• Naïve transfer learning: After calculating for each sample the sum across the signed logarithmic P-values from the source dataset multiplied by the SNPs from the target dataset, we fit a simple logistic regression of the PD status on this sum.
• Transfer learning: The proposed transfer learning approach uses the signed logarithmic P-values from the source dataset as prior effects for the target dataset.
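The naïve transfer learning score can be sketched in a few lines (simulated data; scikit-learn's LogisticRegression stands in for the simple logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, p = 200, 300
snps = rng.integers(0, 2, size=(n, p)).astype(float)   # dominant coding 0/1
signed_logp = rng.normal(size=p)                       # prior effects
pd_status = rng.integers(0, 2, size=n)                 # illustrative labels

# One prior-weighted sum per sample ...
score = snps @ signed_logp

# ... and a simple (one-predictor) logistic regression of the PD status
# on this sum, which only re-scales and shifts the score.
model = LogisticRegression().fit(score.reshape(-1, 1), pd_status)
```

Because only an intercept and a single slope are estimated, this approach relies almost entirely on the prior effects, which is what makes it a useful baseline between the uninformed approach and full transfer learning.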
Figure 3 shows the predictive performance of modelling with estimated effects (uninformed approach), with prior effects (naïve transfer learning), or with both (transfer learning). We obtained the results with repeated nested cross-validation (10 repetitions, 10 external folds, 10 internal folds), using the same folds for all methods. If the significance cut-off is very strict, leading to a small number of significant SNPs, transfer learning does not improve the predictive performance of ridge and lasso regression. In these low-dimensional settings with many fewer SNPs than samples, prior information on the SNPs is not helpful. Otherwise, however, transfer learning does improve the predictive performance of ridge and lasso regression. This holds for all four flavours of the proposed transfer learning method (exponential versus isotonic calibration, standard versus simultaneous stacking), but isotonic calibration works considerably better than exponential calibration, and simultaneous stacking works marginally better than standard stacking.
Depending on the significance cut-off determining the number of significant SNPs, the performance of naïve transfer learning can be as high as that of transfer learning with isotonic calibration. In these cases, the prior effects are predictive to the extent that it is not even necessary to estimate any effects. An explanation for the high performance of naïve transfer learning might be (i) the large sample size in the source dataset for testing the marginal effects of the SNPs, together with (ii) the linkage disequilibrium clumping leading to a selection of relatively independent SNPs.

Discussion
We proposed a two-step transfer learning method for exploiting estimated coefficients from related studies to improve the predictive performance in the study of interest. First, we adapt the prior effects from the source datasets to the target dataset, either with exponential or with isotonic calibration. While exponential calibration is more robust to outliers (only three free parameters), isotonic calibration is more flexible (it only maintains the order of the prior effects). We expect the former to be superior if the prior effects are close to the true effects, and the latter to be superior if there is no exponential relationship. Second, we combine the calibrated prior effects with information from the observed data, based on two variants of stacked generalization. While the first variant (standard stacking) is more suitable if there are many sources of co-data ('averaging calibrated prior effects and estimated effects'), the second variant (simultaneous stacking) is more suitable if there is one source of co-data with partially unreliable or partially missing prior effects ('shrinking combined effects towards calibrated prior effects').
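The two calibration flavours can be sketched as follows. The exponential form shown (scale times signed power, two of the free parameters) is an assumed parameterization for illustration, not necessarily transreg's exact one; the isotonic step uses scikit-learn's IsotonicRegression on simulated data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def exp_calibrate(prior, a, c):
    """Assumed exponential calibration: scale a, exponent c, sign kept."""
    return a * np.sign(prior) * np.abs(prior) ** c

rng = np.random.default_rng(6)
p = 100
prior = np.sort(rng.normal(size=p))              # prior effects
true = 0.8 * prior + 0.1 * rng.normal(size=p)    # roughly monotone relation

# Isotonic calibration: a free-form monotone map from prior effects to
# calibrated effects, which only maintains the order of the prior effects.
iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
calibrated = iso.fit_transform(prior, true)
```

Because the isotonic fit is monotone, the calibrated effects never change the relative ranking of the prior effects, and negative prior effects can be shrunk towards zero without affecting positive ones, matching the asymmetric behaviour observed in the cervical cancer application.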
The proposed transfer learning method allows for multiple sources of prior information. It does not require the source dataset(s) but only the prior effects derived from the source dataset(s). The proposed sequential transfer learning method has a competitive predictive performance compared with the less flexible parallel transfer learning method from Tian and Feng (2022, glmtrans). In the case of closely related tasks, accounting for prior effects with transfer learning seems to be more beneficial than accounting for prior weights with the co-data methods from Tay et al. (2023, fwelnet) and van Nee et al. (2021, ecpc). And in the case of non-linearly related prior effects, the proposed method seems to outperform the one from Kawaguchi et al. (2022, xrnet). We therefore believe that the proposed method could tackle many biomedical prediction problems with one or more sets of prior effects.
In some applications, only one type of prior information derived from the source datasets is available. In other applications, multiple types of prior information are available (or the source datasets themselves). Then we can choose from multiple types of prior information. If the source and target datasets have the same feature space, estimated coefficients from penalized regression might be a reasonable choice. If the feature spaces are different, however, it is problematic that (i) lasso regression erratically selects among correlated features and (ii) ridge regression distributes weight among correlated features. This means that the presence or absence of additional correlated features in the source datasets might change the prior information on the features of interest. The same problem arises under contamination of a subset of features (van de Wiel et al. 2016). We therefore expect that signed logarithmic P-values from pairwise testing (−sign(β̂) × log10(P)) will often be more informative than estimated coefficients from multiple regression (β̂).

Figure 3. Predictive performance in internal application. Mean cross-validated area under the ROC curve (AUC) from 10 times 10-fold cross-validation (y-axis) against P-value cut-off (x-axis) for regression without (solid line) and with (dashed lines) transfer learning (bright blue: standard stacking, dark blue: simultaneous stacking), under either ridge (left) or lasso (right) regularization and either exponential (top) or isotonic (bottom) calibration. The numbers within brackets indicate the dimensionality, and the dotted line is for naïve transfer learning. While AUC = 0.5 represents random classification, AUC = 1.0 represents perfect classification. Given 766 controls and 790 cases, a random classifier achieves an AUC above 0.524 in 5% of the cases.

Figure 2. Change in predictive performance in external applications due to prior information. Percentage change in cross-validated logistic deviance (y-axis) from ridge regression (with glmnet) to other methods (x-axis), for 10 repetitions of 10-fold cross-validation. A negative value (blue) indicates an improvement, and a positive value (red) indicates a deterioration, based on the metric 'logistic deviance'. Top: application on cervical cancer. Bottom: application on pre-eclampsia. From left to right: co-data learning methods (fwelnet, ecpc), competing transfer learning method (xrnet), proposed transfer learning method (transreg) with exponential/isotonic calibration and standard/simultaneous stacking.

Table 1. Abstract representation of the dataset of interest (without asterisk, black) and an additional dataset (with asterisk, grey).

Table 3. Predictive performance in external simulation.

Table 4. Predictive performance in internal simulation.