Fractional ridge regression: a fast, interpretable reparameterization of ridge regression

Abstract Background Ridge regression is a regularization technique that penalizes the L2-norm of the coefficients in linear regression. One of the challenges of using ridge regression is the need to set a hyperparameter (α) that controls the amount of regularization. Cross-validation is typically used to select the best α from a set of candidates. However, efficient and appropriate selection of α can be challenging. This becomes prohibitive when large amounts of data are analyzed. Because the selected α depends on the scale of the data and correlations across predictors, it is also not straightforwardly interpretable. Results The present work addresses these challenges through a novel approach to ridge regression. We propose to reparameterize ridge regression in terms of the ratio γ between the L2-norms of the regularized and unregularized coefficients. We provide an algorithm that efficiently implements this approach, called fractional ridge regression, as well as open-source software implementations in Python and MATLAB (https://github.com/nrdg/fracridge). We show that the proposed method is fast and scalable for large-scale data problems. In brain imaging data, we demonstrate that this approach delivers results that are straightforward to interpret and compare across models and datasets. Conclusion Fractional ridge regression has several benefits: the solutions obtained for different values of γ are guaranteed to differ from one another, guarding against wasted calculations, and they automatically span the relevant range of regularization, avoiding the need for arduous manual exploration. These properties make fractional ridge regression particularly suitable for analysis of large complex datasets.


Introduction
Consider the standard linear model setting Y = Xβ solved for β, where Y is the (d, t) data matrix (d data points in each of t targets), X is the (d, p) design matrix (d data points for each of p predictors), and β is a (p, t) matrix (with p coefficients, one for each predictor, for each of the targets). Ordinary least-squares regression (OLS) and regression based on the Moore-Penrose pseudoinverse (in cases where p > d) attempt to find the set of coefficients β that minimize squared error for each of the targets y. While these unregularized approaches have some desirable properties, in practical applications they tend to overfit the coefficients to the noise present in the data. Moreover, they tend to produce unstable parameter estimates in situations where predictors are highly correlated.
Ridge regression (Hoerl and Kennard, 1970) addresses these issues by trading off the addition of some bias for the reduction of eventual error (e.g., measured using cross-validation (Stone, 1974, 1978)). It does so by penalizing not only the sum of the squared errors in fitting the data for each target, but also the squared L2-norm of the solution, ||β||_2^2 = Σ_i β_i^2. Fortunately, this form of regularization does not incur a substantial computational cost. This is because it can be implemented using the same numerical approach used for solving unregularized regression, with the simple addition of a diagonal matrix αI to the standard matrix equations. Thus, the computational cost of solving ridge regression is essentially identical to that of the unregularized solution. Thanks to its simplicity, computational expedience, and robustness in different data regimes, ridge regression is a very popular technique, with the classical references describing the method (Hoerl and Kennard, 1970; Tikhonov and Arsenin, 1977) cited more than 25,000 times according to Google Scholar.
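As a concrete illustration of this point, the following NumPy sketch (toy data, arbitrary α) shows that ridge regression reuses the same linear solve as OLS, differing only by the αI term added to the normal equations, and that the resulting solution has a smaller L2-norm:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 100, 20
X = rng.standard_normal((d, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.5 * rng.standard_normal(d)

# OLS: solve (X^t X) beta = X^t y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: the identical computation, plus alpha on the diagonal
alpha = 10.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Regularization shrinks the L2-norm of the solution
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```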
However, beneath the apparent simplicity of ridge regression lies the fact that for most applications, it is impossible to determine a priori the degree of regularization that yields the best solution. This means that in typical practice, researchers must test several different hyperparameter values of α and select the one that yields the least cross-validation error on a set of data specifically held out for hyperparameter selection. In large-scale data problems, the number of data points d, the number of predictors p, and/or the number of targets t can be quite large. This has the consequence that the number of hyperparameter values tested, f, can pose a prohibitive computational barrier.
Given the difficulty of predicting the effect of α on solution outcomes, it is common practice to test values that are widely distributed on a log scale (for example, see (Friedman et al., 2010)). Although this approach is not grounded in a particular theory, as long as the values span a large enough range and are spaced densely enough, an approximate minimum of the cross-validation error is likely to be found. But testing many α values can be quite costly, and the practitioner might feel tempted to cull the set of values tested. In addition, it is always a possibility that the initial chosen range might be mismatched to the problem at hand. Sampling α values that are too high or too low will produce non-informative candidate solutions that are either over-regularized (α too high) or too similar to the unregularized solution (α too low). Thus, in practice, conventional implementations of ridge regression may produce poor solutions and/or waste substantial computational time.
Here, we propose a simple reparameterization of ridge regression that overcomes the aforementioned challenges. Our approach is to produce coefficient solutions that have an L2-norm that is a pre-specified fraction of the L2-norm of the unregularized solution. In this approach, called fractional ridge regression (FRR), redundancies in candidate solutions are avoided because solutions with different fractional L2-norms are guaranteed to be different. Moreover, by targeting fractional L2-norms that span the full range from 0 to 1, the FRR approach explores the full range of effects of regularization on β values from under-to over-regularization, thus assuring that the best possible solution is within the range of solutions explored. We provide a fast and automated algorithm to calculate FRR, and provide open-source software implementations in Python and MATLAB. We demonstrate in benchmarking simulations that FRR is computationally efficient for even extremely large data problems, and we show that FRR applies successfully to real-world data and delivers clear and interpretable results. Overall, FRR may prove particularly useful for researchers tackling large-scale datasets where automation, efficiency, and interpretability are critical.

Background and theory
Consider the dataset Y with dimensionality d (number of data points) by t (number of targets). Each column in Y represents a separate target for linear regression:

    y = Xβ + ε    (1)

where y is the measured data for a single target (dimensionality d by 1; we drop the index on y for notational convenience), X is the "design" matrix with predictors (dimensionality d by p), β are the coefficients (dimensionality p by 1), and ε is a noise term. Our typical objective is to solve for β in a way that minimizes the squared error. If X is full rank, the ordinary least-squares (OLS) solution to this problem is:

    β̂_OLS = (X^t X)^(-1) X^t y    (2)

In cases where X is not full rank, the OLS solution is no longer well-defined and the Moore-Penrose pseudoinverse is used instead. We will refer to these unregularized approaches collectively as OLS.
To regularize the OLS solution, ridge regression applies a penalty to the squared L2-norm of the coefficients, leading to a different estimator for β:

    β̂_RR = (X^t X + αI)^(-1) X^t y    (3)

where α is a hyperparameter and I is the identity matrix (Hoerl and Kennard, 1970; Tikhonov and Arsenin, 1977). For computational efficiency, it is well known that the original problem can be rewritten (Hastie and Tibshirani, 2004) using the singular value decomposition (SVD) of the matrix X:

    X = U S V^t    (4)

with dim(U) = (d, p), dim(S) = (p, p), and dim(V) = (p, p). Note that S is a square diagonal matrix:

    S = diag(λ_1, λ_2, …, λ_p)    (5)

with the singular values λ_i ordered from largest to smallest. Replacing the design matrix X with its SVD, we obtain:

    y = U S V^t β + ε    (6)

Given that U has orthonormal columns (i.e., U^t U is I), left-multiplying each side by U^t produces:

    U^t y = S V^t β + U^t ε    (7)

Let ỹ = U^t y, β̃ = V^t β, and ε̃ = U^t ε. These are transformations (rotations) of the original quantities (y, β, and ε) through the matrices U^t and V^t. In cases where p < d, this also projects the quantities into a lower-dimensional space of dimensionality p, in which the model takes the simple form:

    ỹ = S β̃ + ε̃    (8)

The OLS solution can be obtained in this space:

    β̃_OLS = (S^t S)^(-1) S^t ỹ    (9)

Because S is diagonal, (S^t S)^(-1) S^t is also diagonal, and the matrix products reduce to element-wise (Hadamard) operations. Thus, in the lower-dimensional space, we can solve the OLS problem with scalar operations, which simplify finally to:

    β̃_OLS,i = ỹ_i / λ_i    (10)

The SVD-based reformulation of regression described above is additionally useful as it provides insight into the nature of ridge regression (Skouras et al., 1994). Specifically, consider the ridge regression solution in the low-dimensional space:

    β̃_RR = (S^t S + αI)^(-1) S^t ỹ    (11)

To compute this solution, we note that:

    S^t S + αI = diag(λ_1^2 + α, …, λ_p^2 + α)    (12)

the inverse of which is:

    (S^t S + αI)^(-1) = diag(1/(λ_1^2 + α), …, 1/(λ_p^2 + α))    (13)

Finally, plugging into equation 11, we obtain:

    β̃_RR,i = λ_i ỹ_i / (λ_i^2 + α)    (14)

This shows that in the low-dimensional space, ridge regression can be solved using scalar operations.
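The equivalence between the direct matrix-inversion route and the SVD route can be checked numerically. The following NumPy sketch (toy data, arbitrary α) computes the ridge solution both ways:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 50, 10
X = rng.standard_normal((d, p))
y = rng.standard_normal(d)
alpha = 3.0

# Direct ridge solution: (X^t X + alpha I)^(-1) X^t y
beta_direct = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# SVD route: rotate once, then solve with scalar operations only
U, lam, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(lam) V^t
y_tilde = U.T @ y                                   # rotated data
beta_tilde = lam * y_tilde / (lam ** 2 + alpha)     # ridge in the rotated space
beta_svd = Vt.T @ beta_tilde                        # rotate back to original space

assert np.allclose(beta_direct, beta_svd)
```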
To further illustrate the relationship between the ridge regression and OLS solutions, plugging equation 10 into equation 14, we observe the following:

    β̃_RR,i = (λ_i^2 / (λ_i^2 + α)) β̃_OLS,i    (15)

In other words, the ridge regression coefficients are simply scaled-down versions of the OLS coefficients, with a different amount of shrinkage for each coefficient. Coefficients associated with larger singular values are shrunken less than those associated with smaller singular values. To obtain solutions in the original space, it is necessary to left-multiply the coefficients by V:

    β̂ = V β̃    (16)

Next, we turn to fractional ridge regression (FRR). The core concept of FRR is to reparameterize ridge regression in terms of the amount of shrinkage applied to the overall L2-norm of the solution. Specifically, we define the fraction γ as:

    γ = ||β̃_RR||_2 / ||β̃_OLS||_2    (17)

Because V is a unitary transformation, the L2-norm of a coefficient solution in the low-dimensional space, ||β̃||_2, is identical to the L2-norm of the coefficient solution in the original space, ||β̂||_2. Thus, we can operate fully within the low-dimensional space and be guaranteed that the fractions will be maintained in the original space.
In FRR, instead of specifying desired values of α, we instead specify values of γ between 1 (no regularization) and 0 (full regularization). But how can one compute the ridge regression solution for a specific desired value of γ? Based on equations 9 and 14, it is easy to calculate the value of γ corresponding to a specific given α value:

    γ(α) = sqrt( Σ_i (λ_i ỹ_i / (λ_i^2 + α))^2 ) / sqrt( Σ_i (ỹ_i / λ_i)^2 )    (18)

In some special cases, this calculation can be considerably simplified. For example, if the eigenvalue spectrum of X is flat (λ_i = λ_j for any i, j), we can set all the singular values to λ, yielding the following:

    γ = λ^2 / (λ^2 + α)    (19)

This recapitulates the result obtained in (Hoerl and Kennard, 1970), equation 2.6. We can then solve for α:

    α = λ^2 (1 − γ) / γ    (20)

Thus, in this case, there is an analytic solution for the appropriate α value, and one can proceed to compute the ridge regression solution using equation 14. Another special case arises if we assume that all values of β̃_OLS,i are identical. In this case, we can easily calculate the achieved shrinkage, but in terms of the L1-norm (not the L2-norm):

    ||β̃_RR||_1 / ||β̃_OLS||_1 = (1/p) Σ_i λ_i^2 / (λ_i^2 + α)    (21)

Notice that the sum in this expression is the sum of the shrinkages for the individual coefficients from equation 15, and has been defined as the effective degrees of freedom of ridge regression (see Hastie et al. (2001), pg. 68). These two special cases have the appealing feature that the regularization level can be controlled on the basis of examining only the design matrix X. However, they rely on strong assumptions that are not guaranteed to hold in general. Thus, for accurate ridge regression outcomes, we see no choice but to develop an algorithm that uses both the design matrix X and the data values y.
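The dependence of γ on α, and the flat-spectrum closed form, can be verified numerically. The following NumPy sketch uses toy data; the flat spectrum is manufactured by scaling an orthonormal basis, which is an illustrative construction rather than a realistic design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 60, 8
X = rng.standard_normal((d, p))
y = rng.standard_normal(d)

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
y_tilde = U.T @ y
beta_ols = y_tilde / lam                 # OLS in the rotated space

def gamma(alpha):
    """Fraction gamma achieved by a given alpha (equations 14 and 17)."""
    beta_rr = lam * y_tilde / (lam ** 2 + alpha)
    return np.linalg.norm(beta_rr) / np.linalg.norm(beta_ols)

assert np.isclose(gamma(0.0), 1.0)       # alpha = 0 recovers the OLS solution
assert gamma(1.0) > gamma(100.0)         # gamma falls as alpha grows

# Flat-spectrum special case: gamma = lambda^2 / (lambda^2 + alpha), so
# alpha = lambda^2 (1 - gamma) / gamma can be computed analytically
Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
lam_flat = 2.0
Xf = lam_flat * Q                        # all singular values equal lam_flat
Uf, lf, _ = np.linalg.svd(Xf, full_matrices=False)
yf = Uf.T @ y
desired_gamma = 0.3
alpha_analytic = lam_flat ** 2 * (1 - desired_gamma) / desired_gamma
g = (np.linalg.norm(lf * yf / (lf ** 2 + alpha_analytic))
     / np.linalg.norm(yf / lf))
assert np.isclose(g, desired_gamma)
```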

An algorithm for solving fractional ridge regression
Our proposed algorithm is straightforward: it evaluates γ for a range of α values and uses interpolation to determine the α value that achieves the desired fraction γ. Although this method relies on brute force and may not seem mathematically elegant, it achieves accurate outcomes and, somewhat surprisingly, can be carried out with minimal computational cost.
The algorithm receives as input a design matrix X, target variables Y, and a set of requested fractions γ. The algorithm calculates the FRR solutions for all targets in Y, returning estimates of the coefficients β̂ as well as the values of the hyperparameter α that correspond to each requested γ. In the text below, we indicate the lines of code that implement each step of the algorithm (see also section 2.3 below) in the MATLAB (designated with "M") and Python (designated with "P") implementations.
1. Compute the SVD of the design matrix, USV^t = X (M247, P73). To avoid numerical instability, very small eigenvalues of X are treated as 0.
2. Rotate the data into the low-dimensional space, ỹ = U^t Y.
3. Compute the OLS solution in the rotated space, β̃_OLS,i = ỹ_i / λ_i (equation 10).
4. The values of α that satisfy the FRR requirement are guaranteed to lie within a range that depends on the eigenvalues of X. A series of initial candidate values of α are selected to span a log-spaced range from 10^-3 λ_p^2, much smaller than the square of the smallest singular value of the design matrix, to 10^3 λ_1^2, much larger than the square of the largest singular value of the design matrix (M295-299, P89-96). Based on testing on a variety of regression problems, we settled on a spacing of 0.2 log10 units within the range of candidate α values. This provides fine enough gridding that interpolation results are nearly perfect (empirical fractions deviate from the desired fractions by approximately 1% or less).
5. Based on equation 15, a scaling factor for every value of α_j and every eigenvalue λ_i is calculated as (M310-312, P97-98): s_ij = λ_i^2 / (λ_i^2 + α_j) (22)
6. The main loop of the algorithm iterates over targets. For every target, the scaling in equation 22 is applied to the computed OLS coefficients (from Step 3), and the L2-norm of the solution at each α_j is divided by the L2-norm of the OLS solution to determine the fractional length γ_j (M332-345, P111-112).
7. Interpolation is used with α j and γ j to find values of α that correspond to the desired values of γ (M361, P113). These target α values are then used to calculate the ridge regression solutions via equation 15 (M367, P116-117).
8. After the iteration over targets is complete, the solutions are transformed to the original space by multiplying: β̂ = V β̃ (M418, P119).
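The steps above can be condensed into a short NumPy sketch for a single target. This is a simplified illustration under the assumption that X has full rank and d ≥ p, not the actual fracridge implementation (which handles multiple targets, rank deficiency, and other edge cases):

```python
import numpy as np

def frac_ridge(X, y, fracs):
    """Sketch of the FRR algorithm for a single target y.

    Returns coefficients with shape (p, len(fracs)) and the alpha
    value found for each requested fraction.
    """
    # Step 1: SVD of the design matrix
    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    # Step 2: rotate the data into the low-dimensional space
    y_tilde = U.T @ y
    # Step 3: OLS solution in the rotated space (equation 10)
    beta_ols = y_tilde / lam

    # Step 4: log-spaced candidate alphas from 1e-3*lam_p^2 to 1e3*lam_1^2,
    # spaced 0.2 log10 units apart
    grid = np.arange(np.log10(lam[-1] ** 2) - 3,
                     np.log10(lam[0] ** 2) + 3, 0.2)
    alphas = 10.0 ** grid

    # Step 5: shrinkage factor for every (eigenvalue, alpha) pair (equation 22)
    scale = lam[:, None] ** 2 / (lam[:, None] ** 2 + alphas[None, :])

    # Step 6: fractional L2-norm achieved by each candidate alpha
    gammas = (np.linalg.norm(scale * beta_ols[:, None], axis=0)
              / np.linalg.norm(beta_ols))

    # Step 7: interpolate in log-alpha (gammas decrease with alpha, so
    # reverse both arrays to satisfy np.interp's increasing-x requirement)
    target_alphas = 10.0 ** np.interp(fracs, gammas[::-1], grid[::-1])
    shrink = lam[:, None] ** 2 / (lam[:, None] ** 2 + target_alphas[None, :])
    beta_tilde = shrink * beta_ols[:, None]

    # Step 8: rotate the solutions back to the original space
    return Vt.T @ beta_tilde, target_alphas

# Usage: the achieved fractional norms track the requested fractions
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(100)
fracs = np.linspace(0.1, 1.0, 10)
coef, alphas = frac_ridge(X, y, fracs)
achieved = np.linalg.norm(coef, axis=0) / np.linalg.norm(
    np.linalg.lstsq(X, y, rcond=None)[0])
assert np.allclose(achieved, fracs, atol=0.02)
```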
In summary, this algorithm requires just one (potentially computationally expensive) initial SVD of the design matrix. Operations done on a per-target basis are generally inexpensive, relying on fast vectorized array operations, with the exception of the interpolation step. Although a large range of candidate α values is evaluated internally by the algorithm, full solutions are computed only for the α values selected by interpolation, thereby limiting the costs associated with the final step (multiplication by V).

Software implementation
We implemented the algorithm described in section 2.2 in two different popular statistical computing languages: MATLAB and Python (example code in Figure 1). The code for both implementations is available at https://github.com/nrdg/fracridge and released under an OSI-approved, permissive open-source license to facilitate its broad use. In both MATLAB and Python, we used broadcasting to rapidly perform computations over multiple dimensions of arrays.
There are two potential performance bottlenecks in the code. One is the SVD step which is expensive both in terms of memory and computation time. This step is optimized by observing that for cases in which d > p (the number of data points is larger than the number of parameters), we can replace the singular values of X by the square roots of the singular values of X t X, which is p by p and therefore requires less memory for the SVD computation. The other potential performance bottleneck is the interpolation performed for each target. To optimize this step, we used fast interpolation functions that assume sorted inputs.
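The d > p optimization can be illustrated with a NumPy sketch (random toy matrix): the square roots of the eigenvalues of the p-by-p matrix X^tX match the singular values of X, so the large d-by-p decomposition can be avoided:

```python
import numpy as np

rng = np.random.default_rng(4)
d, p = 10_000, 50
X = rng.standard_normal((d, p))

# Direct route: SVD of the full (d, p) matrix
s_direct = np.linalg.svd(X, compute_uv=False)

# Economical route for d > p: eigendecomposition of the (p, p) Gram matrix
evals = np.linalg.eigvalsh(X.T @ X)     # ascending eigenvalues
s_gram = np.sqrt(evals[::-1])           # descending singular values

assert np.allclose(s_direct, s_gram)
# If the left singular vectors are needed, they can be recovered as
# U = X V S^(-1) without ever decomposing on the (d, d) side.
```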

MATLAB
The MATLAB implementation of FRR relies only on core MATLAB functions and a fast implementation of linear interpolation (Mier, 2020), which is copied into the fracridge source code, together with its license, which is compatible with the fracridge license. The MATLAB implementation includes an option to automatically standardize predictors (either center or also scale the predictors) before regularization, if desired.

Simulations
Numerical simulations were used to characterize FRR and compare it to a heuristic approach for hyperparameter selection. Simulations were conducted using the MATLAB implementation. We simulated two simple regression scenarios. The number of data points (d) was 100, and the number of predictors (p) was either 5 or 100. In each simulation, we first created a design matrix X (d, p) using the following procedure: (i) generate normally distributed values for X, (ii) induce correlation between predictors by selecting two predictors at random, setting one of the predictors to the sum of the two predictors plus normally distributed noise, and repeating this procedure 2p times, and (iii) z-score each predictor. Next, we created a set of "ground truth" coefficients β with dimensions (p, 1) by drawing values from the normal distribution. Finally, we simulated responses from the model (y = Xβ) and added normally distributed noise, producing a target variable y with dimensions (d, 1). Given design matrix X and target y, cross-validated regression was carried out. This was done by splitting X and y into two halves (50/50 training/testing split), solving ridge regression on one half (training) and evaluating generalization performance of the estimated regression β weights on the other half (testing). Performance was quantified using the coefficient of determination (R^2). For standard ridge regression, we evaluated a grid of α values that included 0 and ranged from 10^-4 through 10^5.5 in increments of 0.5 log10 units. For FRR, we evaluated a range of fractions γ from 0 to 1 in increments of 0.05. Thus, the number of hyperparameter values was f = 21 in both cases.
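The design-matrix construction described above can be sketched in NumPy as follows (toy sizes; the noise level and the exact way the random predictor pair is combined are our reading of steps (i)-(iii), chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
d, p = 100, 5

# (i) normally distributed design matrix
X = rng.standard_normal((d, p))

# (ii) induce correlations: 2p times, pick two predictors at random and
# set one to the sum of the two plus normally distributed noise
for _ in range(2 * p):
    i, j = rng.choice(p, size=2, replace=False)
    X[:, i] = X[:, i] + X[:, j] + rng.standard_normal(d)

# (iii) z-score each predictor
X = (X - X.mean(axis=0)) / X.std(axis=0)

# ground-truth coefficients and noisy responses
beta = rng.standard_normal((p, 1))
y = X @ beta + rng.standard_normal((d, 1))

# 50/50 training/testing split for cross-validated evaluation
X_train, X_test = X[:d // 2], X[d // 2:]
y_train, y_test = y[:d // 2], y[d // 2:]
```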
The code that implements these simulations is available in the "examples" folder of the software.

Performance benchmark
To characterize the performance of fractional ridge regression (FRR) and standard ridge regression (SRR) approaches, a set of numerical benchmarks was conducted using the MATLAB implementation. A range of regression scenarios were constructed. In each experiment, we first constructed a design matrix X (d, p) consisting of values drawn from a normal distribution. We then created "ground truth" coefficients β (p, t), also by drawing values from the normal distribution. Finally, we generated a set of data Y (d, t) by predicting the model response (y = Xβ) and adding zero-mean Gaussian noise with standard deviation equal to the standard deviation of the data from each target variable. Different numbers of regularization levels (f) were obtained for SRR by linearly spacing α values on a log10 scale from 10^-4 to 10^5, and for FRR by linearly spacing fractions from 0.05 to 1 in increments of 0.05.
Two versions of SRR were implemented and evaluated. The first version (naïve) involves a separate matrix (pseudo-)inversion for each desired hyperparameter setting. The second version (rotation-based) uses the SVD method described above (see section 2.1, specifically equation 14).
All simulations were run on an Intel Xeon E5-2683 2.10 GHz (32-core) workstation with 128 GB of RAM, a 64-bit Linux operating system, and MATLAB 8.3 (R2014a). Execution time was logged for the model fitting procedures only and did not include generation of the design matrix or the data. Likewise, memory requirements were recorded in terms of additional memory usage during the course of model fitting (i.e., zero memory usage corresponds to the total memory usage just prior to the start of model fitting). Benchmarking results were averaged across 15 independent simulations to reduce incidental variability.
The code that implements these benchmarks is available in the "examples" folder of the software.

Brain Magnetic Resonance Imaging data
Brain functional Magnetic Resonance Imaging (fMRI) data were collected in a 7 Tesla MRI instrument at a spatial resolution of 1.8 mm and a temporal resolution of 1.6 s, using a matrix size of [81 104 83]. This yielded a total of 783,432 voxels. Over the course of 40 separate scan sessions, a participant viewed 10,000 distinct images (3 presentations per image) while fixating a small dot placed at the center of the images (see Figure 3A). The images were 8.4 deg by 8.4 deg in size. Each image was presented for 3 s and was followed by a 1 s gap. Standard pre-processing steps were applied to the fMRI data to remove artifacts due to head motion and other confounding factors. To deal with session-wise nonstationarities, response amplitudes of each voxel were z-scored within each scan session. Responses were then concatenated across sessions and averaged across trials of the same image, and a final z-scoring of each voxel's responses was performed.
A regression model was used to predict the response observed in a voxel in terms of the local contrast present in the stimulus image. In the model, the stimulus image is preprocessed by taking the original color image (425 pixels by 425 pixels by 3 RGB channels), converting the image to grayscale, gridding the image into 25 by 25 regions, and then computing the standard deviation of luminance values within each grid region (Figure 4B). This produced 625 predictors, each of which was then z-scored. The design matrix X has dimensionality 10,000 images by 625 stimulus regions, while Y has dimensionality 10,000 images by 783,432 voxels.
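The image preprocessing can be sketched as follows. A random array stands in for a real stimulus image, and the simple channel average used for grayscale conversion is an assumption for illustration, not necessarily the luminance transform used in the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for one 425 x 425 x 3 RGB stimulus image (values in [0, 1])
image = rng.random((425, 425, 3))

# Convert to grayscale (channel average as a stand-in luminance transform)
gray = image.mean(axis=2)

# Grid into 25 x 25 regions (425 / 25 = 17 pixels per side) and compute
# the standard deviation of luminance within each region
blocks = gray.reshape(25, 17, 25, 17)
contrast = blocks.std(axis=(1, 3))       # (25, 25) local-contrast map

# One row of the design matrix X: 625 local-contrast predictors
predictors = contrast.ravel()
assert predictors.shape == (625,)
```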
Cross-validation was carried out using an 80/20 training/testing split. For standard ridge regression, we evaluated a grid of α values that included 0 and ranged from 10^-4 to 10^5.5 in increments of 0.5 log10 units. For fractional ridge regression, we evaluated a range of fractions from 0 to 1 in increments of 0.05. Cross-validation performance was quantified in terms of variance explained on the test set using the coefficient of determination (R^2).
The code that implements these analyses is available in the "examples" folder of the software.

Fractional ridge regression achieves the desired outcomes
In simulations, we demonstrate that the fractional ridge regression (FRR) algorithm accurately produces the desired fractions γ (Figure 2A,B, second row, right column of each). We compare the results of FRR to results of standard ridge regression (SRR), in which a commonly used heuristic is used to select α values (log-spaced values spanning a large range). For the SRR approach, we find that the fractional L2-norm is very small and virtually indistinguishable across large values of α, and is very similar to the OLS solution (fractional L2-norm approximately 1) for several small values of α (Figure 2A,B, second row, left column). In addition, cross-validation accuracy is indistinguishable for many of the values of α evaluated in SRR. Only very few values of α produce cross-validated R^2 values that are similar to the value provided by the best α (Figure 2A,B, first row, left column).
The SRR results can also be re-represented using effective degrees of freedom (DOF; Figure 2A,B, first row, middle column): several values of α result in essentially the same number of DOF, because these values are either much larger than the largest singular value or much smaller than the smallest singular value of X. In contrast to SRR, FRR produces a nicely behaved range of cross-validated R^2 values and dense sampling around the peak R^2.
Another line of evidence highlighting the diversity of the solutions provided by FRR is given by inspecting coefficient paths: in the log-spaced case, coefficients start very close to 0 (for high α) and rapidly increase (for lower α). Even when re-represented using DOF, the coefficient paths exhibit some redundancy. In contrast, FRR provides more gradual change in the coefficient paths, indicating that this approach explores the space of possible coefficient configurations more uniformly. Taken together, these analyses demonstrate that FRR provides a more useful range of regularization levels than SRR. Notice that in both scenarios, only the FRR method achieves regression solutions whose L2-norms increase linearly, with gradually changing coefficient paths.

FRR is computationally efficient
A question of relevance to potential users of FRR is whether using the method incurs significant computational cost. We compare FRR to two alternative approaches. The first approach is a naïve implementation of the matrix inversion specified in equation 3, in which the Moore-Penrose pseudo-inverse (implemented as pinv in MATLAB and numpy.linalg.pinv in Python) is performed independently for each setting of the hyperparameter α. The second approach takes advantage of the computational expedience of the SVD-based approach: instead of a matrix inversion for each α value, a single SVD is performed, a transformation (rotation) is applied to the data, and different values of α are plugged into equation 14 to compute the regression coefficients. This approach comprises a subset of the operations taken in FRR and therefore represents a lower bound in terms of computational requirements. Through systematic exploration of different problem sizes, we find that FRR performs quite favorably. FRR differs from the rotation-based approach only slightly with respect to execution-time scaling in the number of data points (Figure 3A, left column), in the number of parameters (Figure 3A, right column), and in f, the number of hyperparameter values considered (Figure 3A, third column). The naïve matrix-inversion approach is faster than both SVD-based approaches (FRR and rotation-based) for f < 20, but rapidly becomes much more costly for values above 20. This approach also scales rather poorly for p > 5,000.
In terms of memory consumption, the mean and maximum memory usage are very similar for FRR, the naïve and rotation-based solutions. These results suggest that for each of these approaches, the matrix inversion (for the naïve implementation of SRR) or the SVD (for FRR and the SVD-based SRR) represents the main computational bottleneck. Importantly, despite the fact that FRR uses additional gridding and interpolation steps, it does not perform substantially worse than either of the other approaches.

Application of FRR on real-world data
To demonstrate the practical utility of FRR, we explore its application in a specific scientific use-case. Data from a functional Magnetic Resonance Imaging (fMRI) experiment were analyzed with FRR and the results of this analysis were compared to a standard ridge regression (SRR) approach where α values are selected using a log-spaced heuristic. Different parts of the brain process different types of information, and a large swath of the cerebral cortex is known to respond to visual stimulation. Experiments that combine fMRI with computational analysis provide detailed information about the responses of different parts of the brain (Wandell et al., 2015). In the experiments analyzed here, a series of images are shown and the MRI blood-oxygen-level-dependent (BOLD) signal is recorded in a sampling grid of voxels throughout the brain ( Figure 4A). In the cerebral cortex, each voxel contains hundreds of thousands of neurons. If these neurons respond vigorously to the visual stimulus presented, the metabolic demand for oxygen in that part of cortex will drive a transient increase in oxygenated blood in that region, and the BOLD response will increase. Thus, a model of the BOLD response tells us about the selective responses of neurons in each voxel in cortex.
Because neurons in parts of the cerebral cortex that respond to visual stimuli are known to be particularly sensitive to local contrast, we model responses with respect to the standard deviation of luminance in each region of the image, rather than the luminance values themselves (Figure 4B). In the model, Y contains brain responses, where each target (column) represents the responses of a single voxel. Each row contains the responses of all voxels to a particular image. The design matrix X contains the local contrast in every region of the image, for every image. This means that the coefficients β represent weights on the stimulus image and indicate each voxel's spatial selectivity, i.e., the part of the image to which the voxel responds (Wandell and Winawer, 2015). Therefore, one way to visualize β̂ is to organize it according to the two-dimensional layout of the image (Figure 4C&D, bottom two rows).
Using FRR, we fit the model to voxel responses, and find robust model performance in the posterior part of the brain where visual cortex resides (left part of the horizontal slice presented in the top row of Figure 4C). The performance of the model can be observed in either the cross-validated R^2 values (Figure 4C, top row, left and middle panels) or the value of γ corresponding to the best cross-validated R^2. For example, we can focus on the two voxels highlighted in the middle panel of the top row in Figure 4C. One voxel, whose characteristics are further broken down in Figure 4D, has a lower cross-validated R^2 (4%) and requires stronger regularization (γ = 0.15). The spatial selectivity of this voxel's responses becomes very noisy at large γ values, and R^2 approaches 0. On the other hand, the voxel in Figure 4E has a higher γ (0.35) and a higher cross-validated R^2 (13%). Moreover, this voxel appears more robust, with higher values of γ producing less spatially noisy results. The maps of R^2 and γ in Figure 4C show that these trends hold more generally: voxels with more accurate models require less regularization.

Discussion
The main theoretical contribution of this work is a novel approach to hyperparameter specification in ridge regression. Instead of the standard approach in which a heuristic range of values for hyperparameter α are evaluated for their accuracy, the fractional ridge regression (FRR) approach focuses on achieving specific fractions for the L2-norms of the solutions relative to the L2-norm of the unregularized solution. In a sense, this is exactly in line with the original spirit of ridge regression, which places a penalty on the L2-norm of the solution. The main practical contribution of this work is the design and implementation of an efficient algorithm to solve FRR and validation of this algorithm on simulated and empirical data. Overall, we suggest that FRR can serve as a default approach to solving ridge regression.

The benefits of FRR
1. Theoretically-motivated and principled. The results in section 3.1 demonstrate that the theoretical motivation described in the Methods holds in practice. Our implementation of FRR produces ridge regression solutions that have predictable and tuneable fractional L2-norm.
2. Statistically efficient. Each fraction level returned by FRR produces β values that are distinctly different. This avoids the common pitfall of the log-spaced approach whereby computation is wasted on several values of α that all over-regularize or all under-regularize. When used with a range of γ values from 0 to 1, the solution that minimizes cross-validation error is guaranteed to exist within this range (although it may lie in between two of the obtained solutions).
3. Computationally efficient. We show that our implementation of FRR requires memory and computational time that are comparable to a naïve ridge regression approach and to an approach that uses SVD but relies on preset α values. SVD-based approaches (including FRR) scale linearly in f , with compute-time scaling better than naïve RR in the f > 20 regime. In practice, we have found that f = 20 evenly distributed values between 0 and 1 provide sufficient coverage for many problems. But the linear scaling implies that sampling more finely would not be limiting in cases where additional precision is needed.

4. Interpretable. FRR uses γ values that represent scaling relative to the L2-norm of the OLS solution. This allows FRR results to be compared across different targets within a dataset. This is exemplified in section 3.3, in which results from an fMRI experiment are interpreted both in light of cross-validated R^2 and the optimal γ that leads to the best cross-validated R^2. Moreover, regularization in different datasets and for different models (e.g., different settings of X) can be compared as being stronger or weaker. The optimal regularization level can be informative regarding the signal-to-noise ratio of a particular target or the level of collinearity of the design matrix (both of which influence the optimal level of regularization). FRR increases the interpretability of ridge regression: instead of an unscaled, relatively inscrutable value of α, we receive a scaled, relatively interpretable value of γ. Based on a recently proposed framework for interpretability in machine learning methods (Murdoch et al., 2019), we believe that this kind of advance improves the descriptive accuracy of ridge regression.

5. Automatic. Machine learning aims at automated inference, yet many machine learning algorithms still require substantial manual tuning. For example, if the range of α values used is not sufficient, users of ridge regression may be forced to explore other values. This is impractical in cases in which thousands of targets are analyzed and multiple models evaluated. Thus, FRR contributes to the growing set of methods that aim to automate machine learning (Zöller and Huber, 2019; Tuggener et al., 2019). These methods all aim to remove the burden of manual inspection and tuning. A major benefit of FRR is therefore practical in nature: because FRR spans the dynamic range of effects that ridge regression can provide, using FRR guarantees that the time taken to explore hyperparameter values is well spent. Moreover, the user does not have to speculate which α values might be appropriate for a given problem (e.g., is 10^4 sufficiently high?).

6. Implemented in usable open-source software. We provide code that is well-documented, thoroughly tested, and easy to use: https://github.com/nrdg/fracridge. The software is available in two popular statistical programming languages: MATLAB and Python. The Python implementation provides an object-oriented interface that complies with the popular Scikit-Learn library (Pedregosa et al., 2011; Buitinck et al., 2013).
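The guarantee described in point 1, that each requested fraction γ maps to a solution whose L2-norm is the corresponding fraction of the OLS norm, can be illustrated with a small self-contained NumPy sketch. This is not the fracridge algorithm itself (which precomputes a grid of α values and interpolates; see section 2.2): here, an illustrative per-fraction bisection search stands in for that machinery, exploiting the fact that the solution norm decreases monotonically in α.

```python
import numpy as np

def ridge_coef_norm(s, uty, alpha):
    # L2-norm of the ridge solution, computed in the SVD basis:
    # beta(alpha) = V @ diag(s / (s**2 + alpha)) @ U.T @ y,
    # and the norm is unchanged by the orthogonal matrix V.
    return np.linalg.norm(s / (s ** 2 + alpha) * uty)

def alpha_for_fraction(X, y, frac, tol=1e-8):
    """Find the alpha whose ridge solution has L2-norm equal to
    frac times the L2-norm of the OLS solution (illustrative helper)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    target = frac * ridge_coef_norm(s, uty, 0.0)  # alpha = 0 gives OLS
    lo, hi = 0.0, 1e10
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        # the norm decreases monotonically as alpha increases
        if ridge_coef_norm(s, uty, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(100)

U, s, _ = np.linalg.svd(X, full_matrices=False)
for frac in (0.25, 0.5, 0.75):
    alpha = alpha_for_fraction(X, y, frac)
    achieved = ridge_coef_norm(s, U.T @ y, alpha) / ridge_coef_norm(s, U.T @ y, 0.0)
    print(f"gamma={frac:.2f}  alpha={alpha:.3g}  achieved fraction={achieved:.4f}")
```

Because the filter factors s_i/(s_i^2 + α) shrink monotonically as α grows, the bisection is guaranteed to converge; fracridge avoids the per-target search by interpolating on a shared α grid, which is what makes it fast when many targets are fit at once.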

Limitations
One limitation of FRR is that a heuristic approach is used within the algorithm to generate the grid of α values used for interpolation (see section 2.2 for details). Nonetheless, the interpolation results are quite accurate, and costly computations are carried out only for final desired α values. Another limitation is that the α value that corresponds to a specific γ may be different for different targets and models. If there are theoretical reasons to retain the same α across targets and models, the FRR approach is not appropriate. But this would rarely be the case, as α values are usually not directly interpretable. Alternatively, FRR can be used to estimate values of α on one sample of the data (or for one model) and these values of α can then be used in all of the data (or all models).
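The α-transfer strategy just described can be sketched in a few lines of NumPy. The helper names and the bisection search are illustrative assumptions, not the fracridge implementation, which would normally supply the α values directly:

```python
import numpy as np

def ridge(X, y, alpha):
    # closed-form ridge solution: (X'X + alpha * I)^{-1} X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
Y = X @ rng.standard_normal((20, 5)) + rng.standard_normal((200, 5))

# Step 1: on one sample of the data (here, the first half) and one target,
# find the alpha that yields gamma = 0.5 (bisection stands in for FRR,
# using the fact that the solution norm decreases monotonically in alpha).
X_sub, y_sub = X[:100], Y[:100, 0]
beta_ols = np.linalg.lstsq(X_sub, y_sub, rcond=None)[0]
target = 0.5 * np.linalg.norm(beta_ols)
lo, hi = 0.0, 1e10
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if np.linalg.norm(ridge(X_sub, y_sub, mid)) > target:
        lo = mid
    else:
        hi = mid
alpha_fixed = 0.5 * (lo + hi)

# Step 2: reuse that single alpha for every target in the full dataset.
betas = np.column_stack([ridge(X, Y[:, t], alpha_fixed) for t in range(Y.shape[1])])
print(betas.shape)  # one coefficient column per target
```

This keeps a single α across all targets and models, while still letting FRR's γ parameterization determine what a sensible value of α is.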
Finally, the FRR approach is limited to ridge regression and does not generalize easily to other regularization approaches. The Lasso (Tibshirani, 1996) provides regression solutions that balance least-squares minimization with the L1-norm of the coefficients, rather than the L2-norm. The Lasso approach has several benefits, including results that are more sparse and potentially easier to interpret. Similarly, Elastic Net (Zou and Hastie, 2005) uses both L1- and L2-regularization, potentially offering more accurate solutions. But because the computational implementation of these approaches differs quite substantially from ridge regression, the approach presented in this paper does not translate easily to these methods. Moreover, while these methods allow regularization with a non-negativity constraint on the coefficients, this constraint is not easily incorporated into L2-regularization. On the other hand, a major challenge in L1-regularization is computational time: most algorithms operate on one target at a time and incur substantial computational costs, and scaling such algorithms to the thousands of targets in large-scale datasets may be difficult.

Future extensions
An important extension of the present work would be an implementation of these ideas in additional statistical programming languages, such as the R programming language, which is very popular for statistical analysis of data from many different domains. One of the most important tools for regularized regression is the glmnet software package, which was originally implemented in R (Friedman et al., 2009) and has implementations in MATLAB (Qian et al., 2013) and Python (Balakumar, 2016). The software also provides tools for analysis and visualization of coefficient paths and of the effects of regularization on cross-validated error. The R glmnet vignette (Hastie and Qian, 2014) demonstrates the use of these tools. In addition to identifying the α value that minimizes cross-validation error, glmnet also identifies the α that gives the most regularized model such that the cross-validated error is within one standard error of the minimum. This approach acknowledges that there is some error in selecting α and chooses to err on the side of a more parsimonious model (Friedman et al., 2010). Future extensions of FRR could implement this heuristic.

Figure: (A) Functional MRI measurements of brain activity were collected from a human participant while they viewed a series of natural images (10,000 distinct images, each presented three times). (B) Model of brain activity. Images were converted to grayscale and gridded, and the standard deviation of luminance values within each grid element was calculated, producing measures of local contrast. Brain responses at every voxel were modeled using a weighted sum of local contrast. (C) Results obtained using FRR. Cross-validated performance (variance explained) achieved by the model is shown for an axial brain slice (middle). These results are thresholded at 5% and superimposed on an image of brain anatomy for reference (left). The fraction (γ) corresponding to the best cross-validation performance is also shown (right). (D) Detailed results for one voxel (see green squares in panel C). The main plots, which depict training and testing performance and L2-norm, are in the same format as Figure 1. The inset illustrates coefficient solutions for different regularization levels. The orange box highlights the regularization level producing the highest cross-validation performance. (E) Detailed results for a second voxel. Same format as panel D.
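The one-standard-error heuristic from the glmnet discussion above could, for example, be applied to the cross-validation errors that FRR users already compute across the γ grid. The following sketch assumes a simple array layout (candidates × folds); the function name is ours, not glmnet's.

```python
import numpy as np

def one_se_rule(penalties, cv_errors):
    """Pick the strongest regularization whose mean CV error is within one
    standard error of the minimum (the heuristic glmnet calls 'lambda.1se').

    penalties : (k,) candidate penalty values, sorted in increasing order
    cv_errors : (k, n_folds) cross-validated error for each candidate/fold
    """
    mean = cv_errors.mean(axis=1)
    sem = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = int(np.argmin(mean))
    threshold = mean[best] + sem[best]
    # largest penalty (most parsimonious model) still within one SE of the minimum
    eligible = np.flatnonzero(mean <= threshold)
    return penalties[eligible.max()]

# Example: candidate 1.0 minimizes mean CV error, but candidate 10.0 is
# within one standard error of it, so the more regularized model is chosen.
penalties = np.array([0.1, 1.0, 10.0, 100.0])
cv_errors = np.array([[0.9, 1.1],
                      [0.8, 1.0],
                      [0.85, 1.05],
                      [1.9, 2.1]])
print(one_se_rule(penalties, cv_errors))  # -> 10.0
```

Applied to FRR, the same rule would pick the smallest γ (strongest regularization) whose cross-validated error is within one standard error of the best γ.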