Summary

In high-dimensional multivariate regression problems, enforcing low rank in the coefficient matrix offers effective dimension reduction, which greatly facilitates parameter estimation and model interpretation. However, commonly used reduced-rank methods are sensitive to data corruption, as the low-rank dependence structure between response variables and predictors is easily distorted by outliers. We propose a robust reduced-rank regression approach for joint modelling and outlier detection. The problem is formulated as a regularized multivariate regression with a sparse mean-shift parameterization, which generalizes and unifies some popular robust multivariate methods. An efficient thresholding-based iterative procedure is developed for optimization. We show that the algorithm is guaranteed to converge and that the coordinatewise minimum point produced is statistically accurate under regularity conditions. Our theoretical investigations focus on non-asymptotic robust analysis, demonstrating that joint rank reduction and outlier detection leads to improved prediction accuracy. In particular, we show that redescending ψ-functions can essentially attain the minimax optimal error rate, and in some less challenging problems convex regularization guarantees the same low error rate. The performance of the proposed method is examined through simulation studies and real-data examples.

1. Introduction

Given n observations of m response variables and p predictors, denoted by y_i ∈ ℝ^m and x_i ∈ ℝ^p for i = 1, ..., n, we consider the multivariate regression model
Y = XB* + E,
where Y = (y_1, ..., y_n)^T ∈ ℝ^{n×m}, X = (x_1, ..., x_n)^T ∈ ℝ^{n×p}, B* ∈ ℝ^{p×m} is an unknown coefficient matrix, and E = (e_1, ..., e_n)^T ∈ ℝ^{n×m} is a random error matrix. Such a high-dimensional multivariate problem, in which both p and m may be comparable to or even exceed the sample size n, has drawn increasing attention in both applied and theoretical statistics.
Conventional least-squares linear regression ignores the multivariate nature of the problem and may fail when p is large relative to n. Dimension reduction holds the key to characterizing the dependence between responses and predictors in a parsimonious way. Reduced-rank regression (Anderson, 1951; Izenman, 1975) achieves this by restricting the rank of the coefficient matrix, i.e., by solving the problem
min_{B ∈ ℝ^{p×m}} tr{(Y − XB)Γ(Y − XB)^T}   subject to r(B) ≤ r,
(1)
where tr(·) and r(·) denote trace and rank, and Γ is a prespecified positive-definite weighting matrix (Reinsel & Velu, 1998). The rank r is typically much smaller than m and p. A global solution to (1) can be obtained explicitly. See Reinsel & Velu (1998) for a comprehensive account of reduced-rank regression under the classical large-n asymptotic regime. Finite-sample theories on rank selection and estimation accuracy of the penalized form of reduced-rank regression were developed by Bunea et al. (2011). The nuclear norm and Schatten p-norms can also be used to promote sparsity of the singular values of B or XB; see Yuan et al. (2007), Koltchinskii et al. (2011), Rohde & Tsybakov (2011), Agarwal et al. (2012), Foygel et al. (2012) and Chen et al. (2013), among others. Reduced-rank regression is closely connected with principal component analysis, canonical correlation analysis, partial least squares, matrix completion, and many other multivariate methods (Izenman, 2008).

Although reduced-rank regression can substantially reduce the number of free parameters in multivariate problems, it is extremely sensitive to outliers, which are bound to occur; so in real-world data analysis, the low-rank structure could easily be masked or distorted. This is even more serious in high-dimensional or big-data applications. For example, in cancer genetics, multivariate regression is commonly used to explore the associations between genotypical and phenotypical characteristics (Vounou et al., 2010), where employing rank regularization can help to reveal latent regulatory pathways linking the two sets of variables; but pathway recovery should not be distorted by abnormal samples or subjects. As another example, financial time series, even after stationarity transformation, often contain anomalies or have heavier tails than those of a normal distribution, which may jeopardize the recovery of common market behaviours and asset return forecasting; see § 3 in the Supplementary Material.

In this work, we deem explicit outlier detection to be as important as robust low-rank estimation. Indeed, the reduced-rank component may not be of direct interest in some applications, as it often represents common background information shared across the response variables, while capturing unusual changes or jumps is helpful. The robustification of low-rank matrix estimation is nontrivial. A straightforward idea might be to use a robust loss function ρ in place of the squared error loss in (1), leading to
min_B Σ_{i=1}^n ρ{‖Γ^{1/2}(y_i − B^T x_i)‖_2}   subject to r(B) ≤ r,
(2)
but such an estimator may be difficult to compute. To the best of our knowledge, even when ρ is Huber’s loss function (Huber, 1981) there is no algorithm for solving (2), let alone when it involves nonconvex losses, which are known to be more effective in dealing with multiple gross outliers with possibly high leverage values. Another motivation is that non-asymptotic theory on the topic is limited. Classical robust analysis, ignoring the low-rank constraint, deals with either deterministic worst-case studies or large-n asymptotics with p and m held fixed, which may not meet modern needs.

We propose a novel robust reduced-rank regression method for concurrent robust modelling and outlier identification. We explicitly introduce a sparse mean-shift outlier component and formulate a shrinkage multivariate regression in place of (2), where p and/or m can be much larger than n. The proposed robust reduced-rank regression provides a general framework and includes M-estimation and principal component pursuit (Huber, 1981; Hampel et al., 2005; Zhou et al., 2010; Candès et al., 2011). All the techniques developed in this work apply to high-dimensional sparse regression with a single response. In § 2 we show that low-rank estimation can be ruined by a single rogue point, and propose a robust reduced-rank estimation framework. A universal connection between the proposed robustification and conventional M-estimation is established, regardless of the size of p, m or n. In § 3 we conduct finite-sample theoretical studies of the proposed robust estimators, with the aim of extending classical robust analysis to multivariate data with possibly large p and/or m. A computational algorithm developed in § 4 is easy to implement and leads to a coordinatewise minimum point with theoretical guarantees. Applications to real data are demonstrated in § 5. All the proofs and results of simulation studies, as well as a financial example, are given in the Supplementary Material.

The following notation will be used throughout the paper. We denote by ℕ the set of natural numbers. We use a ∧ b to denote min(a, b) and e to denote the Euler constant. Let [n] = {1, ..., n}. Given any matrix A, P_A denotes the orthogonal projection matrix onto the range of A, i.e., A(A^T A)^− A^T, where ^− stands for the Moore–Penrose pseudo-inverse. When there is no ambiguity, we also use P_A to denote the column space of A. Let ‖A‖_F denote the Frobenius norm and ‖A‖_2 the spectral norm, and let ‖A‖_0 = ‖vec(A)‖_0 = |{(i, j) : A(i, j) ≠ 0}|, with |·| denoting the cardinality of the enclosed set. For A = (a_1, ..., a_n)^T ∈ ℝ^{n×m}, ‖A‖_{2,1} = Σ_{i=1}^n ‖a_i‖_2 and ‖A‖_{2,0} = Σ_{i=1}^n 1_{a_i ≠ 0}, which gives the number of nonzero rows of A. Given J ⊂ [n], we often denote Σ_{i∈J} ‖a_i‖_2 by ‖A_J‖_{2,1}. Threshold functions are defined as follows.

 
Definition 1

(Threshold function). A threshold function is a real-valued function Θ(t; λ) defined for −∞ < t < ∞ and 0 ≤ λ < ∞ such that (i) Θ(−t; λ) = −Θ(t; λ); (ii) Θ(t; λ) ≤ Θ(t′; λ) for t ≤ t′; (iii) lim_{t→∞} Θ(t; λ) = ∞; and (iv) 0 ≤ Θ(t; λ) ≤ t for 0 ≤ t < ∞.

 
Definition 2

(Multivariate threshold function). Given any Θ, the multivariate version Θ⃗ is defined for any vector a ∈ ℝ^m by Θ⃗(a; λ) = a Θ(‖a‖_2; λ)/‖a‖_2 for a ≠ 0 and 0 otherwise. For any matrix A = (a_1, ..., a_n)^T ∈ ℝ^{n×m}, Θ⃗(A; λ) = {Θ⃗(a_1; λ), ..., Θ⃗(a_n; λ)}^T.
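For illustration, the following is a minimal numpy sketch of Definitions 1 and 2, using the soft- and hard-thresholding rules that reappear later; it is not part of the paper's implementation.

```python
import numpy as np

def soft_threshold(t, lam):
    # Soft thresholding: Theta_S(t; lambda) = sgn(t) (|t| - lambda)_+
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard_threshold(t, lam):
    # Hard thresholding: Theta_H(t; lambda) = t 1_{|t| > lambda}
    return t * (np.abs(t) > lam)

def multivariate_threshold(A, lam, theta=soft_threshold):
    # Row-wise version of Definition 2: each row a_i is mapped to
    # a_i * Theta(||a_i||_2; lambda) / ||a_i||_2, and zero rows stay zero.
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    safe = np.where(norms > 0, norms, 1.0)   # avoid division by zero
    return A * theta(norms, lam) / safe

# Example: rows with small Euclidean norm are set exactly to zero.
A = np.array([[0.1, -0.2], [3.0, 4.0]])
print(multivariate_threshold(A, lam=1.0))
```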

2. Robust reduced-rank regression

2.1 Motivation

Although reduced-rank regression is associated with a highly nonconvex problem (1), a global minimizer B̂ can be obtained in explicit form. Given any r such that 1 ≤ r ≤ min(m, q) with q = r(X),
B̂(r) = R(X, Y, Γ, r) = (X^T X)^− X^T Y Γ^{1/2} P_{V(X,Y,Γ,r)} Γ^{−1/2},
(3)
where V(X, Y, Γ, r) is formed by the leading r eigenvectors of Γ^{1/2} Y^T P_X Y Γ^{1/2}; see, e.g., Reinsel & Velu (1998) for a detailed justification. When Γ = I, we abbreviate R(X, Y, I, r) to R(X, Y, r). The reduced-rank regression estimator is denoted by B̂(r) to emphasize its dependence on the regularization parameter.
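For illustration, the closed-form solution (3) with Γ = I can be computed directly; the following is a minimal numpy sketch of R(X, Y, r), not the authors' implementation.

```python
import numpy as np

def rrr(X, Y, r):
    # Reduced-rank regression R(X, Y, r) with Gamma = I:
    # project Y onto the column space of X, then keep the leading r
    # right singular directions of the fitted values P_X Y.
    Bols = np.linalg.pinv(X.T @ X) @ (X.T @ Y)      # (X^T X)^- X^T Y
    fitted = X @ Bols                               # P_X Y
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V = Vt[:r].T                                    # leading r eigenvectors of Y^T P_X Y
    return Bols @ V @ V.T                           # coefficient matrix of rank <= r

# Quick check on synthetic low-rank data
rng = np.random.default_rng(0)
n, p, m, r = 50, 8, 6, 2
B_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, m))
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.1 * rng.normal(size=(n, m))
B_hat = rrr(X, Y, r)
print(np.linalg.matrix_rank(B_hat), np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true))
```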
Outliers are unavoidable in real data. We define the finite-sample breakdown point for an arbitrary estimator B̂ in the spirit of Donoho & Huber (1983): given finite data (X,Y,Γ) and an estimator B̂(X,Y,Γ), its breakdown point is
ϵ*(B̂) = (1/n) min{k ∈ ℕ ∪ {0} : sup_{Ỹ ∈ ℝ^{n×m}: ‖Ỹ − Y‖_0 ≤ k} ‖X B̂(X, Ỹ, Γ)‖_F = +∞}.
In addition to the reduced-rank regression estimator B̂(r), we also consider a general low-rank estimator obtained by imposing a penalty on the singular values
B̂(λ) ∈ arg min_B (1/2) tr{(Y − XB)Γ(Y − XB)^T} + Σ_{s=1}^{p∧m} P(σ_s(BΓ^{1/2}); λ).
(4)
Here λ is a regularization parameter and the σ_s(BΓ^{1/2}) are the singular values of BΓ^{1/2}. The penalty P is constructed from an arbitrary thresholding rule Θ(·; λ) by
P(t; λ) − P(0; λ) = P_Θ(t; λ) + q(t; λ),   P_Θ(t; λ) = ∫_0^{|t|} [sup{s : Θ(s; λ) ≤ u} − u] du,
(5)
for some nonnegative q(·; λ) satisfying q{Θ(s; λ); λ} = 0 for all s.
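For instance, for the soft-thresholding rule Θ_S(s; λ) = sgn(s)(|s| − λ)_+, we have sup{s : Θ_S(s; λ) ≤ u} = u + λ for u ≥ 0, so taking q ≡ 0 in (5) gives P_Θ(t; λ) = ∫_0^{|t|} λ du = λ|t|, the familiar ℓ_1 penalty; nonconvex rules such as hard thresholding are handled in exactly the same way.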
 
Theorem 1

Given any finite (X, Y, Γ) and r ≥ 1 with Γ positive definite and X ≠ 0, let B̂(r) be a reduced-rank regression estimator which solves (1). Then its finite-sample breakdown point is exactly 1/n. Furthermore, for any B̂(λ) as in (4), ϵ*{B̂(λ)} = 1/n still holds for any finite value of λ.

The result indicates that a single outlier can completely ruin low-rank matrix estimation, whether one applies a rank constraint or, say, a Schatten p-norm penalty. This limits the use of ordinary rank reduction in big-data applications. Because directly using a robust loss function under the low-rank constraint, as in (2), may result in nontrivial computational and theoretical challenges, we apply a novel additive robustification, motivated by She & Owen (2011).

2.2 The additive framework

We introduce a multivariate mean-shift regression model to explicitly encompass outliers:
Y = XB* + C* + E,
(6)
where B* ∈ ℝ^{p×m} gives the matrix of coefficients, C* ∈ ℝ^{n×m} describes the outlying effects on Y, and E ∈ ℝ^{n×m} has independently and identically distributed rows following N(0, Σ). Obviously, this leads to an overparameterized model, so we must regularize the unknown matrices appropriately. We assume that B* has low rank and C* is a sparse matrix with only a few nonzero entries, because outliers are inconsistent with the majority of the data. Given a positive-definite weighting matrix Γ, we propose the robust reduced-rank regression problem
min_{B,C} (1/2) tr{(Y − XB − C)Γ(Y − XB − C)^T} + P(C; λ)   subject to r(B) ≤ r.
(7)
Here P(·; λ) is a sparsity-promoting penalty function, with λ adjusting the amount of shrinkage; it can also be replaced by a constraint, as in (12). The following form of P can handle elementwise outliers:
P(C; λ) = Σ_{i=1}^n Σ_{k=1}^m P(|c_{ik}|; λ).
(8)
It is more common in robust statistics to assume outlying samples, or outlying rows in (Y,X), which corresponds to
P(C; λ) = Σ_{i=1}^n P(‖c_i‖_2; λ),
where c_i^T is the ith row vector of C. Unless otherwise specified, we consider row-wise outliers. But all our algorithms and analyses can handle elementwise outliers after simple modification.
In the literature on reduced-rank regression, it is common to regard the weighting matrix Γ as known (Reinsel & Velu, 1998; Yuan et al., 2007; Izenman, 2008). The choice of Γ is flexible and is usually based on a pilot covariance estimate Σ̂. For example, it can be Σ̂^{−1} when Σ̂ is nonsingular, or a regularized version (Σ̂ + δI)^{−1} for some δ > 0. Although it sounds intriguing to consider jointly estimating the high-dimensional mean and the even higher-dimensional covariance matrix in the presence of outliers, this is beyond the scope of the present paper. When a reliable estimate of Σ is unavailable, a standard practice in finance and econometric forecasting is to reduce Γ to a diagonal matrix or, equivalently, an identity matrix after robustly scaling the response variables. For ease of presentation, we shall take Γ to be the identity matrix, unless otherwise noted, and mainly focus on the following robust reduced-rank regression criterion:
min_{B,C} (1/2)‖Y − XB − C‖_F^2 + P(C; λ)   subject to r(B) ≤ r.
(9)

We show that the proposed additive outlier characterization indeed comes with a robustness guarantee and, interestingly, generalizes M-estimation to the multivariate rank-deficient setting. We write Y = (y_1, ..., y_n)^T and C = (c_1, ..., c_n)^T.

 
Theorem 2
(i) Suppose that Θ(·; λ) is an arbitrary thresholding rule satisfying Definition 1, and let P be any penalty associated with Θ through (5). Consider
min_{B,C} (1/2)‖Y − XB − C‖_F^2 + Σ_{i=1}^n P(‖c_i‖_2; λ)   subject to r(B) ≤ r.
(10)
For any fixed B, a globally optimal solution for C is C(B) = Θ⃗(Y − XB; λ). By profiling out C with C(B), (10) can be expressed as an optimization problem with respect to B only, and it is equivalent to the robust M-estimation problem
min_B Σ_{i=1}^n ρ(‖y_i − B^T x_i‖_2; λ)   subject to r(B) ≤ r,
(11)
where the robust loss function ρ is given by
ρ(t; λ) = ∫_0^{|t|} ψ(u; λ) du,   ψ(t; λ) = t − Θ(t; λ).
(ii) Given ϱ ∈ {0, 1, ..., n}, consider
min_{B,C} (1/2)‖Y − XB − C‖_F^2   subject to r(B) ≤ r, ‖C‖_{2,0} ≤ ϱ.
(12)
Similarly, (12), after profiling out C, can be expressed as an optimization problem with respect to B only, and it is equivalent to the rank-constrained trimmed least-squares problem
min_B (1/2) Σ_{i=1}^{n−ϱ} r_{(i)}^2   subject to r(B) ≤ r,   r_i = ‖y_i − B^T x_i‖_2,
where r_{(1)}, ..., r_{(n)} are the order statistics of r_1, ..., r_n satisfying |r_{(1)}| ≤ ... ≤ |r_{(n)}|.
 
Remark 1
Theorem 2 connects P to ρ through Θ. As is well known, changing the squared error loss to a robust loss amounts to designing a set of multiplicative weights for y_i − B^T x_i (i = 1, ..., n). Our additive robustification achieves the same robustness but leaves the original loss function untouched. The connection is also valid in the case of elementwise outliers, with P and ρ applied in an elementwise manner. In fact, the identity constructed in Lemma 2 of the Supplementary Material,
(1/2){r − Θ(r; λ)}^2 + P_Θ{Θ(r; λ); λ} = ∫_0^{|r|} ψ(t; λ) dt   (r ∈ ℝ),
implies that the equivalence holds much more generally, with B subject to an arbitrary constraint or penalty, regardless of the number of response variables and the number of predictors. This extends the main result of She & Owen (2011) to multiple-response models with p possibly larger than n.
 
Remark 2

Theorem 2 holds for all thresholding rules, and commonly used convex and nonconvex penalties are all covered by (5). For example, the convex group ℓ_1 penalty λ‖c_i‖_2 is associated with the soft-thresholding rule Θ_S(s; λ) = sgn(s)(|s| − λ)_+. The group ℓ_0 penalty (λ^2/2) Σ_{i=1}^n 1_{‖c_i‖_2 ≠ 0} can be obtained from (5) with the hard-thresholding rule Θ_H(s; λ) = s 1_{|s| > λ} and with q(t; λ) = 0.5(λ − |t|)^2 1_{0 < |t| < λ}. Our Θ-P coupling framework also covers the ℓ_p penalties for 0 < p < 1, the smoothly clipped absolute deviation penalty (Fan & Li, 2001), the minimax concave penalty (Zhang, 2010a), and the capped ℓ_1 penalty (Zhang, 2010b) as particular instances; see She (2012).

 
Remark 3

The universal link between (10) and (11) provides insight into the choice of regularization. It is easy to verify that the ℓ_1-norm penalty commonly used in variable selection leads to Huber's loss, which is prone to masking and swamping and may fail even when moderately leveraged outliers occur. To handle gross outliers, redescending ψ-functions are often advocated, which amounts to using nonconvex penalties in (10). For example, Hampel's three-part ψ (Hampel et al., 2005) can be shown to yield Fan and Li's smoothly clipped absolute deviation penalty; the skipped mean ψ corresponds to the exact ℓ_0 penalty; and rank-constrained least trimmed squares can be rephrased as the ℓ_0-constrained form in (12). Our approach not only provides a unified way to robustify low-rank matrix estimation but also facilitates theoretical analysis and computation of reduced-rank M-estimators in high dimensions.
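To make the correspondence concrete, the following minimal numpy sketch (an illustration only, not part of the paper) recovers ρ numerically from a given Θ via ψ(t; λ) = t − Θ(t; λ); the soft rule reproduces Huber's loss, while the hard rule gives the skipped-mean loss, whose ψ vanishes beyond λ.

```python
import numpy as np

def rho_from_theta(theta, t, lam, n_grid=2001):
    # rho(t; lam) = integral_0^{|t|} psi(u; lam) du, with psi(u) = u - Theta(u; lam)
    u = np.linspace(0.0, abs(t), n_grid)
    psi = u - theta(u, lam)
    return np.sum(0.5 * (psi[1:] + psi[:-1]) * np.diff(u))  # trapezoid rule

soft = lambda u, lam: np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)   # -> Huber's psi
hard = lambda u, lam: u * (np.abs(u) > lam)                           # -> skipped mean (redescending)

lam = 1.0
for t in (0.5, 2.0, 5.0):
    huber = 0.5 * t ** 2 if t <= lam else lam * t - 0.5 * lam ** 2
    skipped = 0.5 * min(t ** 2, lam ** 2)
    # printed pairs agree with the closed forms up to discretization error
    print(t, rho_from_theta(soft, t, lam), huber, rho_from_theta(hard, t, lam), skipped)
```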

2.3 Connections and extensions

Before we dive into theoretical study, it is worth pointing out some connections and extensions of the proposed framework. First, one can set Γ equal to the inverse covariance matrix of the response variables to perform robust canonical correlation analysis; see Reinsel & Velu (1998). Although we mainly focus on the rank-constrained form, there is no difficulty in extending our discussion to
min_{B,C} (1/2)‖Y − XB − C‖_F^2 + Σ_{s=1}^{p∧m} P_B(σ_s(B); λ_B) + P_C(C; λ_C),
(13)
where the σ_s(B) are the singular values of B, and P_B and P_C are sparsity-inducing penalties.

Our robust reduced-rank regression subsumes a special but important case, Y = B* + C* + E. This problem is perhaps less challenging than its supervised counterpart, but has wide applications in computer vision and machine learning (Wright et al., 2009; Candès et al., 2011).

Finally, our method can be extended to reduced-rank generalized linear models; see, for example, Yee & Hastie (2003) and She (2013) for computational details. In these scenarios, directly robustifying the loss can be messy, but a sparse outlier term can always be introduced without altering the form of the given loss, so that many algorithms designed for fitting ordinary generalized linear models can be seamlessly applied.

3. Non-asymptotic robust analysis

Theorem 2 provides robustness and some helpful intuition for the proposed method, but it might not be enough from a theoretical point of view. For example, can one justify the need for robustification in estimating a matrix of low rank? Is using redescending ψ-functions still preferable in rank-deficient settings? Unlike in traditional robust analysis, we cannot assume an infinite sample size and a fixed number of predictors or response variables, because p and/or m can be much larger than n in modern applications. Conducting non-asymptotic robust analysis would be desirable. The finite-sample results in this section contribute to this type of robust analysis.

For simplicity we assume that the model is given by Y = XB* + C* + E, where E has independent and identically distributed N(0, σ^2) entries, and consider the robust reduced-rank regression problem defined in (9). The noise distribution can be more general. For example, in all of the following theorems except Theorem 5, E can be sub-Gaussian. Given an estimator (B̂, Ĉ), we focus on its prediction accuracy measured by M(B̂ − B*, Ĉ − C*), where
M(B, C) = ‖XB + C‖_F^2.

This predictive learning perspective is always legitimate in evaluating the performance of an estimator, and requires no signal strength or model uniqueness assumptions. The ℓ_2-recovery of M(B̂ − B*, Ĉ − C*) is fundamental, and such a bound, together with additional regularity assumptions, can easily be adapted to obtain estimation error bounds in different norms as well as selection consistency (Ye & Zhang, 2010; Lounici et al., 2011); see Theorem 10 in the Supplementary Material, for instance. Given a penalty function P or, equivalently, a robust loss ρ, we will study the performance of the set of global minimizers to show the ultimate power of the associated method; but our techniques of proof apply more generally (see, e.g., Theorem 7).

For any C = (c_1, ..., c_n)^T, define
𝒥(C) = {i : c_i ≠ 0},   J(C) = |𝒥(C)| = ‖C‖_{2,0}.

We let r*=r(B*) denote the rank of the true coefficient matrix, and J*=J(C*) the number of nonzero rows in C*, i.e., the number of outliers. Let q=r(X).

To handle problems in arbitrary dimensions, we construct some finite-sample oracle inequalities (Donoho & Johnstone, 1994). The first theorem considers a general penalty P(C; λ) = Σ_{i=1}^n P(‖c_i‖_2; λ). Here we assume that P(·; λ) takes λ as the threshold parameter and satisfies
P(0; λ) = 0,   P(t; λ) ≥ P_H(t; λ),
(14)
where P_H(t; λ) = (−t^2/2 + λ|t|) 1_{|t| < λ} + (λ^2/2) 1_{|t| ≥ λ}. The latter inequality is natural in view of (5), because a shrinkage estimator with λ as the threshold is always bounded above by the hard-thresholding function Θ_H(·; λ). From Theorem 2, (14) covers all ψ-functions bounded below by the skipped mean ψ_H(s; λ) = s 1_{|s| ≤ λ} for any s ≥ 0.
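As a check, plugging the hard-thresholding rule Θ_H into (5) with q ≡ 0 gives sup{s : Θ_H(s; λ) ≤ u} = λ for 0 ≤ u < λ and u for u ≥ λ, so P_{Θ_H}(t; λ) = ∫_0^{|t|} (λ − u)_+ du, which equals −t^2/2 + λ|t| for |t| < λ and λ^2/2 for |t| ≥ λ, i.e., exactly P_H.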
 
Theorem 3
Let λ = Aσ(m + log n)^{1/2} with A a constant, and let (B̂, Ĉ) be a global minimizer of (9). Then, for any sufficiently large A, the following oracle inequality holds for any (B, C) ∈ ℝ^{p×m} × ℝ^{n×m} satisfying r(B) ≤ r:
E{M(B̂ − B*, Ĉ − C*)} ≲ M(B − B*, C − C*) + σ^2 (q + m) r + P(C; λ) + σ^2,
(15)
where ≲ means that the inequality holds up to a multiplicative constant.
 
Corollary 1
Under the same conditions as in Theorem 3, if r ≥ 1 and P is a bounded nonconvex penalty satisfying P(t; λ) ≲ λ^2 for any t, then
E{M(B̂ − B*, Ĉ − C*)} ≲ inf_{(B,C): r(B) ≤ r} {M(B − B*, C − C*) + σ^2 (q + m) r + σ^2 J(C) m + σ^2 J(C) log n}.
(16)
 
Remark 4
Both (15) and (16) involve a bias term M(B-B*, C-C*). Upon setting r=r*, B=B* and C=C* in, say, (16), we obtain a prediction error bound of the order
σ^2 (q + m) r* + σ^2 J* (m + log n).
(17)

On the other hand, the presence of the bias term ensures applicability of robust reduced-rank regression to weakly sparse C*, and similarly r may also deviate from r* to some extent, as a benefit derived from the bias-variance trade-off.

 
Remark 5

Our scheme of proof can also be used to show similar conclusions for the doubly penalized form (13) and the doubly constrained form (12), under the general assumption that the noise matrix has sub-Gaussian marginal tails. The following theorem states the result for (12), which is one of our favoured forms in practical data analysis.

 
Theorem 4
Let (B̂, Ĉ) be a solution to (12). With the convention 0 log 0 = 0, we have
E{M(B̂ − B*, Ĉ − C*)} ≲ inf_{r(B) ≤ r, J(C) ≤ ϱ} M(B − B*, C − C*) + σ^2{(q + m) r + ϱ m + ϱ log(en/ϱ)} + σ^2.

Theorem 4 reveals some breakdown point information as a by-product. Specifically, fixing Ȳ = XB*, we contaminate Y in the set B(ϱ) = {Y ∈ ℝ^{n×m} : Y = Ȳ + C + E, ‖C‖_{2,0} ≤ ϱ}, where vec(E) is sub-Gaussian and ϱ ∈ ℕ ∪ {0}. Given any estimator (B̂, Ĉ) which implicitly depends on Y, we define its risk-based finite-sample breakdown point by ϵ*(B̂, Ĉ) = (1/n) × min{ϱ : sup_{Y ∈ B(ϱ)} E{M(B̂ − B*, Ĉ − C*)} = +∞}, where the randomness of the estimator is well accounted for by taking the expectation. Then, for the estimator defined by (12), it follows from Theorem 4 that ϵ* ≥ (ϱ + 1)/n.

We emphasize that neither Theorem 3 nor Theorem 4 places any requirement on X, in contrast to Theorem 6 below.

 
Remark 6
The benefit of applying a redescending ψ is clearly shown by Theorem 3. As an example, for Huber's ψ, which corresponds to the popular convex ℓ_1 penalty due to Theorem 2, P(C; λ) on the right-hand side of (15) is unbounded, while Hampel's three-part ψ gives a finite rate as seen in (16). Furthermore, we show that in a minimax sense, the error rate obtained in Corollary 1 is essentially optimal. Consider the signal class
S(r, J) = {(B*, C*) : r(B*) ≤ r, J(C*) ≤ J}   (1 ≤ r ≤ q ∧ m, 1 ≤ J ≤ n/2).

Let ℓ(·) be a nondecreasing loss function with ℓ(0) = 0, ℓ ≢ 0.

 
Theorem 5
Let Y = XB* + C* + E, where E has independently and identically distributed N(0, σ^2) entries. Assume that n ≥ 2, 1 ≤ J ≤ n/2, 1 ≤ r ≤ q ∧ m, r(q + m − r) ≥ 8, and σ_min(X)/σ_max(X) is a positive constant, where σ_max(X) and σ_min(X) denote the largest and smallest nonzero singular values of X, respectively. Then there exist positive constants c̃ and c, depending on ℓ(·) only, such that
inf_{(B̂,Ĉ)} sup_{(B*,C*) ∈ S(r,J)} E(ℓ[M(B̂ − B*, Ĉ − C*)/{c̃ P_o(J, r)}]) ≥ c > 0,
where (B̂, Ĉ) denotes any estimator of (B*, C*) and
P_o(J, r) = σ^2{r(q + m) + Jm + J log(en/J)}.

We give some examples of ℓ to illustrate the conclusion. Using the indicator function ℓ(u) = 1_{u ≥ 1}, for any estimator (B̂, Ĉ), M(B̂ − B*, Ĉ − C*) ≳ σ^2{r(q + m) + Jm + J log(en/J)} holds with positive probability. For ℓ(u) = u, Theorem 5 shows that the risk E{M(B̂ − B*, Ĉ − C*)} is bounded from below by P_o(J, r) up to some multiplicative constant. Therefore, (17) attains the minimax optimal rate up to a mild logarithmic factor, showing the advantage of utilizing redescending ψ-functions in robust low-rank estimation. The analysis is non-asymptotic and applies to any n, p and m.

Convex methods are however sometimes useful. In some less challenging problems, where some incoherence regularity condition is satisfied by the augmented design matrix, Huber’s ψ can achieve the same low error rate. The result of the following theorem can be extended to any subadditive penalties with the associated ψ sandwiched between Huber’s ψ and ψH.

 
Theorem 6
Let (B̂, Ĉ) = arg min_{(B,C)} ‖Y − XB − C‖_F^2/2 + λ‖C‖_{2,1} subject to r(B) ≤ r, with λ = Aσ(m + log n)^{1/2}, where A is a large enough constant. Then
E{M(B̂ − B*, Ĉ − C*)} ≲ M(B − B*, C − C*) + σ^2 + σ^2 (q + m) r + σ^2 K^2 J(C)(m + log n)
(18)
for any (B, C) with r(B) ≤ r, provided that, with J = 𝒥(C), X satisfies (1 + ϑ)‖C_J‖_{2,1} ≤ ‖C_{J^c}‖_{2,1} + K|J|^{1/2}‖(I − P_r)C‖_F for all C and all P_r such that P_r ⊆ P_X and r(P_r) ≤ 2r, where K ≥ 0 and ϑ is a positive constant.

Compared with (16), (18) has an additional factor of K on the right-hand side. Some numerical experiments on the magnitude of K, presented in the Supplementary Material, show that the error bound obtained in Theorem 6 is comparable to (16) in some settings. Also, under a different regularity condition, an estimation error bound on B* can be obtained. See the Supplementary Material for more details.

 
Remark 7
The results obtained can be used to argue the necessity of robust estimation when outliers occur. Similar to Theorem 3, we can show that the ordinary reduced-rank regression, which sets Ĉ = 0, satisfies
E{M(B̂ − B*, Ĉ − C*)} ≲ inf_{r(B) ≤ r} ‖XB − (XB* + C*)‖_F^2 + σ^2 (q + m) r + σ^2.
(19)
Taking r = r*, the error bound of the reduced-rank regression, evaluated at the optimal B satisfying XB = XB* + P_{XB*} C* and r(B) ≤ r, is of order
σ^2 (q + m) r* + ‖(I − P_{XB*}) C*‖_F^2.
(20)

Because XB* has low rank, I − P_{XB*} is not null in general. Notable outliers that can affect the projection subspace in performing rank reduction tend to occur in the orthogonal complement of the range of XB*, and so (20) can be arbitrarily large, which echoes the deterministic breakdown-point conclusion in Theorem 1.

To control the size of the bias term, a better way is to use a larger rank value in the presence of outliers. Concretely, setting B = B* + (X^T X)^− X^T C* in (19) yields
σ^2 J* q + σ^2 J* m + σ^2 (q + m) r* + ‖(I − P_X) C*‖_F^2,
(21)
where we have used r(B) ≤ r* + J*. When p > n we have P_X = I, and so (21) offers an improvement over (20) by giving a finite error rate of σ^2 J* q + σ^2 J* m + σ^2 (q + m) r*. But our robust reduced-rank regression guarantees a consistently lower rate of σ^2 J* log n + σ^2 J* m + σ^2 (q + m) r*, since σ^2 J* q ≳ σ^2 J* log n. The performance gain can be dramatic in big-data applications, where the design matrix is huge and typically multiple outliers are bound to occur.

4. Computation and tuning

In this section we show that compared with the M-characterization in Theorem 2, the additive formulation (6) simplifies computation and parameter tuning. Let us consider a penalized form of the robust reduced-rank regression problem
min_{B,C} F(B, C) = (1/2)‖Y − XB − C‖_F^2 + Σ_{i=1}^n P(‖c_i‖_2; λ)   subject to r(B) ≤ r.
(22)

The penalties of interest may be nonconvex in light of the theoretical results in § 3, as stringent incoherence assumptions associated with convex penalties can be much relaxed or even removed. Assuming that P is constructed by (5), a simple procedure for solving (22) is as described in Algorithm 1, where the two matrices C and B are alternately updated with the other held fixed until convergence. Here the multivariate thresholding Θ⃗ is defined based on Θ; cf. Definitions 1 and 2.

 
Algorithm 1

A robust reduced-rank regression algorithm.

Input X, Y, C^(0), B^(0), Θ⃗, t = 0

Repeat

  • (a) t ← t + 1

  • (b) C^(t+1) ← Θ⃗(Y − XB^(t); λ)

  • (c) B^(t+1) ← R(X, Y − C^(t+1), r), as defined in (3)

Until convergence

Step (b) performs simple multivariate thresholding operations and Step (c) performs reduced-rank regression on the adjusted response matrix Y − C^(t+1). We need not explicitly compute B to update C in the iterative process. In fact, we need only XB^(t), which depends on X through P_X, or I when p ≥ n. The eigenvalue decomposition called in (3) has low computational complexity because the rank values of practical interest are often small. Algorithm 1 is simple to implement and cost-effective. For example, even for p = 1200 and n = m = 100, it takes only about 40 seconds to compute a whole solution path for a two-dimensional grid of 100 λ values and 10 rank values.
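A minimal numpy sketch of Algorithm 1 for criterion (22) with Γ = I is given below, reusing the rrr and multivariate_threshold helpers sketched earlier; it is an illustration only, and the initialization and stopping rule are simplistic choices rather than those of the paper.

```python
import numpy as np

def robust_rrr(X, Y, r, lam, theta=soft_threshold, max_iter=500, tol=1e-6):
    # Alternate the two block updates of Algorithm 1:
    #   step (b): C <- row-wise thresholding of the current residuals Y - X B
    #   step (c): B <- reduced-rank regression of the adjusted responses Y - C on X
    n, m = Y.shape
    B = np.zeros((X.shape[1], m))
    C = np.zeros((n, m))
    for _ in range(max_iter):
        C_new = multivariate_threshold(Y - X @ B, lam, theta)   # step (b)
        B_new = rrr(X, Y - C_new, r)                            # step (c)
        converged = np.linalg.norm(X @ (B_new - B)) + np.linalg.norm(C_new - C) < tol
        B, C = B_new, C_new
        if converged:
            break
    return B, C
```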

 
Theorem 7

Let Θ be an arbitrary thresholding rule, and let F be as defined in (22), where P is associated with Θ through (5). Then, given any λ ≥ 0 and r ≥ 0, the proposed algorithm has the property that F(B^(t), C^(t)) ≥ F(B^(t+1), C^(t+1)) for all t, and so F(B^(t), C^(t)) converges as t → ∞. Furthermore, under the assumptions that Θ⃗(·; λ) is continuous in the closure of {Y − XB^(t)} and {B^(t)} is uniformly bounded, any accumulation point of (B^(t), C^(t)) is a coordinatewise minimum point, and is a stationary point when q(·; λ) ≡ 0; hence F(B^(t), C^(t)) converges monotonically to F(B*, C*) for some coordinatewise minimum point (B*, C*).

The algorithm can be slightly modified to deal with (8), (12) and (13). For example, we can replace Θ⃗ by Θ, applied componentwise, to handle elementwise outliers. The ℓ_0-penalized form with P(C; λ) = (λ^2/2)‖C‖_{2,0}, as well as the constrained form (12), will be used in data analysis and simulation. In implementation, they correspond to applying hard-thresholding and quantile-thresholding operators (She et al., 2013).
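For the constrained form (12), the C-update given B simply keeps the ϱ rows of the current residual matrix with the largest ℓ_2 norms and zeroes out the rest; a minimal sketch of such a quantile-thresholding operator, with a hypothetical helper name, is as follows.

```python
import numpy as np

def quantile_threshold_rows(R, keep):
    # Keep the `keep` rows of R with the largest l2 norms; set the others to zero.
    norms = np.linalg.norm(R, axis=1)
    C = np.zeros_like(R)
    if keep > 0:
        idx = np.argsort(norms)[-keep:]
        C[idx] = R[idx]
    return C
```

Substituting this operator for step (b) of Algorithm 1 yields the ℓ_0-constrained fit (12), since it solves the C-subproblem exactly for a fixed B.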

In common with most high-breakdown algorithms in robust statistics, we recommend using the multi-sampling iterative strategy (Rousseeuw & van Driessen, 1999). In many practical applications, however, we have found that the initial values can be chosen rather freely. Indeed, Theorem 8 shows that if the problem is regular, our algorithm guarantees low statistical error even without the multi-start strategy.

In the following theorem, given Θ, define L_Θ = 1 ∧ ess inf{dΘ^{-1}(u; λ)/du : u ≠ 0}, where ess inf is the essential infimum. By definition, L_Θ ≤ 1. We use P_{2,Θ}(C; λ) as shorthand for Σ_{i=1}^n P_Θ(‖c_i‖_2; λ), and set r = (1 + α)r* with α ≥ 0 and r* ≥ 1.

 
Theorem 8

Let (B̂, Ĉ) be any solution satisfying B̂ = R(X, Y − Ĉ, r) and Ĉ = Θ⃗(Y − XB̂; λ), with B̂ of rank r and Θ⃗ continuous at Y − XB̂. Let Θ be associated with a bounded nonconvex penalty as described in Corollary 1 and let λ = Aσ(m + log n)^{1/2} with A a large enough constant. Assume that (1 + α)^{1/2}‖XB − XB*‖_F^2 + L_Θ‖C − C*‖_F^2 + ϑ P_{2,H}(C − C*; λ) ≤ (2 − δ) M(B − B*, C − C*) + 2 P_{2,Θ}(C; λ) + ζ P_{2,0}(C; λ) holds for all (B, C) satisfying r(B) ≤ r, where ζ ≥ 0, δ > 0 and ϑ > 0 are constants. Then E{M(B̂ − B*, Ĉ − C*)} ≲ σ^2 (1 + α)(q + m) r* + σ^2 J* m + σ^2 J* log n.

To choose an optimal rank for B and an optimal row support for C jointly, crossvalidation would seem to be an option. However, it lacks theoretical support in the robust low-rank setting, and for large-scale problems crossvalidation can be quite expensive. Motivated by Theorem 5, we propose the predictive information criterion
log‖Y − XB − C‖_F^2 + (1/(mn))[A_1{Jm + (m + q − r)r} + A_2 J log(en/J)],
(23)
where ‖Y − XB − C‖_F^2 is the residual sum of squared errors, r = r(B), J = ‖C‖_{2,0}, and e denotes the Euler constant. The term Jm + (m + q − r)r counts the degrees of freedom of the obtained model, and J log(en/J) characterizes the risk inflation. The benefits of the criterion include that no noise scale parameter needs to be estimated, and minimizing (23) achieves the minimax optimal error rate when the true model is parsimonious, as shown below.
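A direct transcription of (23), assuming numpy and a fitted pair (B, C); the constants A_1 and A_2 are left as arguments, with defaults set to the values chosen at the end of this section.

```python
import numpy as np

def predictive_information_criterion(X, Y, B, C, A1=7.0, A2=2.0):
    # Criterion (23): log RSS + [A1 {J m + (m + q - r) r} + A2 J log(e n / J)] / (m n)
    n, m = Y.shape
    q = np.linalg.matrix_rank(X)
    r = np.linalg.matrix_rank(B)
    J = int(np.sum(np.linalg.norm(C, axis=1) > 0))   # number of flagged outlying rows
    rss = np.linalg.norm(Y - X @ B - C) ** 2
    inflation = A2 * J * np.log(np.e * n / J) if J > 0 else 0.0   # 0 log 0 = 0 convention
    return np.log(rss) + (A1 * (J * m + (m + q - r) * r) + inflation) / (m * n)
```

In practice the criterion would be evaluated over a two-dimensional grid of tuning values, such as the (λ, r) grid mentioned in the timing example of § 4, and the pair with the smallest value retained.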
 
Theorem 9

Let P(B, C) = Jm + (m + q − r)r + J log(en/J), where r = r(B) and J = ‖C‖_{2,0}. Suppose that the true model is parsimonious in the sense that P(B*, C*) < mn/A_0 for some constant A_0 > 0. Let δ(B, C) = A P(B, C)/(mn), where A is a positive constant satisfying A < A_0, and so δ(B*, C*) < 1. Then, for sufficiently large values of A_0 and A, any (B̂, Ĉ) that minimizes log‖Y − XB − C‖_F^2 + δ(B, C) subject to δ(B, C) < 1 satisfies M(B̂ − B*, Ĉ − C*) ≲ σ^2{J* m + (m + q − r*)r* + J* log(en/J*)} with probability at least 1 − c_1 n^{−c_1'} − c_2 exp(−c_2' mn) for some constants c_1, c_1', c_2, c_2' > 0.

Based on computer experiments, we set A_1 = 7 and A_2 = 2.

5. Arabidopsis thaliana data

We performed extensive simulation studies to compare our method with some classical robust multivariate regression approaches and several reduced-rank methods (Reinsel & Velu, 1998; Tatsuoka & Tyler, 2000; Aelst & Willems, 2005; Roelant et al., 2009; Bunea et al., 2011; Mukherjee & Zhu, 2011) in both low and high dimensions. The results are reported in the Supplementary Material; our robust reduced-rank regression shows performance comparable or superior to the other methods in terms of both prediction and outlier detection.

Isoprenoids are diverse and abundant compounds in plants, where they serve many important biochemical functions and play roles in respiration, photosynthesis and the regulation of growth and development. To examine the regulatory control mechanisms in gene networks for isoprenoids in Arabidopsis thaliana, a genetic association study was conducted, with n=118 GeneChip microarray experiments performed to monitor gene expression levels under various experimental conditions (Wille et al., 2004). It was experimentally verified that strong connections exist between some downstream pathways and two isoprenoid biosynthesis pathways. We therefore considered a multivariate regression set-up, with the expression levels of p=39 genes from the two isoprenoid biosynthesis pathways serving as predictors, and the expression levels of m=62 genes from four downstream pathways, namely plastoquinone, carotenoid, phytosterol and chlorophyll, serving as the responses.

Because of the small sample size relative to the number of unknowns, we applied robust reduced-rank regression with the predictive information criterion for parameter tuning. The final model is of rank five, which means that the effective number of unknowns is reduced by about 80% compared with the least-squares model. Interestingly, our method also identified two outliers, samples 3 and 52. Figure 1 shows the detection paths by plotting the ℓ_2-norm of each row in the C estimates for a sequence of λ values. The two unusual samples are distinctive. The outlyingness could be caused by different experimental conditions. In particular, sample 3 was the only sample with Arabidopsis tissue culture in a baseline experiment. The two outliers have a surprisingly large impact on both coefficient estimation and model prediction. This can be seen from ‖B̂ − B̃‖_F/‖B̃‖_F ≈ 50% and ‖XB̂ − XB̃‖_F/‖XB̃‖_F ≈ 26%, where B̂ and B̃ denote, respectively, the robust reduced-rank regression and the plain reduced-rank regression estimates. In addition, Fig. 1 reveals that sample 27 could be a potential outlier meriting further investigation.

Fig. 1.

Arabidopsis thaliana data: outlier detection paths obtained by the robust reduced-rank regression. Sample 3 and sample 52 are captured as outliers, whose paths are shown as a dotted line and a dashed line, respectively. The path plot also suggests sample 27 as a potential outlier.

The low-rank model obtained reveals robust score variables, or factors, constructed from the isoprenoid biosynthesis pathways, in response to the 62 genes in the four downstream pathways. Let X̃ denote the design matrix after removing the two detected outliers, and let ÛD̂V̂^T be the singular value decomposition of X̃B̂. Then Û delivers five orthogonal factors, and V̂D̂ gives the associated factor coefficients. Figure 2 plots the coefficients of the three leading factors for all 62 response variables. Given the sth factor (s = 1, 2, 3), the genes are grouped into the four pathways separated by vertical lines, and two horizontal lines are placed at heights ±σ_s(X̃B̂) m^{−1/2}. Therefore, the genes located beyond those two horizontal lines have relatively large-magnitude coefficients on the corresponding factor.
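The factor construction just described amounts to a singular value decomposition of the robust fitted values; a minimal numpy sketch follows, with hypothetical inputs X_tilde (the design matrix without the detected outliers) and B_hat (the robust estimate).

```python
import numpy as np

def extract_factors(X_tilde, B_hat, tol=1e-8):
    # SVD of the robust fitted values: X_tilde B_hat = U D V^T
    U, d, Vt = np.linalg.svd(X_tilde @ B_hat, full_matrices=False)
    r = int(np.sum(d > tol))                   # numerical rank (five in this application)
    scores = U[:, :r]                          # orthogonal factors (columns of U-hat)
    loadings = Vt[:r].T * d[:r]                # factor coefficients V-hat D-hat (m x r)
    bands = d[:r] / np.sqrt(B_hat.shape[1])    # heights +/- sigma_s(X_tilde B_hat) m^{-1/2}
    return scores, loadings, bands
```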

Fig. 2.

Arabidopsis thaliana data: factor coefficients of the 62 response genes from the plastoquinone, carotenoid, phytosterol and chlorophyll pathways. From left to right, the panels correspond to the top three factors estimated by the robust reduced-rank regression. For the sth factor (s = 1, 2, 3), two horizontal lines are plotted at heights ±σ_s(X̃B̂) m^{−1/2}, and three vertical lines separate the genes into the four pathways.

We also tested the significance of the factors in response to each of the 62 genes; see Table 1. Plastoquinone was excluded since it has only two genes and its behaviour couples with that of carotenoid most of the time. Even with the familywise error rate controlled at 0.01, the factors obtained are predictive overall according to the significance percentages, although they play very different roles in different pathways. In fact, according to Fig. 2 and Table 1, the genes that are correlated with the first factor are mainly from carotenoid and chlorophyll, and almost all the coefficients there are negative. It seems that the first factor interprets some joint characteristics of carotenoid and chlorophyll; the second factor differentiates phytosterol genes from carotenoid genes; and the third factor appears to contribute mainly to the phytosterol pathway. Therefore, by projecting the data onto a proper low-dimensional subspace in a supervised and robust manner, distinct behaviours of the downstream pathways and their potential subgroup structures can be revealed. Further biological insights could be gained by closely examining the experimental and background conditions.

Table 1.

Arabidopsis thaliana data: percentage of genes on each response pathway that show significance of a given factor, with the familywise error rate controlled at level 0.01

Pathway        Number of genes    Factor 1    Factor 2    Factor 3
Carotenoid     11                 55%         73%         9%
Phytosterol    25                 20%         48%         32%
Chlorophyll    24                 75%         21%         0%

Acknowledgement

The authors thank the editor, associate editor and referees for suggestions that have significantly improved the paper. The first author thanks Art Owen for his valuable comments on an earlier version of the manuscript, and Elvezio Ronchetti for inspiration. This work was partially supported by the U.S. National Science Foundation, National Institutes of Health and Department of Defense.

Supplementary material

Supplementary material available at Biometrika online includes all technical details of the proofs, a financial example and the simulation studies.

References

Aelst, S. V. & Willems, G. (2005). Multivariate regression S-estimators for robust estimation and inference. Statist. Sinica 15, 981–1001.

Agarwal, A., Negahban, S. & Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Statist. 40, 1171–97.

Anderson, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Statist. 22, 327–51.

Bunea, F., She, Y. & Wegkamp, M. (2011). Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Statist. 39, 1282–309.

Candès, E. J., Li, X., Ma, Y. & Wright, J. (2011). Robust principal component analysis? J. Assoc. Comp. Mach. 58, 1–37.

Chen, K., Dong, H. & Chan, K.-S. (2013). Reduced rank regression via adaptive nuclear norm penalization. Biometrika 100, 901–20.

Donoho, D. & Huber, P. J. (1983). The notion of breakdown point. In A Festschrift for Erich L. Lehmann, P. J. Bickel, K. Doksum & J. L. Hodges, eds., Wadsworth Statistics/Probability Series. Belmont, California: Wadsworth, pp. 157–84.

Donoho, D. L. & Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425–55.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96, 1348–60.

Foygel, R., Horrell, M., Drton, M. & Lafferty, J. (2012). Nonparametric reduced rank regression. Adv. Neural Info. Proces. Syst. 25, 1637–45.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. & Stahel, W. A. (2005). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.

Huber, P. (1981). Robust Statistics. New York: Wiley.

Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. J. Mult. Anal. 5, 248–64.

Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. New York: Springer.

Koltchinskii, V., Lounici, K. & Tsybakov, A. (2011). Nuclear norm penalization and optimal rates for noisy low rank matrix completion. Ann. Statist. 39, 2302–29.

Lounici, K., Pontil, M., van de Geer, S. & Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39, 2164–204.

Mukherjee, A. & Zhu, J. (2011). Reduced rank ridge regression and its kernel extensions. Statist. Anal. Data Mining 4, 612–22.

Reinsel, G. C. & Velu, R. P. (1998). Multivariate Reduced-Rank Regression: Theory and Applications. New York: Springer.

Roelant, E., Aelst, S. V. & Croux, C. (2009). Multivariate generalized S-estimators. J. Mult. Anal. 100, 876–87.

Rohde, A. & Tsybakov, A. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39, 887–930.

Rousseeuw, P. J. & van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–23.

She, Y. (2012). An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Comp. Statist. Data Anal. 56, 2976–90.

She, Y. (2013). Reduced rank vector generalized linear models for feature extraction. Statist. Interface 6, 197–209.

She, Y. & Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. J. Am. Statist. Assoc. 106, 626–39.

She, Y., Wang, J., Li, H. & Wu, D. (2013). Group iterative spectrum thresholding for super-resolution sparse spectral selection. IEEE Trans. Sig. Proces. 61, 6371–86.

Tatsuoka, K. & Tyler, D. (2000). The uniqueness of S-functionals and M-functionals under nonelliptical distributions. Ann. Statist. 28, 1219–43.

Vounou, M., Nichols, T. E. & Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: A sparse reduced-rank regression approach. NeuroImage 53, 1147–59.

Wille, A., Zimmermann, P., Vranova, E., Furholz, A., Laule, O., Bleuler, S., Hennig, L., Prelic, A., von Rohr, P., Thiele, L. et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 5, R92.

Wright, J., Ganesh, A., Rao, S., Peng, Y. & Ma, Y. (2009). Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams & A. Culotta, eds. Neural Information Processing Systems (NIPS), pp. 2080–8.

Ye, F. & Zhang, C.-H. (2010). Rate minimaxity of the lasso and Dantzig selector for the ℓ_q loss in ℓ_r balls. J. Mach. Learn. Res. 11, 3519–40.

Yee, T. & Hastie, T. J. (2003). Reduced rank vector generalized linear models. Statist. Mod. 3, 367–78.

Yuan, M., Ekici, A., Lu, Z. & Monteiro, R. (2007). Dimension reduction and coefficient estimation in multivariate linear regression. J. R. Statist. Soc. B 69, 329–46.

Zhang, C.-H. (2010a). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38, 894–942.

Zhang, T. (2010b). Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 11, 1081–107.

Zhou, Z., Li, X., Wright, J., Candès, E. & Ma, Y. (2010). Stable principal component pursuit. In Proc. 2010 IEEE Int. Symp. Info. Theory. Institute of Electrical and Electronics Engineers, pp. 1518–22.
