An overview of multi-task learning
Zhang and Yang, Natl Sci Rev, 2018, Vol. 5, No. 1

As a promising area in machine learning, multi-task learning (MTL) aims to improve the performance of multiple related learning tasks by leveraging useful information among them. In this paper, we give an overview of MTL by first giving a definition of MTL. Then several different settings of MTL are introduced, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. For each setting, representative MTL models are presented. In order to speed up the learning process, parallel and distributed MTL models are introduced. Many areas, including computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing, use MTL to improve the performance of the applications involved and some representative works are reviewed. Finally, recent theoretical analyses for MTL are presented.


INTRODUCTION
Machine learning, which exploits useful information in historical data and utilizes the information to help analyze future data, usually needs a large amount of labeled data to train a good learner. One typical class of learners is deep-learning models, which are neural networks with many hidden layers and many parameters; these models usually need millions of data instances to learn accurate parameters. However, some applications, such as medical image analysis, cannot satisfy this requirement since labeling data instances requires considerable manual labor. In these cases, multi-task learning (MTL) [1] is a good recipe: it exploits useful information from other related learning tasks to help alleviate this data sparsity problem.
As a promising area in machine learning, MTL aims to leverage useful information contained in multiple learning tasks to help learn a more accurate learner for each task. Based on an assumption that all the tasks, or at least a subset of them, are related, jointly learning multiple tasks is empirically and theoretically found to lead to better performance than learning them independently. Based on the nature of the tasks, MTL can be classified into several settings, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. In multi-task supervised learning, each task, which can be a classification or regression problem, is to predict labels for unseen data instances given a training dataset consisting of training data instances and their labels. In multi-task unsupervised learning, each task, which can be a clustering problem, aims to identify useful patterns contained in a training dataset consisting of data instances only. In multi-task semi-supervised learning, each task is similar to that in multi-task supervised learning, with the difference that the training set includes not only labeled data but also unlabeled data. In multi-task active learning, each task exploits unlabeled data to help learn from labeled data, as in multi-task semi-supervised learning, but does so in a different way: it selects unlabeled data instances and actively queries their labels. In multi-task reinforcement learning, each task aims to choose actions to maximize the cumulative reward. In multi-task online learning, each task handles sequential data. In multi-task multi-view learning, each task handles multi-view data in which there are multiple sets of features to describe each data instance.
MTL can be viewed as one way for machines to mimic human learning activities since people often transfer knowledge from one task to another and vice versa when these tasks are related. One example from our own experience is that the skills for playing squash and tennis can help improve each other. Similar to human learning, it is useful to learn multiple learning tasks simultaneously since the knowledge in a task can be utilized by other related tasks.
MTL is related to other areas in machine learning, including transfer learning [2], multi-label learning [3] and multi-output regression, but exhibits different characteristics. For example, like MTL, transfer learning also aims to transfer knowledge from one task to another, but the difference is that transfer learning uses one or more tasks to help a target task while MTL uses multiple tasks to help each other. When different tasks in multi-task supervised learning share the training data, the problem becomes multi-label learning or multi-output regression. In this sense, MTL can be viewed as a generalization of multi-label learning and multi-output regression.
In this paper, we give an overview of MTL. We first briefly introduce MTL by giving its definition. After that, based on the nature of each learning task, we discuss different settings of MTL, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. For each setting of MTL, representative MTL models are presented. When the number of tasks is large or data in different tasks are located in different machines, parallel and distributed MTL models become necessary and several models are introduced. As a promising learning paradigm, MTL has been applied to several areas, including computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing, and several representative applications in each area are presented. Moreover, theoretical analyses for MTL, which can give us a deep understanding of MTL, are reviewed.
The remainder of this paper is organized as follows. The section entitled 'Multi-task learning' introduces the definition of MTL. From the section entitled 'Multi-task supervised learning' to that entitled 'Multi-task multi-view learning', we give an overview of different settings in MTL, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. The section entitled 'Parallel and distributed MTL' discusses parallel and distributed MTL models. The section entitled 'Applications of multi-task learning' shows how MTL can help other areas and that entitled 'Theoretical analysis' focuses on theoretical analyses of MTL. Finally, the section entitled 'Conclusions' concludes the whole paper.

MULTI-TASK LEARNING
To start with, we give a definition of MTL.
Definition (Multi-task learning). Given $m$ learning tasks $\{T_i\}_{i=1}^{m}$, where all the tasks or a subset of them are related but not identical, multi-task learning aims to help improve the learning of a model for $T_i$ by using the knowledge contained in the $m$ tasks.
Based on this definition, we can see that there are two elementary factors for MTL.
The first factor is the task relatedness. The task relatedness is based on the understanding of how different tasks are related, which will be encoded into the design of MTL models, as we will see later.
The second factor is the definition of a task. In machine learning, learning tasks mainly include supervised tasks such as classification and regression tasks, unsupervised tasks such as clustering tasks, semi-supervised tasks, active learning tasks, reinforcement learning tasks, online learning tasks and multi-view learning tasks. Hence different learning tasks lead to different settings in MTL. In the following sections, we review representative MTL models in each of these settings.

MULTI-TASK SUPERVISED LEARNING
The multi-task supervised learning (MTSL) setting means that each task in MTL is a supervised learning task, which models the functional mapping from data instances to labels. Mathematically, suppose there are $m$ supervised learning tasks $T_i$ for $i = 1, \ldots, m$ and each supervised task is associated with a training dataset $D_i = \{(x^i_j, y^i_j)\}_{j=1}^{n_i}$, where each data instance $x^i_j$ lies in a $d$-dimensional space and $y^i_j$ is the label for $x^i_j$. So, for the $i$th task $T_i$, there are $n_i$ pairs of data instances and labels. When $y^i_j$ is in a continuous space or equivalently a real scalar, the corresponding task is a regression task and if $y^i_j$ is discrete, e.g. $y^i_j \in \{-1, 1\}$, the corresponding task is a classification task. MTSL aims to learn $m$ functions $\{f_i(x)\}_{i=1}^{m}$ for the $m$ tasks from the training set such that $f_i(x^i_j)$ is a good approximation of $y^i_j$ for all $i$ and $j$. After learning the $m$ functions, MTSL uses $f_i(\cdot)$ to predict labels of unseen data instances from the $i$th task.
As discussed before, the understanding of task relatedness affects the design of MTSL models. Specifically, existing MTSL models reflect the task relatedness in three aspects: feature, parameter and instance, leading to three categories of MTSL models: feature-based, parameter-based and instance-based MTSL models. Feature-based MTSL models assume that different tasks share identical or similar feature representations, which can be a subset or a transformation of the original features. Parameter-based MTSL models aim to encode the task relatedness into the learning model via a regularizer or prior on model parameters. Instance-based MTSL models propose to use data instances from all the tasks to construct a learner for each task via instance weighting. In the following, we will review representative models in the three categories.

Feature-based MTSL
In this category, all MTL models assume that different tasks share a feature representation, which is induced by the original feature representation. Based on how the shared feature representation appears, we further categorize multi-task models into three approaches: the feature transformation approach, the feature selection approach and the deep-learning approach. The feature transformation approach learns the shared feature representation as a linear or nonlinear transformation of the original features. The feature selection approach assumes that the shared feature representation is a subset of the original features. The deep-learning approach applies deep neural networks to learn the shared feature representation, which is encoded in the hidden layers, for multiple tasks.

Feature transformation approach
In this approach, the shared feature representation is a linear or nonlinear transformation of the original feature representation. A representative model is the multi-layer feedforward neural network [1] and an example of a multi-layer feedforward neural network is shown in Fig. 1. In this example, the multi-layer feedforward neural network consists of an input layer, a hidden layer and an output layer. The input layer has $d$ units to receive data instances from the $m$ tasks as inputs, with one unit for a feature. The hidden layer contains multiple nonlinear activation units and receives the transformed output of the input layer as the input, where the transformation depends on the weights connecting the input and hidden layers. As a transformation of the original features, the output of the hidden layer is the feature representation shared by all the tasks. The output of the hidden layer is first transformed based on the weights connecting the hidden and output layers, and then fed into the output layer, which has $m$ units, each of which corresponds to a task.
Unlike multi-layer feedforward neural networks, the multi-task feature learning (MTFL) method [5,6] and the multi-task sparse coding (MTSC) method [7] are formulated under the regularization framework by first transforming data instances as $\hat{x}^i_j = U^T x^i_j$ and then learning a linear function as $f_i(x^i_j) = (a^i)^T \hat{x}^i_j + b_i$. Based on this formulation, we can see that these two methods aim to learn a linear transformation $U$ instead of the nonlinear transformation in multi-layer feedforward neural networks. Moreover, there are several differences between the MTFL and MTSC methods. For example, in the MTFL method, $U$ is supposed to be orthogonal and the parameter matrix $A = (a^1, \ldots, a^m)$ is row-sparse via the $\ell_{2,1}$ regularization, while in the MTSC method, $U$ is overcomplete, implying that the number of columns in $U$ is much larger than the number of rows, and $A$ is sparse via the $\ell_1$ regularization.
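The linear formulation used by MTFL can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions, not the MTFL training procedure: $U$, $A$ and $b$ are filled with random values instead of being learned, the dimensions are arbitrary, and the function name `predict` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 3   # feature dimension and number of tasks (illustrative values)

# Orthogonal U (d x d), obtained via QR of a random matrix for illustration only;
# MTFL would learn U from data.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Row-sparse A (d x m): only the first two transformed features are used, and
# they are used by every task. Column i holds a^i.
A = np.zeros((d, m))
A[:2, :] = rng.standard_normal((2, m))
b = rng.standard_normal(m)   # per-task offsets b_i


def predict(i, x):
    """Evaluate f_i(x) = (a^i)^T (U^T x) + b_i for task i."""
    x_hat = U.T @ x            # shared representation
    return A[:, i] @ x_hat + b[i]


x = rng.standard_normal(d)
outputs = np.array([predict(i, x) for i in range(m)])

# l_{2,1} norm of A: sum of the Euclidean norms of its rows.
l21_A = np.linalg.norm(A, axis=1).sum()
```

Because the zero rows of `A` are shared by all tasks, the corresponding transformed features are ignored by every task, which is the effect the $\ell_{2,1}$ regularization is meant to encourage.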

Feature selection approach
The feature selection approach aims to select a subset of original features as the shared feature representation for different tasks. There are two ways to do multi-task feature selection. The first is based on regularization on $W = (w^1, \ldots, w^m)$, where $f_i(x) = (w^i)^T x + b_i$ defines the linear learning function for $T_i$, and the second is based on sparse probabilistic priors on $W$. In the following, we will give details of these two ways.
Among all the regularized methods for multi-task feature selection, the most widely used technique is $\ell_{p,q}$ regularization, which minimizes $\|W\|_{p,q}$, the $\ell_{p,q}$ norm of $W$, plus the training loss on the training set, where $w_j$ denotes the $j$th row of $W$, $\|\cdot\|_q$ denotes the $\ell_q$ norm of a vector, and $\|W\|_{p,q}$ equals $\|(\|w_1\|_p, \ldots, \|w_d\|_p)\|_q$. The effect of the $\ell_{p,q}$ regularization is to make $W$ row-sparse and hence features that are unimportant for all the tasks can be filtered out. Concrete instances of the $\ell_{p,q}$ regularization include the $\ell_{2,1}$ regularization proposed in [8,9] and the $\ell_{\infty,1}$ regularization proposed in [10]. In order to obtain a smaller subset of useful features for multiple tasks, a capped-$\ell_{p,1}$ penalty, which is defined as $\sum_{i=1}^{d} \min(\|w_i\|_p, \theta)$, is proposed in [11]. It is easy to see that when $\theta$ becomes large enough, this capped-$\ell_{p,1}$ penalty will degenerate to the $\ell_{p,1}$ regularization. Besides the $\ell_{p,q}$ regularization, there is another type of regularized method that can select features for MTL. For example, in [12], a multi-level lasso is proposed by decomposing $w_{ji}$, the $(j, i)$th entry in $W$, as $w_{ji} = \theta_j \hat{w}_{ji}$. It is easy to see that when $\theta_j$ equals 0, $w_j$ becomes a zero row, implying that the $j$th feature is not useful for any of the tasks, and hence $\theta_j$ is an indicator of the usefulness of the $j$th feature for all the tasks. Moreover, when $\hat{w}_{ji}$ becomes 0, $w_{ji}$ will also become 0 and hence $\hat{w}_{ji}$ is an indicator of the usefulness of the $j$th feature for $T_i$ only. By regularizing $\theta_j$ and $\hat{w}_{ji}$ via the $\ell_1$ norm to enforce them to be sparse, the multi-level lasso can learn sparse features at two levels. This model is extended in [13,14] to more general settings.
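To make the row-wise norms concrete, the following NumPy sketch evaluates the $\ell_{p,q}$ norm and a capped-$\ell_{p,1}$ penalty on a small matrix. It assumes the capped penalty sums $\min(\|w_j\|_p, \theta)$ over the rows of $W$; the function names `lpq_norm` and `capped_lp1` are hypothetical and the matrix is a toy example.

```python
import numpy as np


def lpq_norm(W, p, q):
    """l_{p,q} norm: the l_q norm of the vector of row-wise l_p norms of W."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)


def capped_lp1(W, p, theta):
    """Capped l_{p,1} penalty: sum over rows of min(||w_j||_p, theta)."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return np.minimum(row_norms, theta).sum()


# Rows are features, columns are tasks; the middle row is zero, i.e. that
# feature is discarded for every task.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

l21 = lpq_norm(W, 2, 1)             # 5 + 0 + 1
linf1 = lpq_norm(W, np.inf, 1)      # 4 + 0 + 1
capped = capped_lp1(W, 2, 2.0)      # min(5, 2) + 0 + min(1, 2)
uncapped = capped_lp1(W, 2, 100.0)  # theta large enough: equals the l_{2,1} norm
```

With a small $\theta$, a row with a large norm contributes no more than $\theta$ to the penalty, so only rows with small norms are pushed further towards zero.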
For multi-task feature selection methods based on the $\ell_{p,1}$ regularization, a probabilistic interpretation is proposed in [15], which shows that the $\ell_{p,1}$ regularizer corresponds to a prior: $w_{ji} \sim \mathcal{GN}(0, \rho_j, p)$, where $\mathcal{GN}(\cdot, \cdot, \cdot)$ denotes the generalized normal distribution. This prior is then extended in [15] to the matrix-variate generalized normal prior to learn relations among tasks and identify outlier tasks simultaneously. In [16,17], the horseshoe prior is utilized to select features for MTL. The difference between [16] and [17] is that in [16], the horseshoe prior is generalized to learn feature covariance, while in [17], the horseshoe prior is used as a basic prior and the whole model identifies outlier tasks in a way different from [15].

Deep-learning approach
Similar to the multi-layer feedforward neural network model in the feature transformation approach, basic models in the deep-learning approach include advanced neural network models such as convolutional neural networks and recurrent neural networks. However, unlike the multi-layer feedforward neural network with a small number of hidden layers (e.g. 2 or 3), the deep-learning approach involves neural networks with tens or even hundreds of hidden layers. Moreover, similar to the multi-layer feedforward neural network, most deep-learning models [18–22] in this category treat the output of one hidden layer as the shared feature representation. Unlike these deep models, the cross-stitch network proposed in [23] combines the hidden feature representations of two tasks to construct more powerful hidden feature representations. Specifically, given two deep neural networks A and B with the same network architecture for two tasks, where $x^A_{i,j}$ and $x^B_{i,j}$ denote the hidden features contained in the $j$th unit of the $i$th hidden layer for networks A and B, the cross-stitch operation on $x^A_{i,j}$ and $x^B_{i,j}$ can be defined as $\begin{pmatrix} \tilde{x}^A_{i,j} \\ \tilde{x}^B_{i,j} \end{pmatrix} = \begin{pmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{pmatrix} \begin{pmatrix} x^A_{i,j} \\ x^B_{i,j} \end{pmatrix}$, where $\tilde{x}^A_{i,j}$ and $\tilde{x}^B_{i,j}$ denote the new hidden features after the joint learning of the two tasks. The matrix $\alpha = (\alpha_{kl})$ as well as the parameters in the two networks are learned from data via the back propagation method and hence this method is more flexible than directly sharing hidden layers.
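A minimal NumPy sketch of a cross-stitch unit follows, assuming the operation is a $2 \times 2$ linear mixing of matching hidden units of the two networks. The mixing matrix is fixed by hand here, whereas in [23] it is learned by back propagation; the function name `cross_stitch` and all values are illustrative.

```python
import numpy as np


def cross_stitch(x_a, x_b, alpha):
    """Linearly recombine matching hidden units of two networks.

    x_a, x_b: arrays of the same shape holding the hidden features of
    networks A and B at one layer. alpha: 2 x 2 mixing matrix.
    """
    new_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    new_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return new_a, new_b


x_a = np.array([1.0, 2.0, 3.0])
x_b = np.array([10.0, 20.0, 30.0])

# Identity mixing leaves both networks untouched (no sharing).
same_a, same_b = cross_stitch(x_a, x_b, np.eye(2))

# Uniform mixing gives both networks the average of the two feature vectors.
mix_a, mix_b = cross_stitch(x_a, x_b, np.full((2, 2), 0.5))
```

The two extreme settings show why the unit is flexible: the identity matrix recovers two independent networks, while equal weights amount to fully shared features at that layer.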

Parameter-based MTSL
Parameter-based MTSL uses model parameters to relate the learning of different tasks. Based on how the model parameters of different tasks are related, we classify parameter-based models into five approaches: the low-rank approach, the task-clustering approach, the task-relation learning approach, the dirty approach and the multi-level approach. Specifically, since tasks are assumed to be related, the parameter matrix W is likely to be low-rank, which is the motivation for the low-rank approach. The task-clustering approach aims to divide tasks into several clusters and all the tasks in a cluster are assumed to share identical or similar model parameters. The task-relation learning approach directly learns the pairwise task relations from data. The dirty approach assumes the decomposition of the parameter matrix W into two component matrices, each of which is regularized by a different type of sparsity. As a generalization of the dirty approach, the multi-level approach decomposes the parameter matrix into more than two component matrices to model complex relations among all the tasks. In the following sections, we will discuss each approach in detail.

Low-rank approach
Similar tasks usually have similar model parameters, which makes $W$ likely to be low-rank. In [24], the model parameters of the $m$ tasks are assumed to share a low-rank subspace, leading to a parametrization of $w^i$ as $w^i = u^i + \Theta^T v^i$, where $\Theta \in \mathbb{R}^{h \times d}$ is a low-rank subspace shared by all the tasks with $h < d$ and $u^i$ is specific to task $T_i$. With the assumption that $\Theta$ is orthonormal (i.e. $\Theta\Theta^T = I$, where $I$ denotes an identity matrix with an appropriate size) to remove the redundancy, $u^i$, $v^i$ and $\Theta$ are learned by minimizing the training loss on all the tasks. This model is then generalized in [25] by adding a squared Frobenius regularization on $W$ and this generalized model can be relaxed to have a convex objective function.
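The shared-subspace parametrization can be sketched as follows, assuming $\Theta$ has orthonormal rows so that $\Theta\Theta^T = I$. All quantities are drawn at random purely for illustration, whereas the method in [24] learns them by minimizing the training loss; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, m = 6, 2, 4   # feature dimension, subspace dimension (h < d), number of tasks

# Theta (h x d) with orthonormal rows, so Theta @ Theta.T = I_h. It is drawn at
# random here; the method in the text learns it jointly with u^i and v^i.
Q, _ = np.linalg.qr(rng.standard_normal((d, h)))
Theta = Q.T

U_specific = 0.01 * rng.standard_normal((d, m))  # columns u^i: task-specific parts
V_shared = rng.standard_normal((h, m))           # columns v^i: subspace coordinates

# w^i = u^i + Theta^T v^i for every task, stacked as the columns of W.
shared_part = Theta.T @ V_shared
W = U_specific + shared_part

rank_shared = np.linalg.matrix_rank(shared_part)
```

The shared component `shared_part` has rank at most $h$, so when the task-specific parts are small, $W$ is close to a low-rank matrix.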
Based on the analysis in optimization, regularizing with the trace norm, which is defined as $\|W\|_{S(1)} = \sum_{i=1}^{\min(m,d)} \mu_i(W)$, where $\mu_i(W)$ denotes the $i$th singular value of $W$, can make a matrix low-rank and hence trace-norm regularization is widely used in MTL, with [26] as a representative work. Similar to what the capped-$\ell_{p,1}$ penalty does to the $\ell_{p,1}$ norm, a variant of the trace-norm regularization called the capped-trace regularizer is proposed in [27] and defined as $\sum_{i=1}^{\min(m,d)} \min(\mu_i(W), \theta)$, where $\theta$ is a parameter defined by users. Based on $\theta$, only small singular values of $W$ will be penalized and hence it can lead to a matrix with a lower rank. When $\theta$ becomes large enough, the capped-trace regularizer will reduce to the trace norm.
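As a small illustration, the sketch below computes the trace norm (the sum of singular values) and a capped-trace value, assuming the capped-trace regularizer caps each singular value at $\theta$ before summing. The function names `trace_norm` and `capped_trace` are hypothetical.

```python
import numpy as np


def trace_norm(W):
    """Sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()


def capped_trace(W, theta):
    """Sum of min(singular value, theta) over the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.minimum(s, theta).sum()


# Diagonal example, so the singular values are simply 3, 1 and 0.5.
W = np.diag([3.0, 1.0, 0.5])

tn = trace_norm(W)                 # 3 + 1 + 0.5
ct = capped_trace(W, 1.0)          # min(3, 1) + min(1, 1) + min(0.5, 1)
ct_large = capped_trace(W, 10.0)   # theta large enough: equals the trace norm
```

In this example the largest singular value contributes only $\theta$ to the capped value, so increasing it further is not penalized, while the small singular values are still penalized in full.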

Task-clustering approach
The task-clustering approach applies the idea of data-clustering methods to group tasks into several clusters, each of which has similar tasks in terms of model parameters.
The first task-clustering algorithm, proposed in [28], decouples the task-clustering procedure and the model-learning procedure. Specifically, it first clusters tasks based on the model parameters learned separately under the single-task setting and then pools the training data of all the tasks in a task cluster to learn a more accurate learner for all the tasks in this task cluster. This two-stage method may be suboptimal since model parameters learned under the single-task setting may be inaccurate, which in turn degrades the task clustering. Follow-up research therefore aims to identify the task clusters and learn the model parameters together.
A multi-task Bayesian neural network, whose structure is similar to that of the multi-layer neural network shown in Fig. 1, is proposed in [29] to cluster tasks based on the Gaussian mixture model in terms of model parameters (i.e. weights connecting the hidden and output layers). The Dirichlet process, which is widely used in Bayesian learning to do data clustering, is employed in [30] to do task clustering based on model parameters $\{w^i\}$.
Unlike [29,30], which are Bayesian models, there are several regularized methods [31–35] to do task clustering. Inspired by the k-means clustering method, Jacob et al. [31] devise a regularizer, i.e. $\mathrm{tr}(W \Pi \Sigma^{-1} \Pi W^T)$, to identify task clusters by considering between-cluster and within-cluster variances, where $\mathrm{tr}(\cdot)$ gives the trace of a square matrix, $\Pi$ denotes an $m \times m$ centering matrix, $A \preceq B$ for two square matrices $A$, $B$ means that $B - A$ is positive semidefinite (PSD), and, with three hyperparameters $\alpha$, $\beta$, $\gamma$, $\Sigma$ is required to satisfy $\alpha I \preceq \Sigma \preceq \beta I$ and $\mathrm{tr}(\Sigma) = \gamma$. The MTFL method is extended in [32] to the case of multiple clusters, where each cluster applies the MTFL method, and in order to learn the cluster structure, a regularizer, i.e. $\sum_{i=1}^{r} \|W Q_i\|^2_{S(1)}$, is employed, where a 0/1 diagonal matrix $Q_i$ satisfying $\sum_{i=1}^{r} Q_i = I$ can help identify the structure of the $i$th cluster. In order to automatically determine the number of clusters, a structurally sparse regularizer, $\sum_{j>i} \|w^i - w^j\|_2$, is proposed in [34] to enforce any pair of model parameters to be fused. After learning the parameter matrix $W$, the cluster structure can be determined by checking, for each pair $(i, j)$, whether $\|w^i - w^j\|_2$ is below a threshold. Both works [33,35] decompose $W$ as $W = LS$, where the columns of $L$ consist of basis parameter vectors in different clusters and $S$ contains combination coefficients. Both methods penalize the complexity of $L$ via the squared Frobenius norm but they learn $S$ in different ways. Specifically, the method in [33] aims to identify overlapping task clusters where each task can belong to multiple clusters and hence it learns a sparse $S$ via the $\ell_1$ regularization, while in [35], each task lies in only one cluster and hence the $\ell_2$ norm of each column in the 0/1 matrix $S$ is enforced to be 1.
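The thresholding step used to read off clusters from a learned parameter matrix can be sketched as below. The sketch assumes that tasks are linked whenever the Euclidean distance between their parameter vectors is below the threshold and that clusters are the connected components of these links; the latter is our own simplification rather than a detail taken from [34]. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def clusters_from_parameters(W, threshold):
    """Group tasks whose parameter vectors (columns of W) are closer than threshold.

    Tasks i and j are linked when ||w^i - w^j||_2 < threshold; clusters are the
    connected components of the resulting graph (found with union-find).
    """
    m = W.shape[1]
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if np.linalg.norm(W[:, i] - W[:, j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Four tasks in two dimensions: tasks 0 and 1 are nearly fused, as are 2 and 3.
W = np.array([[1.0, 1.01, -2.0, -2.0],
              [0.5, 0.50, 3.0, 3.02]])
clusters = clusters_from_parameters(W, threshold=0.1)
```

For this toy matrix the procedure returns two clusters, tasks {0, 1} and tasks {2, 3}, without the number of clusters being specified in advance.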

Task-relation learning approach
In this approach, task relations are used to reflect the task relatedness and some examples for the task relations include task similarities and task covariances, just to name a few.
In earlier studies on this approach, task relations are either defined by model assumptions [36,37] or given by a priori information [38–41]. Neither way is ideal or practical, since model assumptions are hard to verify for real-world applications and a priori information is difficult to obtain. A more advanced way is to learn the task relations from data, which is the focus of this section.
A multi-task Gaussian process is proposed in [42] to define a prior on $f^i_j$, the functional value corresponding to $x^i_j$, as $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, where $\mathbf{f} = (f^1_1, \ldots, f^m_{n_m})^T$. The entry in $\Sigma$ corresponding to the covariance between $f^i_j$ and $f^p_q$ is defined as $\omega_{ip}\, k(x^i_j, x^p_q)$, where $k(\cdot, \cdot)$ defines a kernel function and $\omega_{ip}$ is the covariance between tasks $T_i$ and $T_p$. Then, based on the Gaussian likelihood on labels given $\mathbf{f}$, the marginal likelihood, which has an analytical form, is used to learn $\Omega$, the task covariance that reflects the task relatedness, with its $(i, p)$th entry as $\omega_{ip}$. In order to utilize Bayesian averaging to achieve better performance, a multi-task generalized t process is proposed in [43] by placing an inverse-Wishart prior on $\Omega$. A regularized model called the multi-task relationship learning (MTRL) method is proposed in [44,45] by placing a matrix-variate normal prior on $W$: $W \sim \mathcal{MN}(0, I, \Omega)$, where $\mathcal{MN}(M, A, B)$ denotes a matrix-variate normal distribution with $M$, $A$, $B$ as the mean, row covariance and column covariance. This prior corresponds to a regularizer $\mathrm{tr}(W \Omega^{-1} W^T)$, where the PSD task covariance $\Omega$ is required to satisfy $\mathrm{tr}(\Omega) \le 1$. The MTRL method is generalized to multi-task boosting [46] and multi-label learning [47], where each label is treated as a task, and extended to learn sparse task relations in [48]. A model similar to the MTRL method is proposed in [49] by assigning a prior on $W$ as $W \sim \mathcal{MN}(0, \Omega_1, \Omega_2)$, and it learns the sparse inverses of $\Omega_1$ and $\Omega_2$. Since the prior used in the MTRL method implies that $W^T W$ follows a Wishart distribution as $\mathcal{W}(0, \Omega)$, the MTRL method is generalized in [50] by studying a high-order prior: $(W^T W)^t \sim \mathcal{W}(0, \Omega)$, where $t$ is a positive integer. In [51], a regularizer similar to that of the MTRL method is proposed by assuming a parametric form of $\Omega$ as $\Omega^{-1} = (I_m - A)(I_m - A)^T$, where $A$ is an asymmetric task relation, as claimed in [51].
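The structure of the multi-task Gaussian process prior can be sketched in NumPy, assuming the covariance between two function values is the product of a task covariance and a kernel value on the inputs. The squared-exponential kernel, the task covariance values and the function names are illustrative assumptions; in [42] the task covariance is learned by maximizing the marginal likelihood.

```python
import numpy as np


def rbf_kernel(x, z, length_scale=1.0):
    """Squared-exponential kernel, used here only as an example choice of k."""
    diff = np.asarray(x) - np.asarray(z)
    return float(np.exp(-0.5 * diff @ diff / length_scale**2))


def multitask_covariance(X, task_ids, Omega, kernel=rbf_kernel):
    """Covariance whose entry for instances a and b is Omega[t_a, t_b] * k(x_a, x_b)."""
    n = len(X)
    Sigma = np.empty((n, n))
    for a in range(n):
        for b in range(n):
            Sigma[a, b] = Omega[task_ids[a], task_ids[b]] * kernel(X[a], X[b])
    return Sigma


# Two tasks with two instances each (toy values).
X = np.array([[0.0], [1.0], [0.0], [2.0]])
task_ids = [0, 0, 1, 1]
Omega = np.array([[1.0, 0.8],
                  [0.8, 1.0]])   # positively related tasks

Sigma = multitask_covariance(X, task_ids, Omega)
```

Here instances 0 and 2 share the same input but belong to different tasks, so their prior covariance equals the task covariance 0.8; setting the off-diagonal entries of `Omega` to zero would make the two tasks independent under the prior.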
Unlike the aforementioned methods, which rely on global learning models, local learning methods such as the k-nearest-neighbor (kNN) classifier are extended in [52] to the multi-task setting and the learning function is defined as $f_i(x^i_j) = \sum_{(p,q) \in N_k(i,j)} \sigma_{ip}\, s(x^i_j, x^p_q)\, y^p_q$, where $N_k(i, j)$ denotes the set of task and instance indices for the $k$ nearest neighbors of $x^i_j$, $s(\cdot, \cdot)$ defines the similarity between instances, and $\sigma_{ip}$ represents the similarity of task $T_p$ to $T_i$. By enforcing $\sigma_{ip}$ to be close to $\sigma_{pi}$, a regularizer $\|\Sigma - \Sigma^T\|_F^2$, where $\Sigma$ is the matrix whose $(i, p)$th entry is $\sigma_{ip}$, is proposed in [52] to learn task similarities, where each $\sigma_{ip}$ needs to satisfy $\sigma_{ii} \ge 0$ and $|\sigma_{ip}| \le \sigma_{ii}$ for $i \ne p$.
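The multi-task kNN decision function can be sketched as follows, assuming it is a sum over the $k$ nearest pooled neighbors of the task similarity times the instance similarity times the neighbor's label. The Gaussian instance similarity, the hand-set task similarities and the function names are illustrative assumptions; in [52] the task similarities are learned.

```python
import numpy as np


def similarity(x, z):
    """Example instance similarity s(x, z): a Gaussian of the Euclidean distance."""
    return float(np.exp(-np.sum((x - z) ** 2)))


def mt_knn_score(i, x, X, y, task_ids, sigma, k):
    """Score of query x for task i: sum over the k most similar pooled instances
    of sigma[i, p] * s(x, x_q^p) * y_q^p, where p is the neighbour's task."""
    sims = np.array([similarity(x, z) for z in X])
    neighbours = np.argsort(-sims)[:k]
    return sum(sigma[i, task_ids[n]] * sims[n] * y[n] for n in neighbours)


# Pooled training data from two tasks with labels in {-1, +1} (toy values).
X = np.array([[0.0], [0.2], [0.1], [3.0]])
y = np.array([1, 1, -1, -1])
task_ids = np.array([0, 0, 1, 1])

# sigma[i, p]: similarity of task p to task i, fixed by hand for illustration.
sigma_related = np.array([[1.0, 0.9], [0.9, 1.0]])
sigma_opposed = np.array([[1.0, -0.9], [-0.9, 1.0]])

query = np.array([0.1])
score_related = mt_knn_score(0, query, X, y, task_ids, sigma_related, k=3)
score_opposed = mt_knn_score(0, query, X, y, task_ids, sigma_opposed, k=3)
```

A negative task similarity flips the vote of a neighbor from the other task, which is why `score_opposed` is larger than `score_related` in this toy example; the sign of the score gives the predicted class.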

Dirty approach
The dirty approach assumes the decomposition of the parameter matrix $W$ as $W = U + V$, where $U$ and $V$ capture different parts of the task relatedness. The objective functions of different models in this approach can be unified as minimizing the training loss on all the tasks plus two regularizers, $g(U)$ and $h(V)$, on $U$ and $V$, respectively. Hence, the different methods belonging to this approach differ in the choices of $g(U)$ and $h(V)$.
Here we introduce five methods in this approach, i.e. [53–57]. Different choices of $g(U)$ and $h(V)$ for the five methods are shown in Table 1. Based on Table 1, we can see that the choices of $g(U)$ in [53,56] make $U$ row-sparse via the $\ell_{\infty,1}$ and $\ell_{2,1}$ norms, respectively. The choices of $g(U)$ in [54,55] enforce $U$ to be low-rank via the trace norm as the regularizer and constraint, respectively. Unlike these methods, $g(U)$ in [57] penalizes the complexity of $U$ via the squared Frobenius norm and clusters features in different tasks based on the fused lasso regularizer. For $V$, $h(V)$ makes it sparse via the $\ell_1$ norm in [53,54] and column-sparse via the $\ell_{2,1}$ norm in [55,56], while in [57], $h(V)$ penalizes the complexity of $V$ via the squared Frobenius norm.
In the decomposition, U mainly identifies the task relatedness among tasks similar to the feature selection approach or low-rank approach while V is capable of capturing noises or outliers via the sparsity. The combination of U and V can help the learner become more robust.
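The unified objective of the dirty approach can be sketched as below for one concrete choice of regularizers, a row-sparsity norm on U and an entry-wise sparsity norm on V, following the description of [53] above. The squared loss, the regularization weights and the function names are illustrative assumptions, and the sketch only evaluates the objective rather than optimizing it.

```python
import numpy as np


def linf1_norm(U):
    """l_{inf,1} norm: sum over rows of the largest absolute entry (row sparsity)."""
    return np.abs(U).max(axis=1).sum()


def l1_norm(V):
    """Entry-wise l_1 norm (element sparsity)."""
    return np.abs(V).sum()


def dirty_objective(U, V, tasks, lam_u, lam_v):
    """Squared training loss of W = U + V over all tasks plus g(U) and h(V)."""
    W = U + V
    loss = sum(np.sum((X @ W[:, i] - y) ** 2) for i, (X, y) in enumerate(tasks))
    return loss + lam_u * linf1_norm(U) + lam_v * l1_norm(V)


rng = np.random.default_rng(0)
d, m = 4, 3

# U is row-sparse (only feature 0 is shared); V holds one task-specific entry.
U = np.zeros((d, m))
U[0, :] = 1.0
V = np.zeros((d, m))
V[2, 1] = 0.5
W_true = U + V

# Noise-free toy data, so the training loss of (U, V) is zero.
tasks = []
for i in range(m):
    X = rng.standard_normal((10, d))
    tasks.append((X, X @ W_true[:, i]))

objective = dirty_objective(U, V, tasks, lam_u=0.1, lam_v=0.1)
```

In this example the shared row of `U` is charged once by the row-wise norm, while the single deviation of task 1 is absorbed by `V`, so the objective equals 0.1 · 1 + 0.1 · 0.5 = 0.15.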

Multi-level approach
As a generalization of the dirty approach, the multi-level approach decomposes the parameter matrix $W$ into $h$ component matrices, i.e. $W = \sum_{i=1}^{h} W_i$, where the number of levels, $h$, is no smaller than 2. In the following, we show how the multi-level decomposition can help model complex task structures.
Table 1. Choices of g(U) and h(V) for different methods in the dirty approach.
In the task-clustering approach, different task clusters usually have no overlap, which may restrict the expressive power of the resulting learners. In [58], all possible task clusters are enumerated, leading to $2^m - 1$ task clusters, and they are organized in a tree with the root node as a dummy node, where the parent-child relation in the tree is the 'subset of' relation. This tree has $2^m$ nodes, each of which corresponds to a level, and hence an index $t$ denotes both a node in the tree and the corresponding level. In order to handle a tree with such a large number of nodes, the authors make an assumption that if a cluster is not useful then none of its supersets are either, which means that if a node in the tree is not helpful then none of its descendants are either. Based on this assumption, a regularizer based on the squared $\ell_{p,1}$ norm is devised as $\left(\sum_{v \in V} \lambda_v \left(\sum_{t \in D(v)} s(W_t)^p\right)^{1/p}\right)^2$, where $V$ denotes the set of nodes in the tree, $\lambda_v$ is a regularization parameter for node $v$, and $D(v)$ denotes the set of descendants of $v$. Here $s(W_t)$ uses the regularizer proposed in [36] to enforce different columns in $W_t$ to be close to their average. Unlike [58], where each level involves a subset of tasks, a multi-level task-clustering method is proposed in [34] to cluster all the tasks at each level based on a structurally sparse regularizer. In [59], each component matrix is assumed to be jointly sparse and row-sparse but in different proportions, which are more similar in successive component matrices. In order to achieve this, a regularizer that sums a penalty over the $h$ component matrices is used. Unlike the aforementioned methods, where different component matrices have no direct interaction, in [60], with direct connections between component matrices at successive levels, the complex hierarchical/tree structure among tasks can be learned from data. Specifically, built on the multi-level task-clustering method [34], a sequential constraint on the component matrices is devised in [60] to help make the whole structure become a tree.
Compared with the dirty approach that focuses on identifying noises or outliers, the multi-level approach is capable of modeling more complex task structures such as complex task clusters and tree structures.

Instance-based MTSL
There are few works in this category with the multi-task distribution matching method proposed in [61] as a representative work. Specifically, it first estimates the ratio between probabilities that each instance is from its own task and from a mixture of all the tasks. After determining ratios via softmax functions, this method uses ratios to determine the instance weights and then learns model parameters for each task based on weighted instances from all the tasks.

Discussion
Feature-based MTSL can learn a common feature representation for different tasks and it is more suitable for applications whose original feature representation is not so informative and discriminative, e.g. in computer vision, natural language processing and speech. However, feature-based MTSL can easily be affected by outlier tasks that are unrelated to other tasks, since it is difficult to learn a common feature representation for tasks that are unrelated to each other. Given a good feature representation, parameter-based MTSL can learn more accurate model parameters and it is more robust to outlier tasks via a robust representation of model parameters. Hence feature-based MTSL is complementary to parameter-based MTSL. Instance-based MTSL, which is currently being explored, seems parallel to the other two categories.
In summary, the MTSL setting is the most important one in the research of MTL since it sets the stage for research in other settings. Among the existing research efforts in MTL, about 90% of works study the MTSL setting, while in the MTSL setting, the feature-based and parameter-based MTSL attract most attention from the community.

MULTI-TASK UNSUPERVISED LEARNING
Unlike multi-task supervised learning, where each data instance is associated with a label, in multi-task unsupervised learning, the training set $D_i$ of the $i$th task consists of only $n_i$ data instances $\{x^i_j\}$ and the goal of multi-task unsupervised learning is to exploit the information contained in $D_i$. Typical unsupervised learning tasks include clustering, dimensionality reduction, manifold learning, visualization and so on, but multi-task unsupervised learning mainly focuses on multi-task clustering. Clustering is to divide a set of data instances into several groups, each of which has similar instances, and hence multi-task clustering aims to conduct clustering on multiple datasets by leveraging useful information contained in different datasets.
Few studies on multi-task clustering exist. In [62], two multi-task clustering methods are proposed. They extend the MTFL and MTRL methods [5,44], two models in the MTSL setting, to the clustering scenario. Their formulations are almost identical to those of the MTFL and MTRL methods, the only difference being that the labels are treated as unknown cluster indicators that need to be learned from data.

MULTI-TASK SEMI-SUPERVISED LEARNING
In many applications, labeling data requires a great deal of manual labor, so labeled data are scarce while unlabeled data are ample. In this case, unlabeled data can be used to help improve the performance of supervised learning, leading to semi-supervised learning, whose training set consists of a mixture of labeled and unlabeled data. Multi-task semi-supervised learning has the same goal: unlabeled data are used to improve the performance of supervised learning, while the different supervised tasks share useful information to help each other. Based on the nature of each task, multi-task semi-supervised learning can be classified into two categories: multi-task semi-supervised classification and multi-task semi-supervised regression. For multi-task semi-supervised classification, a method proposed in [63,64] follows the task-clustering approach and clusters the tasks based on a relaxed Dirichlet process, while within each task a random walk is used to exploit useful information contained in the unlabeled data. Unlike [63,64], a semi-supervised multi-task regression method is proposed in [65], where each task adopts a Gaussian process, unlabeled data are used to define the kernel function, and the Gaussian processes of all the tasks share a common prior on the kernel parameters.

MULTI-TASK ACTIVE LEARNING
The setting of multi-task active learning, where each task has a small amount of labeled data and a large amount of unlabeled data in the training set, is almost identical to that of multi-task semi-supervised learning. However, unlike multi-task semi-supervised learning, which exploits the information contained in the unlabeled data, in multi-task active learning each task selects informative unlabeled data and queries an oracle to actively acquire their labels. Hence the criterion for selecting unlabeled data is the main research focus in multi-task active learning [66-68].
Specifically, two criteria are proposed in [66] to make sure that the selected unlabeled instances are informative for all the tasks instead of only one task. In contrast, in [67], where the learner in each task is a supervised latent Dirichlet allocation model, the selection criterion for unlabeled data is the expected error reduction. Moreover, [68] proposes a selection strategy that trades off the learning risk of a low-rank MTL model based on trace-norm regularization against a confidence bound similar to those used in multi-armed bandits.
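As a simple illustration of selecting instances that are informative for all the tasks, the following sketch scores each unlabeled instance by the sum of its predictive entropies over the tasks. This is a generic uncertainty heuristic chosen for illustration, not one of the criteria of [66-68], and it assumes that all tasks share one unlabeled pool and that each task's learner outputs class probabilities.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the last axis; clipping avoids log(0).
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def select_query(probs_per_task):
    """Index of the unlabeled instance to send to the oracle.

    probs_per_task: (T, n, C) array, the class probabilities that each
    of the T task learners assigns to the n pooled instances.
    An instance scores high only if many tasks are uncertain about it.
    """
    scores = entropy(probs_per_task).sum(axis=0)  # shape (n,)
    return int(np.argmax(scores))
```

Summing over tasks means that an instance on which one task is maximally uncertain can still lose to an instance on which every task is moderately uncertain.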

MULTI-TASK REINFORCEMENT LEARNING
Inspired by behaviorist psychology, reinforcement learning studies how to take actions in an environment to maximize the cumulative reward. It performs well in many applications, a representative example being AlphaGo, which beat human players at the game of Go. When environments are similar, different reinforcement learning tasks can use similar policies to make decisions, which motivates multi-task reinforcement learning [69-73].
Specifically, in [69], each reinforcement learning task is modeled by a Markov decision process (MDP) and the MDPs of all the tasks are related via a hierarchical Bayesian infinite mixture model. In [70], each task is characterized via a regionalized policy and a Dirichlet process is used to cluster tasks. In [71], the reinforcement learning model for each task is a Gaussian process temporal-difference value function model and a hierarchical Bayesian model relates the value functions of different tasks. In [72], the value functions of different tasks are assumed to share sparse parameters, and the multi-task feature selection method with the $\ell_{2,1}$ regularization [8] and the MTFL method [5] are applied to learn all the value functions simultaneously. In [73], an actor-mimic method, which combines deep reinforcement learning and model compression techniques, is proposed to learn policy networks for multiple tasks.
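The regularizer used for multi-task feature selection can be made concrete with a short sketch. It assumes a parameter matrix `W` with one row per feature and one column per task (a layout chosen here for illustration); the norm sums the Euclidean norms of the rows, so penalizing it pushes entire rows to zero and the tasks end up sharing the same selected features. The proximal operator shown is the standard row-wise soft-thresholding.

```python
import numpy as np

def l21_norm(W):
    # Sum over features (rows) of the l2 norm across tasks (columns).
    return float(np.linalg.norm(W, axis=1).sum())

def prox_l21(W, tau):
    """Proximal operator of tau * l21_norm: row-wise soft-thresholding.

    Rows whose l2 norm is at most tau become exactly zero, which
    removes the corresponding feature from every task at once.
    """
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(row_norms, 1e-12))
    return scale * W
```

For example, with `tau = 1` a row `[3, 4]` (norm 5) is shrunk to `[2.4, 3.2]`, while a row `[0.3, 0.4]` (norm 0.5) is set to zero.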

MULTI-TASK ONLINE LEARNING
When the training data of multiple tasks arrive sequentially, traditional MTL models cannot handle them, but multi-task online learning can, as shown in some representative works [74-79].
Specifically, in [74,75], where different tasks are assumed to have a common goal, a global loss function, defined as a combination of the individual losses on each task, measures the relations between tasks, and several online MTL algorithms are proposed by using absolute norms for the global loss function. In [76], the proposed online MTL algorithms model task relations by placing constraints on the actions taken for all the tasks. In [77], online MTL algorithms that adopt perceptrons as the basic model and measure task relations based on shared geometric structures among tasks are proposed for multi-task classification problems. In [78], a Bayesian online algorithm is proposed for a multi-task Gaussian process that shares kernel parameters among tasks. In [79], an online algorithm is proposed for the MTRL method [44] by updating the model parameters and the task covariance together.
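To give a feel for how an online learner can share information across tasks, here is a minimal perceptron-style sketch. It is not the algorithm of any of the works cited above: it assumes a fixed, user-supplied, invertible task-interaction matrix (the `interaction` argument is hypothetical), and on each mistake it updates every task's weight vector in proportion to the corresponding entry of the inverse of that matrix.

```python
import numpy as np

class OnlineMultiTaskPerceptron:
    """Perceptron whose mistake-driven updates are shared across tasks."""

    def __init__(self, n_tasks, n_features, interaction):
        self.W = np.zeros((n_tasks, n_features))  # one row per task
        # Column `task` of the inverse says how strongly a mistake on
        # that task should move each of the other tasks.
        self.A_inv = np.linalg.inv(interaction)

    def predict(self, task, x):
        return 1 if self.W[task] @ x >= 0 else -1

    def update(self, task, x, y):
        # Process one example (x, y) with y in {-1, +1} for one task.
        y_hat = self.predict(task, x)
        if y_hat != y:
            self.W += y * np.outer(self.A_inv[:, task], x)
        return y_hat
```

With the identity as interaction matrix the tasks are learned independently; off-diagonal entries make a mistake on one task also adjust the related tasks, which is the sense in which the tasks help each other as data arrive.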

MULTI-TASK MULTI-VIEW LEARNING
In some applications such as computer vision, each data point can be described by different feature representations; one example is image data, whose features include SIFT and wavelet, to name just a few. In this case, each feature representation is called a view, and multi-view learning, a learning paradigm in machine learning, is proposed to handle such data with multiple views. Similar to supervised learning, each multi-view data point is usually associated with a label. Multi-view learning aims to exploit useful information contained in multiple views to further improve the performance over supervised learning, which can be considered a single-view learning paradigm. As a multi-task extension of multi-view learning, multi-task multi-view learning [80,81] aims to exploit multiple multi-view learning problems to improve the performance on each multi-view learning problem by leveraging useful information contained in related tasks.
Specifically, in [80], the first multi-task multi-view classifier is proposed to utilize the task relatedness based on common views shared by tasks and view consistency among views in each task. In [81], different views in each task achieve consensus on unlabeled data, and different tasks are learned by exploiting a priori information as in [38] or by learning task relations as the MTRL method does.

PARALLEL AND DISTRIBUTED MTL
When the number of tasks is large, directly applying a multi-task learner may incur a high computational cost. Modern computers offer substantial computational capacity through multi-CPU or multi-GPU architectures, so we can make use of these computing facilities to devise parallel MTL algorithms that accelerate the training process. In [82], a parallel MTL method is devised to solve a subproblem of the MTRL model [44], which also occurs in many regularized methods belonging to the task-relation learning approach. Specifically, this method utilizes the FISTA algorithm to design a surrogate function that is decomposable with respect to the tasks, so that it can be optimized in parallel to speed up the learning process. Moreover, three loss functions, namely the hinge, $\epsilon$-insensitive and square losses, are studied in [82], making this parallel method applicable to both classification and regression problems in MTSL.
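The following sketch illustrates why a decomposable surrogate helps. It is a simplified primal illustration, not the method of [82]: it assumes the square loss, a fixed symmetric positive semi-definite matrix `omega_inv` standing in for the inverse task covariance, and FISTA-style momentum (with no non-smooth term, the proximal step is the identity). At each iteration the quadratic surrogate separates over tasks, so every task's step uses only that task's data and can run in a separate worker.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_accelerated_mtl(Xs, ys, omega_inv, lam=0.1, n_iter=200):
    """Minimize sum_i 0.5*||X_i w_i - y_i||^2 + 0.5*lam*tr(W omega_inv W^T).

    W has shape (d, T) with one column per task.
    """
    T, d = len(Xs), Xs[0].shape[1]
    # Upper bound on the Lipschitz constant of the gradient.
    L = (max(np.linalg.norm(X, 2) ** 2 for X in Xs)
         + lam * np.linalg.norm(omega_inv, 2))
    W = np.zeros((d, T))
    V, t = W.copy(), 1.0

    def task_step(i, V, coupling):
        # Minimizer of the i-th block of the quadratic surrogate at V:
        # a gradient step that touches only task i's data.
        grad_i = Xs[i].T @ (Xs[i] @ V[:, i] - ys[i]) + coupling[:, i]
        return V[:, i] - grad_i / L

    with ThreadPoolExecutor() as pool:
        for _ in range(n_iter):
            coupling = lam * V @ omega_inv  # the only term mixing tasks
            cols = list(pool.map(lambda i: task_step(i, V, coupling),
                                 range(T)))
            W_new = np.column_stack(cols)
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            V = W_new + ((t - 1.0) / t_new) * (W_new - W)  # momentum
            W, t = W_new, t_new
    return W
```

Only the small coupling term is computed centrally; the per-task steps, which dominate the cost when each task has many instances, are independent of one another.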
In some cases, the training data of different tasks may reside on different machines, which makes it difficult for conventional MTL models to work; moving all the training data to one machine is possible but incurs additional transmission and storage costs. A better option is to devise distributed MTL models that can directly operate on data distributed over multiple machines. In [83], a distributed algorithm is proposed based on a debiased lasso model; by learning one task per machine, this algorithm achieves efficient communication.

APPLICATIONS OF MULTI-TASK LEARNING
Several areas, including computer vision, bioinformatics, health informatics, speech, natural language processing, web applications and ubiquitous computing, use MTL to boost the performance of their respective applications. In this section, we review some related works.

Computer vision
The applications of MTL in computer vision can be divided into two categories: image-based and video-based applications.

Speech and natural language processing
Applications of MTL in speech include speech synthesis [114,115]; those in natural language processing include joint learning of six NLP tasks (i.e. part-of-speech tagging, chunking, named entity recognition, semantic role labeling, language modeling and semantically related words) [116], multi-domain sentiment classification [117], multi-domain dialog state tracking [21], machine translation [118], syntactic parsing [118], and microblog analysis [119,120].

Web applications
Web applications based on MTL include learning to rank in web searches [121], web search ranking [122], multi-domain collaborative filtering [123], behavioral targeting [124], and conversion maximization in display advertising [125].

Ubiquitous computing
Applications of MTL in ubiquitous computing include stock prediction [126], multi-device localization [127], the inverse dynamics problem for robotics [128,129], estimation of travel costs on road networks [130], travel-time prediction on road networks [131], and traffic-sign recognition [132].

THEORETICAL ANALYSIS
Learning theory, an area of machine learning, studies the theoretical aspects of learning models, including MTL models. In the following, we introduce some representative works. Theoretical analysis in MTL mainly focuses on deriving generalization bounds for MTL models. The generalization performance of MTL models on unseen test data is the main concern in MTL and in machine learning in general. However, since the underlying data distribution is difficult to model, the generalization performance cannot be computed directly; instead, a generalization bound is used to provide an upper bound on the generalization error.
The first generalization bound for MTL is derived in [133] for a general MTL model. Many subsequent studies analyze the generalization bounds of different MTL approaches, including e.g. [7,134] for the feature transform approach, [135] for the feature selection approach, [24,135-138] for the low-rank approach, [136] for the task-relation learning approach, and [138] for the dirty approach.
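Schematically, and only as an illustration of the general shape such results take (not the statement of any specific theorem in the works cited above), a generalization bound for MTL says that, with probability at least $1-\delta$ over the training data, the average expected loss over the tasks is at most the average empirical loss plus a complexity term. The symbols $m$ (number of tasks), $P_i$ (data distribution of task $i$), $l$ (loss), $f_i$ (learner of task $i$), $y^i_j$ (label of $x^i_j$), $\mathcal{H}$ (hypothesis class) and the complexity term $C$ are notation assumed here for illustration.

```latex
\frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{(x,y)\sim P_i}\bigl[\, l(f_i(x), y) \,\bigr]
\;\le\;
\frac{1}{m}\sum_{i=1}^{m} \frac{1}{n_i}\sum_{j=1}^{n_i} l\bigl(f_i(x^i_j),\, y^i_j\bigr)
\;+\; C(\mathcal{H}, m, n_1, \dots, n_m, \delta)
```

The analyses typically differ in how the complexity term depends on the hypothesis class and on the numbers of tasks and of training instances per task.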

CONCLUSIONS
In this paper, we give an overview of MTL. Firstly, we give a definition of MTL. After that, different settings of MTL are presented, including multi-task supervised learning, multi-task unsupervised learning, multi-task semi-supervised learning, multi-task active learning, multi-task reinforcement learning, multi-task online learning and multi-task multi-view learning. For each setting, we introduce its representative models. Then parallel and distributed MTL models, which can help speed up the learning process, are discussed. Finally, we review the applications of MTL in various areas and present theoretical analyses for MTL.
Recently deep learning has become popular in many applications and several deep models have been devised for MTL. Almost all the deep models just share hidden layers for different tasks; this way of sharing knowledge among tasks is very useful when all the tasks are very similar, but when this assumption is violated, the performance will significantly deteriorate. We think one future direction for multi-task deep models is to design more flexible architectures that can tolerate dissimilar tasks and even outlier tasks. Moreover, the deep-learning, task-clustering and multi-level approaches lack theoretical foundations and more analyses are needed to guide the research in these approaches.