The Information Complexity of Learning Tasks, their Structure and their Distance

We introduce an asymmetric distance in the space of learning tasks, and a framework to compute their complexity. These concepts are foundational to the practice of transfer learning, ubiquitous in Deep Learning, whereby a parametric model is pre-trained for a task, and then used for another after fine-tuning. The framework we develop is intrinsically non-asymptotic, capturing the finite nature of the training dataset, yet it allows distinguishing learning from memorization. It encompasses, as special cases, classical notions from Kolmogorov complexity, Shannon, and Fisher Information. However, unlike some of those frameworks, it can be applied easily to large-scale models and real-world datasets. It is the first framework to explicitly account for the optimization scheme, which plays a crucial role in Deep Learning, in measuring complexity and information.


Introduction
The widespread use of Deep Learning is due in part to its flexibility: One can pre-train a deep neural network for a task, say finding cats and dogs in images, and then fine-tune it for another, say detecting tumors in a mammogram, or controlling a self-driving vehicle. Sometimes it works. So far, however, it has not been possible to predict whether such a transfer learning practice will work, and how well. Even the most fundamental questions are still unanswered: How far apart are two tasks? In what space do tasks live? What is the complexity of a learning task? How difficult is it to transfer from one task to another? In this paper, we lay the foundations for answering these questions.
Summary of contributions 1. We introduce a distance between learning tasks (Section 4), where tasks are represented by finite datasets of input data and discrete output classes (labels). Each task can have a different number of labels and a different number of samples. The distance behaves properly with respect to composition and inclusion relations (Corollary 4.5) and can handle corner cases, such as learning random labels, where other distances such as Kolmogorov's fail (Example 3.4). The distance is asymmetric by design, as it may be easier to learn one task starting from the solution to another, than the reverse (Definition 4.1).
Our framework leverages classical results from Kolmogorov's complexity theory [13], classical statistical inference, and information theory, and relates to recent theoretical frameworks for Deep Learning including the Information Bottleneck Principle [12], with important differences that we outline throughout the paper.

Preliminaries and nomenclature
In supervised learning, one is given a finite (training) dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of $N$ samples, where $x_i \in X$ is the input data (e.g., an image) and $y_i \in Y$ is the output (e.g., a label). The goal is to learn (i.e., estimate the parameters $w$ of) a model $p_w$ (a parametric function) that maps inputs $x$ to estimated outputs $\hat{y}$, so that some loss (or risk) is minimized on unseen (test) data. It is common to assume that $D$ consists of i.i.d. samples from some unknown distribution $p(x, y)$, which are used to assemble a loss function that is minimized with respect to the parameters $w$ so that $p_w(y|x)$ is close to the "true" posterior $p(y|x)$.
In this work however, we make no assumption on the data generation process, nor do we attempt to approximate the true posterior p(y|x), even when we consider datasets composed of samples from some unknown distribution. Instead, we adopt Kolmogorov's approach to learning the "structure" in the data, directly accounting for the finite sample size N , the approximation error in the loss L D , and the complexity of the model. The essential elements of Kolmogorov Complexity theory that we use are summarized in [13]. Deep neural networks (DNNs) are a particularly powerful and efficient class of parametrized models, obtained by successive compositions (layers) of linear multiplication by weight (parameter) matrices, and simple element-wise non-linear operations. In Deep Learning, the optimization scheme acts as an implicit regularizer in the loss. For a measure of complexity to be relevant to Deep Learning, therefore, it must take into account both the loss function and the optimization scheme. To the best of our knowledge, none of the classical measures do so.

Deep Neural Networks
Deep neural networks are a class of functions (models) that implement a successive composition (layers) of affine operations (where both the linear term and the offset are considered as weights) and a non-linearity (such as a saturation or rectification). One of the most common non-linearities is the rectified linear unit (ReLU), which leaves the positive part unchanged and zeroes the negative part. The first layer is thus of the form $x \mapsto h(Wx + b)$, where $(W, b)$ are the weights, or improperly the "weight vector", and $h(x)$ is defined as the component-wise maximum between $0$ and $x$. The output of the first layer is therefore given by $z = h(Wx + b)$. The second layer performs an operation of the same type, taking the output of the first layer as input, and with a different set of weights, and so on. The last layer produces a probability vector $p_w(y|x)$, with components $y \in \{1, \dots, K\}$, usually through a soft-max operation, i.e., $z \mapsto e^{z} / \sum_k e^{z_k}$. The $k$-th entry of this probability vector represents the probability $p_w(y = k|x)$ of the input $x$ being of class $y = k$, as assessed by the model.
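The layer composition described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the random weights, and the helper names `relu`, `softmax`, and `forward` are assumptions, not part of the framework.

```python
import numpy as np

def relu(x):
    # h(x): component-wise maximum between 0 and x
    return np.maximum(0.0, x)

def softmax(z):
    # Subtracting the maximum improves numerical stability and leaves the result unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, layers):
    """Apply ReLU layers in sequence; the last (W, b) pair feeds the soft-max."""
    z = x
    for W, b in layers[:-1]:
        z = relu(W @ z + b)
    W, b = layers[-1]
    return softmax(W @ z + b)

# Hypothetical two-layer network: 3 inputs, 5 hidden units, K = 4 classes
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 3)), np.zeros(5)),
          (rng.normal(size=(4, 5)), np.zeros(4))]
p = forward(rng.normal(size=3), layers)  # probability vector p_w(y|x)
```

The entries of `p` are non-negative and sum to one, so `p[k]` plays the role of $p_w(y = k|x)$.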
The learning criterion is given by maximum likelihood: $\hat{w} = \arg\min_w L_D(p_w)$, where $L_D(p_w)$ is the loss function (the negative log-likelihood), given by $L_D(p_w) = \sum_{i=1}^N -\log p_w(y_i|x_i)$. The loss function needs to be regularized, explicitly or implicitly, since the number of weights is typically larger than the number of samples in the training set. Divided by $N$, the loss function can be interpreted as an empirical approximation of the average cross-entropy $H_{p,p_w}(y|x)$, which is minimized when $p_w(y|x) = p(y|x)$.
The training process consists of minimizing the empirical cross-entropy using stochastic gradient descent (SGD). At every iteration, SGD makes a step in the direction of the (negative) gradient of L D , which is approximated using a random subset of the training set (a minibatch). The length of the step is a hyperparameter called the learning rate.
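As a minimal sketch of this training loop, the code below runs minibatch SGD on the summed cross-entropy of a linear soft-max model. The toy data, the learning rate `lr`, the batch size `batch`, and the function name `loss_and_grad` are all assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grad(W, X, y):
    """Summed cross-entropy of a linear soft-max model and its gradient in W."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(P[np.arange(n), y]).sum()
    P[np.arange(n), y] -= 1.0                     # dL/dlogits = p - onehot(y)
    return loss, P.T @ X

rng = np.random.default_rng(0)
N, d, K = 200, 2, 2
X = rng.normal(size=(N, d))
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # toy, linearly separable labels
W = np.zeros((K, d))
lr, batch = 0.1, 20                               # hypothetical hyperparameters

initial_loss, _ = loss_and_grad(W, X, y)
for step in range(300):
    idx = rng.choice(N, size=batch, replace=False)   # random minibatch
    _, g = loss_and_grad(W, X[idx], y[idx])
    W -= lr * g / batch                           # step along the negative gradient estimate
final_loss, _ = loss_and_grad(W, X, y)
```

With zero initial weights the model predicts uniformly, so `initial_loss` equals $N \log K$; the loss on the full training set then decreases as SGD proceeds.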

Complexity of a Learning Task
A learning task is specified by a dataset D. However, the same task could be specified by different datasets. Their complexity is not just related to the size of the input, the output, or the number of samples: the popular MNIST and CIFAR-10 classification tasks are similar on these counts, yet one is very easy to learn, the other is not. Instead, the complexity of a dataset is a function of two factors: The underlying structure shared among data points, as well as variability that individual points exhibit relative to the shared structure. This split is not unique: a dataset can have many different explanations, each leaving a different amount of residual variability. These two factors are captured in the following definition.
We define the complexity of $D$ as
$$C(D) = \min_{p} \, L_D(p) + K(p), \tag{1}$$
where $L_D(p) := \sum_{i=1}^N -\log p(y_i|x_i)$ is the empirical classification (cross-entropy) loss, and the minimum is over all possible computable probability distributions $p(y|x)$ of the label $y$ given the input $x$. By $K(p)$ we denote the Kolmogorov complexity of the distribution $p(y|x)$.
At first sight, $C(D)$ appears similar to a conditioned version of the two-part code in [13, Appendix II]:
$$C_K(D) = \min_{p} \, -\log p(\mathbf{y}|\mathbf{x}) + K(p), \tag{2}$$
where $\mathbf{y} = y_1, \dots, y_N$ and $\mathbf{x} = x_1, \dots, x_N$ are strings obtained by concatenating all labels and inputs of $D$, respectively. In Proposition 3.2.1, we recall that $C_K(D)$ coincides with the conditional Kolmogorov complexity $K(\mathbf{y}|\mathbf{x})$ of the string $\mathbf{y}$ given $\mathbf{x}$. Instead, in eq. (1) we only consider factorized distributions $p(\mathbf{y}|\mathbf{x}) = \prod_i p(y_i|x_i)$. This has major consequences for determining the complexity of a task: The distribution $p(\mathbf{y}|\mathbf{x})$ minimizing eq. (2) does not need to encode in $p$ all task-relevant information, as it can, at test time, extract any missing information from the training set $\mathbf{x}$ (cf. Proposition 3.2.3). Hence, $K(p)$ in eq. (2) alone would not be a valid measure of complexity. Instead, the distribution $p(y|x)$ minimizing eq. (1) can only access a single datum $x$ at test time; hence, all structure in the dataset has to be encoded in $p$.
Moreover, note that, unlike $C_K(D)$, $C(D)$ is invariant to permutations of the dataset. Suppose that a dataset $D$ for a binary classification task is ordered so that all negative samples $D_-$ precede the positive samples $D_+$. Then we would have $C_K(D) \le \log|D_-| + \log|D_+|$ regardless of the complexity of the task, as it suffices to encode the number of negative and positive samples to reproduce the string $\mathbf{y}$ exactly. However, we show in Proposition 3.2.3 that making $p(\mathbf{y}|\mathbf{x})$ permutation invariant does not yield a sensible measure of complexity of a task. To address permutation invariance, [8] proposes a definition that uses deterministic functions rather than probability distributions. The following proposition compares these definitions of complexity.
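Proposition 3.2 (Measures of complexity). Up to an additive term which does not depend on the dataset $D$, we have:
1. $C_K(D) = K(\mathbf{y}|\mathbf{x})$, the conditional Kolmogorov complexity of the concatenated labels $\mathbf{y}$ given the concatenated inputs $\mathbf{x}$.
2. $C_K(\pi(D)) \le C(D)$ for any permutation $\pi(D)$ of $D$.
3. For every $C > 0$, there is a dataset $D$ such that $C(D) \ge C$ and $C_K(\pi(D)) = O(1)$ for any permutation $\pi(D)$. Therefore, $C(D)$ is not simply the (average) complexity $C_K(\pi(D))$ of encoding some permutation of $D$.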

When
Assuming the data are sampled i.i.d. from a computable probability distribution p, the following proposition characterizes the complexity of the dataset. In particular, it shows that, asymptotically, the complexity of the dataset is given by the noise in the labels and the complexity of the distribution generating the data. How to ignore the effect of noise on the complexity, and what happens in the non-asymptotic regime, is central to Kolmogorov's framework, which we will build upon in the next sections.
Proposition 3.3. Fix a probability distribution $p(x, y)$ on $X \times Y$, and assume that $p(y|x)$ is computable. If $D$ is a collection of $N$ i.i.d. samples $(x_i, y_i) \sim p(x, y)$, then: 1. The expected value of $C(D)$ satisfies $N H_p(y|x) \le \mathbb{E}[C(D)] \le N H_p(y|x) + K(p)$, where $H_p(y|x)$ is the conditional entropy of $y$ given $x$.
2. For any $\epsilon > 0$ there is $N_0$ such that, with probability $1 - \epsilon$, for any $N \ge N_0$ we have the equality $C(D) = L_D(p) + K(p)$, and $p$ is the only computable distribution for which the equality holds.
It is instructive to test our definition of complexity on a dataset of random labels, whereby each input is assigned a label at random. We will revisit this case often, as it challenges many of the extant theories of Deep Learning [15].
Example 3.4 (Complexity of random labels). Suppose that each input $x_i$ of the dataset $D$ is associated to a label $y_i \in Y$ sampled uniformly at random in a fixed finite set, so $p(y|x) = 1/|Y|$ has a constant complexity. Under the same assumptions of Proposition 3.2.4, the expected value of $C(D)$ is $N \log|Y|$ up to an additive constant. Since $C(D) \le N \log|Y| + O(1)$ for any such $D$, the complexity of a "typical" dataset with random labels is approximately $N \log|Y|$.
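The loss side of this example can be checked numerically. The sketch below draws random labels and compares the loss of the uniform predictor with that of the best constant predictor; the sizes `N` and `K` are arbitrary, and the Kolmogorov complexity term is not computed (it is not computable in general), so this only illustrates why no simple predictor does noticeably better than $N \log|Y|$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 10                          # hypothetical dataset size and number of classes
y = rng.integers(0, K, size=N)           # labels drawn independently of the inputs

# Loss of the constant uniform predictor p(y|x) = 1/K, in nats
loss_uniform = N * np.log(K)

# Loss of the best constant predictor, which uses the empirical label frequencies
freq = np.bincount(y, minlength=K) / N
loss_freq = -np.log(freq[y]).sum()

gap = loss_uniform - loss_freq           # a few nats at most for typical draws
```

Since the labels carry no structure, the only way to reduce the loss substantially below `loss_uniform` is to store individual labels in the model.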
In a sense, learning random labels is a very complex task: They cannot be predicted from the input, so the model is forced to memorize them ("overfitting"). Accordingly, C(D) in Example 3.4 is very high. However, by construction there is no structure in the data, so the model cannot generalize to unseen data. In other words, this is a complex memorization task, but a trivial learning task, as there is nothing to learn from the data. We would like the definition of complexity to reflect this, and to differentiate between learning and memorization. Another important aspect not captured by this definition of complexity is the role of performance in learning the structure in the data. For example, one can train a trivial model to distinguish airplanes from fireplaces by counting the number of blue pixels. It will not be very precise. To achieve a small error, however, one must learn what makes an airplane different from a fireplace even if the latter is painted blue.

Structure Function of a Task
The trade-off between the loss achievable by a solution $p$ on a dataset $D$ and its complexity is captured by the Structure Function [13]:
$$S_D(t) = \min_{K(p) \le t} L_D(p). \tag{4}$$
It is a decreasing function that reaches zero for sufficiently high complexity, depending on the task: As we increase complexity, the loss decreases rapidly while simple models correctly classify easy samples. After all shared structure is captured, the only way to further reduce the loss is to memorize individual samples, leading to the worst possible trade-off of one nat of complexity for each unit of decrease in loss. Eventually, every dataset enters this linear (overfitting) regime. For random labels, this happens at the outset. The lower bound for $S_D(t)$ can be achieved by memorizing the label of $t / \log|Y|$ data points.
The Structure Function of a dataset cannot be computed in general ([13, Section VII]). In Section 5 we introduce a generalized version that can be computed; Figure 2 shows the result on common datasets. The predicted fast decrease in the loss as complexity increases is followed by the asymptotic linear phase. The sharp phase transition to the linear regime is clearly visible as a function of the loss: As the parameter $\beta$ weighting complexity increases, a plateau is reached that depends on the task. Note that for random labels the loss decreases linearly as expected (left).

Task Lagrangian and Minimal Sufficiency
The constrained optimization problem in the definition of the Structure Function (eq. (4)) has an associated Lagrangian $L_D(p) + \beta K(p)$, where $\beta$ is a Lagrange multiplier that trades off the complexity of the model $K(p)$ with the fidelity $L_D(p)$. If we take the minimum over $p$, we obtain a family of complexity measures parametrized by $\beta$:
$$C_\beta(D) = \min_{p} \, L_D(p) + \beta K(p). \tag{5}$$
As a function of $\beta$, this is the Legendre transform of the Structure Function $S(t)$. To minimize $L_D(p) + \beta K(p)$, we can increase the complexity $K(p)$ of the model for as long as each additional nat of complexity reduces the loss by more than the constant $\beta$ we have selected.
If $p^*$ is a minimizer of eq. (5) for $\beta = 1$, the corresponding Kolmogorov complexity $t^* = K(p^*)$ is the value at which the Structure Function $S(t)$ reaches a linear regime. Thus, the special case $\beta = 1$ marks the transition to overfitting, and is related to Kolmogorov's notion of Minimal Sufficient Statistic [13]. Since we are using Kolmogorov's complexity, $\beta = 1$ is the worst possible trade-off.
Proposition 3.6. Given a task $D$, let $\beta^*$ be the largest $\beta$ for which $C_\beta(D)$ is not realized by a constant distribution $p(y|x) = p(y)$. Then $\beta^* \ge 1$, and $\beta^* = 1$ if $D$ is a typical dataset with random labels.
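The behavior of the Lagrangian on random labels can be illustrated with a toy family of models. The sketch assumes, purely for illustration, that a model memorizing $m$ labels costs $m \log|Y|$ nats of complexity (ignoring the cost of identifying the inputs and any constant overhead) and predicts uniformly on the remaining points; `task_lagrangian` is a hypothetical helper name.

```python
import numpy as np

N, K = 1000, 10
m = np.arange(N + 1)                      # number of memorized labels
complexity = m * np.log(K)                # assumed cost: log K nats per stored label
loss = (N - m) * np.log(K)                # remaining points are predicted uniformly

def task_lagrangian(beta):
    """Return (min of loss + beta * complexity, memorized count of the simplest minimizer)."""
    values = loss + beta * complexity
    best = np.flatnonzero(np.isclose(values, values.min()))
    return values.min(), int(m[best[0]])  # ties are broken towards the lowest complexity

for beta in (0.5, 1.0, 2.0):
    print(beta, task_lagrangian(beta))
```

Within this toy family, memorizing every label is optimal for $\beta < 1$, the constant predictor is optimal for $\beta > 1$, and at $\beta = 1$ all members tie, so the simplest one (no memorization) is selected.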
As we have seen for random labels, a dataset may be complex and yet exhibit little underlying structure. We say that a distribution $p(y|x)$ is a Kolmogorov sufficient statistic of $D$ if it minimizes $C(D)$. It is minimal if it also minimizes $K(p)$ among all sufficient statistics, that is, if it is the smallest statistic that is able to solve the task optimally. The rationale is that the smallest statistic that solves a task should not squander resources by memorizing nuisance variability. Rather, it should only capture the important information shared among the data. This is shown in the following example.
Example 3.7. For random labels, both the distribution p(y|x) that memorizes all the labels in the dataset, and the uniform distribution p(y|x) = 1/|Y|, are sufficient statistics. However, only the latter is minimal, since K(p) is a constant which does not depend on D. There is no structure to be extracted from a dataset of random labels.
The level of complexity of a model is an important design parameter the practitioner wishes to control. Rather than seeking minimal sufficient statistics, we explore the entire trade space, by introducing the notion of β-sufficiency.
Definition 3.8. Given a dataset D, define a β-sufficient statistic of D as a probability distribution p(y|x) such that C β (D) = L D (p) + βK(p). We say that p(y|x) is a β-minimal sufficient statistic if it also minimizes K(p) among all β-sufficient statistics.
Notice that, for β = 1, Definition 3.8 reduces to minimal sufficiency in the sense of Kolmogorov.

Asymmetric Distance between Tasks
We now introduce a one-parameter family of distances between tasks. The hope is for them to correlate with the ease of transfer learning. Since it is easier to learn a simple task from a complex one, it is desirable for the distance to be asymmetric.
Definition 4.1. The asymmetric distance between tasks $D_1$ and $D_2$ at level $\beta$ is
$$d_\beta(D_1 \to D_2) = \max_{p_1} \min_{p_2} K(p_2 | p_1),$$
where $p_i$ varies among all $\beta$-minimal sufficient statistics of $D_i$.
The intuition behind this definition is the following: for a task D 2 to be close to D 1 , every β-minimal sufficient statistic p 1 of D 1 should be close to some β-minimal sufficient statistic p 2 of D 2 . Then, every optimal model p 1 of D 1 can be fine-tuned to some optimal model p 2 of D 2 .
Lemma 4.2. The asymmetric distance $d_\beta$ satisfies the following properties:
We now derive a characterization of the distance between tasks based on the complexity of their composition, amenable to generalization in Section 6. Denote by $D_1 \sqcup D_2$ the disjoint union of two datasets $D_1$ and $D_2$, defined as
$$D_1 \sqcup D_2 = \{((i, x), y) \mid (x, y) \in D_i, \ i = 1, 2\}.$$
Notice that an index $i$ is added to the input, in order to recognize the original dataset. A desirable property for a distance between tasks would be that $d_\beta(D_1 \sqcup D_2 \to D_1) = O(1)$: Indeed, a model that performs well on $D_1 \sqcup D_2$ should be easily fine-tuned to a model that performs well on $D_1$ alone. Adding the index $i$ in the definition of $D_1 \sqcup D_2$ is essential, as we can see in the following example.
Example 4.3. Let $D$ be a typical dataset with random labels, and let $D_1 \subseteq D$ be the set of data points $(x, y) \in D$ satisfying some property of Kolmogorov complexity $t \gg 0$. Then the whole dataset $D$ has a trivial structure, whereas $D_1$ has a complicated structure. If both $|D_1|$ and $|D \setminus D_1|$ are large, by Proposition 3.3 the Kolmogorov complexity of a minimal sufficient statistic of $D_1$ is $t$, and the complexity of a minimal sufficient statistic of $D$ is $O(1)$.
We now prove that, under the hypotheses of Proposition 3.3.2, the property $d_\beta(D_1 \sqcup D_2 \to D_1) = O(1)$ holds.
Theorem 4.4. Suppose that $D_1$ and $D_2$ are obtained by sampling from two fixed distributions $p_1$ and $p_2$ on $X \times Y$, such that $p_1(y|x)$ and $p_2(y|x)$ are computable. Then, with high probability and for where $p_{12}$ varies among the $\beta$-minimal sufficient statistics of $D_{12}$ and $p_1$ varies among those of $D_1$.
We now have a way of comparing different learning tasks, at least in theory. The asymmetric distance d β (D 1 → D 2 ) allows us to quantify how difficult it is to learn a new task D 2 given a solution to D 1 . However, quantities defined in terms of Kolmogorov complexity are difficult to handle in practice, and may behave well only in an asymptotic regime. In the next section, we introduce a generalization of the framework developed so far that can be instantiated for a particular model class such as deep networks.

Information in the Model Parameters
Whatever structure or "information" was captured from the dataset, it ought to be measurable from the model parameters, since they are all we have left after training. As we will see in the next section, this intuition is faulty, as how we converge to a set of parameters (i.e., the optimization algorithm) also affects what information we can extract from the data. For now, we focus on generalizing the theory in the previous section with an eye towards computability. Although most of our arguments are general, we focus on deep neural networks (DNNs) as a model class. They can have millions of parameters, so measuring their information can be non-trivial.
One way to compute information in the parameters is to measure their coding length at some level of precision, independent of the particular task. This is suboptimal, as only a small subset of the weights of a trained neural network matters: Imagine changing a certain component of the weights, and observing no change in the loss. Arguably, that weight "contains no information" about the dataset. For the purpose of storing the trained model, that weight could be replaced with any constant, or randomized each time the network is used. The loss landscape has small curvature in the coordinate direction corresponding to that weight. On the other hand, imagine changing the least significant bit of another component of the weights and noticing a large increase in the loss. That weight is very "informative," so it is useful to store its value with high precision.
With these observations in mind, we allow the weights to be encoded with some uncertainty, through a probability distribution $Q(w|D)$ which depends on the dataset $D$. For example, Dirac's Delta $Q(w|D) = \delta_{w^*}$ corresponds to an exact encoding of the weight vector $w^*$. If we fix a reference "prior" distribution $P(w)$, [7] shows that the labels $y$ can be reconstructed from the input $x$ and the prior $P(w)$, by using
$$\mathbb{E}_{w \sim Q(w|D)}[L_D(p_w)] + \mathrm{KL}(Q(w|D) \,\|\, P(w))$$
additional nats. This expression resembles the right-hand side of eq. (1) in capturing a trade-off between fidelity and complexity. Here, Kolmogorov complexity has been replaced by the Kullback-Leibler (KL) divergence $\mathrm{KL}(Q(w|D) \,\|\, P(w))$, which we call the information in the parameters of the model. This leads to the following new definition of complexity.
Definition 5.1. The complexity of the task $D$ at level $\beta$, using the posterior $Q(w|D)$ and the prior $P(w)$, is given by
$$C_\beta(D; P, Q) = \mathbb{E}_{w \sim Q(w|D)}[L_D(p_w(y|x))] + \beta\, \mathrm{KL}(Q(w|D) \,\|\, P(w)). \tag{6}$$
The second term, $\mathrm{KL}(Q(w|D) \,\|\, P(w))$, measures the information in the parameters of the model. We refer to $\mathbb{E}_{w \sim Q(w|D)}[L_D(p_w(y|x))]$ as the (expected) reconstruction error of the label under the hypothesis $Q(w|D)$.
We call $Q(w|D)$ a "posterior" as it is a distribution decided after seeing the dataset $D$. There is no implied Bayesian interpretation, as $Q(w|D)$ can be any distribution. Similarly, $P(w)$ is a "prior" because it is picked before the dataset is seen. Depending on the choice, this expression can be computed in closed form or estimated (Section 5.1). For instance, when $Q(w|D) = \delta_{w^*}$, the expression reduces to the length of a two-part code for $D$ using the model class $p_w(y|x)$. However, Definition 5.1 is more general and can be extended to the continuous case, or to cases where there is a bona fide distribution, as in variational inference and Bayesian Neural Networks. Another fundamental difference is that, while eq. (1) measures the complexity in terms of the best solution obtainable by the model class (in that case, the class of computable probability distributions), the complexity $C_\beta(D; P, Q)$ takes into account both the particular model class and the training algorithm, i.e., the map $A : D \to Q(w|D)$, as we shall see in Section 6.2.
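For Gaussian choices of prior and posterior, the KL term has a closed form and the expected loss can be estimated by sampling. The sketch below is one possible way to evaluate the complexity of Definition 5.1 under those assumptions; the diagonal covariances, the quadratic stand-in for $L_D$, and the names `kl_diag_gaussians` and `complexity` are hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), in nats."""
    return 0.5 * np.sum(var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                        - 1.0 + np.log(var_p / var_q))

def complexity(loss_fn, mu_q, var_q, prior_var, beta, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_{w~Q}[L_D(p_w)] + beta * KL(Q || P), with P centered at 0."""
    rng = np.random.default_rng(seed)
    ws = mu_q + np.sqrt(var_q) * rng.normal(size=(n_samples, mu_q.size))
    expected_loss = np.mean([loss_fn(w) for w in ws])
    kl = kl_diag_gaussians(mu_q, var_q, np.zeros_like(mu_q), prior_var)
    return expected_loss + beta * kl

# Hypothetical stand-in for L_D: a quadratic bowl around w_star
w_star = np.array([1.0, -2.0])
quad = lambda w: np.sum((w - w_star) ** 2)
c = complexity(quad, w_star, np.full(2, 0.01), np.full(2, 100.0), beta=1.0)
```

A narrow posterior (small `var_q`) keeps the expected loss low but increases the KL term, which is the trade-off that $\beta$ controls.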

Relation with Kolmogorov, Shannon, and Fisher Information
Since the choice of prior P (w) in Definition 5.1 is arbitrary, we investigate three special cases: The "universal prior" of all computable distributions; an "adapted prior" which relies on a probability distribution over datasets; an uninformative prior, agnostic of the dataset.
We start with the first case, which provides a link between Definition 5.1 and the framework of Section 3. For a given weight vector $w$, we define the universal prior $P(w) = \frac{1}{Z} e^{-K(w)}$, where $Z$ is a normalization constant. This can be interpreted as follows: for every $w$, choose a minimal program that outputs $w$, and assign it a probability which decreases exponentially in terms of the length of the program.
Proposition 5.2 (Kolmogorov Complexity of the Weights). Let $P(w)$ be the universal prior, and let $Q(w|D) = \delta_{w^*}$ be a Dirac delta. Then the information in the weights equals the Kolmogorov complexity of the weights $K(w^*)$, up to a constant.
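A short derivation makes the statement of Proposition 5.2 explicit. It assumes that the weights range over a discrete (countable) set, so that the Dirac delta is a point mass and the convention $0 \log 0 = 0$ applies:

```latex
\begin{aligned}
\mathrm{KL}\big(\delta_{w^*} \,\|\, P\big)
  &= \sum_{w} \delta_{w^*}(w) \log \frac{\delta_{w^*}(w)}{P(w)}
   = -\log P(w^*) \\
  &= -\log\!\Big(\tfrac{1}{Z}\, e^{-K(w^*)}\Big)
   = K(w^*) + \log Z .
\end{aligned}
```

The term $\log Z$ does not depend on $w^*$ or on the dataset, which is the constant mentioned in the proposition.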
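We now turn to the second case, which provides a link with Shannon mutual information.
Proposition 5.3 (Shannon Information in the Weights). Assume the dataset $D$ is sampled from a distribution $\pi(D)$, and let the outcome of training on $D$ be described by a distribution $Q(w|D)$. Then the prior $P(w)$ minimizing the expected complexity $\mathbb{E}_D[C_\beta(D; P, Q)]$ is the marginal $P(w) = \mathbb{E}_D[Q(w|D)]$, and the expected information in the weights is $\mathbb{E}_D[\mathrm{KL}(Q(w|D) \,\|\, P(w))] = I(w; D)$.
Here $I(w; D)$ is Shannon's mutual information between the weights and the dataset, where the weights are seen as a (stochastic) function of the dataset given by the training algorithm (e.g., SGD).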
Note that, in this case, the prior $P(w)$ is optimal given the choice of the training algorithm (i.e., the map $A : D \to Q(w|D)$) and the distribution of training datasets $\pi(D)$. However, the distribution $\pi(D)$ is generally unknown, as we are often given a single dataset to train on. Even if it were known, computing the marginal distribution $\mathbb{E}_D[Q(w|D)]$ over all possible datasets would not be realistic, as it is high-dimensional and has complex interactions between different components. Nevertheless, it is interesting that the information in the parameters specializes to Shannon's mutual information in the weights [4].
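On a small discrete example, the link between the marginal prior and mutual information can be verified directly. In the sketch below the distribution over datasets `pi` and the posteriors `Q` are made-up numbers, and `kl` and `expected_kl` are hypothetical helper names.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for discrete distributions, with the convention 0 log 0 = 0."""
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

# Toy setting: 3 possible datasets and 4 possible weight configurations
pi = np.array([0.5, 0.3, 0.2])                  # distribution over datasets
Q = np.array([[0.70, 0.10, 0.10, 0.10],         # Q(w|D), one row per dataset
              [0.10, 0.70, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])

def expected_kl(prior):
    """E_D[ KL(Q(w|D) || prior) ] under the dataset distribution pi."""
    return sum(pi[d] * kl(Q[d], prior) for d in range(len(pi)))

marginal = pi @ Q                               # E_D[Q(w|D)]
joint = pi[:, None] * Q                         # joint distribution of (D, w)
mutual_info = np.sum(joint * np.log(joint / (pi[:, None] * marginal[None, :])))

kl_with_marginal = expected_kl(marginal)
kl_with_uniform = expected_kl(np.full(4, 0.25))
```

Here `kl_with_marginal` coincides with `mutual_info`, while any other prior, such as the uniform one, gives a larger expected KL divergence.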
The third case, namely an uninformative prior with a Gaussian posterior, is the most practical, and provides a link to the Fisher Information Matrix and the learning dynamics of common optimization algorithms such as SGD.
Theorem 5.4 (Fisher Information in the Weights). Choose an isotropic Gaussian prior $P(w) \sim N(0, \lambda^2 I)$. Let the posterior $Q(w|D)$ be also Gaussian: $Q(w|D) \sim N(w^*, \Sigma)$, where $w^*$ is a local minimum of the cross-entropy loss. Then, for $\lambda \to \infty$, we have that:
• the covariance $\Sigma^*$ which minimizes $C_\beta(D; P, Q)$ tends to $\frac{\beta}{2} H^{-1} = \frac{\beta}{2N} F^{-1}$ (this is in accordance with the Cramér-Rao bound);
• the information in the weights is given by $\mathrm{KL}(Q \,\|\, P) = \frac{1}{2} \log|H| + \frac{k}{2} \log\frac{2\lambda^2}{\beta} - \frac{k}{2} + o(1) = \frac{1}{2} \log|F| + \frac{k}{2} \log\frac{2\lambda^2 N}{\beta} - \frac{k}{2} + o(1)$.
Here $H$ is the Hessian of the cross-entropy loss at $w^*$, $F$ is the Fisher Information Matrix, and $k$ is the number of weights.
Recalling that the Fisher Information $F$ measures the local curvature, this theorem confirms the qualitative discussion of the beginning of Section 5: The optimal covariance $\Sigma^* \propto H^{-1}$ gives high variability to the directions of low curvature, which are "less informative," whereas it gives low variability to the "more informative" directions of high curvature. The Fisher Information describes the information contained in the weights about the dataset. In Section 6 we discuss how to get there.
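The relation between the Hessian $H$ of the summed loss and the Fisher Information $F$ can be checked on a model where it holds exactly. The sketch below uses logistic regression, for which $H = N F$ at any weight vector (for a DNN the relation is only expected to hold approximately, near a minimum); the data and the helper names `grad`, `hessian_fd`, and `fisher` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y):
    """Gradient of the summed cross-entropy of logistic regression."""
    return X.T @ (sigmoid(X @ w) - y)

def hessian_fd(w, X, y, eps=1e-5):
    """Hessian of the summed loss, by central finite differences of the gradient."""
    k = w.size
    H = np.zeros((k, k))
    for j in range(k):
        e = np.zeros(k)
        e[j] = eps
        H[:, j] = (grad(w + e, X, y) - grad(w - e, X, y)) / (2 * eps)
    return H

def fisher(w, X):
    """F = (1/N) sum_i E_{y ~ p_w(y|x_i)}[s s^T], with score s = (y - p_i) x_i."""
    p = sigmoid(X @ w)
    F = np.zeros((w.size, w.size))
    for x_i, p_i in zip(X, p):
        for y_val, prob in ((0.0, 1.0 - p_i), (1.0, p_i)):
            s = (y_val - p_i) * x_i
            F += prob * np.outer(s, s)
    return F / len(X)

rng = np.random.default_rng(0)
N = 500
X = rng.normal(size=(N, 2)) * np.array([3.0, 0.3])   # one strong and one weak input direction
w = np.array([0.5, -0.5])
y = (rng.random(N) < sigmoid(X @ w)).astype(float)

H, F = hessian_fd(w, X, y), fisher(w, X)
variances = np.diag(np.linalg.inv(H))     # proportional to the diagonal of Sigma*
```

The weight attached to the weak input direction has low curvature and therefore receives the larger variance, in line with $\Sigma^* \propto H^{-1}$.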

Connections with the PAC-Bayes Bound
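The Lagrangian $C_\beta(D; P, Q)$ admits another interpretation as an upper bound on the test error, as shown by the PAC-Bayes test error bound. Assume the dataset $D$ is sampled i.i.d. from a distribution $p(y, x)$, and assume that the per-sample loss used to train is bounded by $L_{\max} = 1$ (we can reduce to this case by clipping and rescaling the loss). For any fixed $\beta > 1/2$, prior $P(w)$, and weight distribution $Q(w|D)$, with probability at least $1 - \delta$ over the sample of $D$, we have:
$$L_{\text{test}}(Q) \le \frac{1}{N \left(1 - \frac{1}{2\beta}\right)} \Big[ \mathbb{E}_{w \sim Q(w|D)}[L_D(p_w)] + \beta\, \mathrm{KL}(Q \,\|\, P) + \beta \log\frac{1}{\delta} \Big],$$
where $L_{\text{test}}(Q)$ is the expected per-sample test error that the model incurs using the weight distribution $Q(w|D)$.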
Hence, we see that minimizing the Lagrangian $C_\beta(D; P, Q)$ can be interpreted as minimizing an upper bound on the test error of the model, rather than directly minimizing the train error. This is in accordance with the intuition developed earlier, that minimizing $C_\beta(D; P, Q)$ forces the model to capture the structure of the data. It is also interesting to consider the following bound on the expectation over the sampling of $D$ ([11, Theorem 4]):
$$\mathbb{E}_D[L_{\text{test}}(Q)] \le \frac{1}{N \left(1 - \frac{1}{2\beta}\right)} \, \mathbb{E}_D\Big[ \mathbb{E}_{w \sim Q(w|D)}[L_D(p_w)] + \beta\, \mathrm{KL}(Q \,\|\, P) \Big],$$
where $L_{\text{test}}(Q)$ denotes the expected per-sample test error. As we have seen in Proposition 5.3, for the optimal choice of prior $P$ minimizing the bound, we have $\mathbb{E}_D[\mathrm{KL}(Q \,\|\, P)] = I(w; D)$. Hence, the Shannon Information that the weights of the model have about the dataset $D$ is the measure of complexity that gives (in expectation) the strongest generalization bound. This has also been noted in [4]. In [6], a non-vacuous generalization bound is computed for DNNs, using (non-centered and non-isotropic) Gaussian prior and posterior distributions.
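To get a feeling for the numbers involved, the sketch below evaluates a high-probability PAC-Bayes bound. It assumes the bound has the form $\frac{1}{N(1 - 1/(2\beta))} \big[ \mathbb{E}_{w \sim Q}[L_D(p_w)] + \beta\,\mathrm{KL}(Q \,\|\, P) + \beta \log(1/\delta) \big]$ with the per-sample loss clipped to $[0, 1]$; this form, the function name `pac_bayes_bound`, and the input values are assumptions for illustration.

```python
import math

def pac_bayes_bound(train_loss_sum, kl, n, beta, delta):
    """Upper bound on the expected per-sample test loss (per-sample loss in [0, 1]).

    train_loss_sum is the summed training loss under Q, kl is KL(Q || P) in nats.
    """
    if beta <= 0.5:
        raise ValueError("the bound requires beta > 1/2")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    lagrangian = train_loss_sum + beta * kl          # the Lagrangian C_beta(D; P, Q)
    return (lagrangian + beta * math.log(1.0 / delta)) / (n * (1.0 - 1.0 / (2.0 * beta)))

# Hypothetical numbers: 1000 samples, summed train loss 100, 50 nats in the weights
bound = pac_bayes_bound(train_loss_sum=100.0, kl=50.0, n=1000, beta=1.0, delta=0.05)
```

Under these assumptions the bound grows linearly with the information in the weights, so a solution that memorizes more (larger KL) has a weaker guarantee at equal training loss.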

Generalized distance, reachability, and learnability of tasks
Unlike $C_\beta(D)$, the definition of complexity $C_\beta(D; P, Q)$ in eq. (6) properly captures the complexity of a dataset for a particular model class and training algorithm. Motivated by this, we now define a distance between datasets which is tailored to the model. Throughout this section, fix a parametrized model class $p_w(y|x)$, a reference prior $P(w)$, and a class $\mathcal{Q}$ of posterior distributions $Q(w|D)$. The case we are most interested in is that of DNNs, with an uninformative prior and a Gaussian posterior (see Section 5.1). Our starting point is to generalize Kolmogorov's Structure Function framework. Consider the following generalized Structure Function:
$$S_D(t) = \min_{\mathrm{KL}(Q \,\|\, P) \le t} \mathbb{E}_{w \sim Q(w|D)}[L_D(p_w)].$$
Here the minimum is taken among all posterior distributions $Q(w|D)$ in the chosen class $\mathcal{Q}$. Similarly to what we have seen in Section 3.2, this minimization problem has $C_\beta(D; P, Q)$ as its associated Lagrangian. We say that $Q(w|D)$ is a $\beta$-sufficient statistic if it minimizes $C_\beta(D; P, Q)$. It is a $\beta$-minimal sufficient statistic if it also minimizes $\mathrm{KL}(Q(w|D) \,\|\, P(w))$. Motivated by Corollary 4.5, we then introduce the following distance.
Definition 6.1. The asymmetric distance between tasks $D_1$ and $D_2$ at level $\beta$ is where $Q_{12}(w|D)$ is a $\beta$-minimal sufficient statistic for $D_1 \sqcup D_2$, and $Q_1(w|D)$ is a $\beta$-minimal sufficient statistic for $D_1$.
While more general and amenable to computation than the distance of Definition 4.1, this is not an actual distance as it lacks several of the properties in Lemma 4.2. Nonetheless, we will show that d β does indeed capture the difficulty of fine-tuning from one dataset to another using specific model families (such as DNNs) and training algorithms (SGD), and empirically shows good correlation with other distances (e.g., taxonomical), when those are defined.

Reachability of a Task through a Local Learning Algorithm
Until now, we have only considered global minimization problems, where we aim to find the best solution satisfying some optimal trade-off. However, many learning algorithms (e.g., SGD) are local: Starting from the current solution, they take a small step to greedily minimize some objective function. While this greatly reduces the complexity of the algorithm, it raises the question of which conditions allow such an algorithm to recover an optimal solution to the task.
Given a distribution $Q(w|D) \in \mathcal{Q}$, denote by $L_D(Q)$ the expected loss $\mathbb{E}_{w \sim Q(w|D)}[L_D(p_w)]$. Fix a metric $d$ on $\mathcal{Q}$ such that both $L_D(Q)$ and $\mathrm{KL}(Q \,\|\, P)$ are continuous, as functions $\mathcal{Q} \to \mathbb{R}$. In this way, the Lagrangian $C_\beta(D; P, Q) = L_D(Q) + \beta\, \mathrm{KL}(Q \,\|\, P)$ is continuous in the joint variable $(Q, \beta)$. In the case of DNNs, where $P$ is the uninformative prior and $\mathcal{Q}$ is the class of the Gaussian distributions $Q(w|D) \sim N(w^*, \Sigma)$, we can take for example $d$ as the Wasserstein distance, or the Euclidean distance between the parameters $(w^*, \Sigma)$.
Definition 6.2 ($\epsilon$-local learning algorithm). Fix $\beta \ge 0$. We say that a step is $\epsilon$-local if, starting from a given statistic $Q_0$, it finds the statistic $Q$ that minimizes $C_\beta(D; P, Q)$ and such that $d(Q, Q_0) \le \epsilon$. We say that a learning algorithm is $\epsilon$-local if it only takes $\epsilon$-local steps.
In the limit $\epsilon \to 0$, this reduces to gradient descent on the Lagrangian $C_\beta$. Notice however that this is not the same as performing gradient descent on the cross-entropy loss $L_D$, unless $\beta = 0$. Indeed, minimizing $L_D$, unlike minimizing $C_\beta$ (Section 5.2), gives no guarantees on the performance on test data, as the learning algorithm could simply memorize the dataset $D$. We will show in the next section that a DNN trained with SGD can actually be interpreted as a local learning algorithm minimizing $C_\beta$. A natural question is: Does an $\epsilon$-local learning algorithm always recover a global minimum of $C_\beta$?
In practice, when training a DNN with SGD, one starts with a high learning rate and anneals it throughout the training. In our framework, this corresponds to starting with a high value of $\beta$, and gradually decreasing it to a final value $\bar\beta$. This helps avoid degenerate solutions, because the model starts by favouring structure over memorization.
Definition 6.3. An $\epsilon$-local learning algorithm with annealing is a learning algorithm that alternates $\epsilon$-local steps (that change the distribution $Q$) and annealing steps (that can decrease the value of $\beta$).
Notice that, if the annealing is slow, then an $\epsilon$-local learning algorithm with annealing can be regarded as a discrete gradient descent with respect to the joint variable $(Q, \beta)$. The following result gives a sufficient condition for an $\epsilon$-local learning algorithm with annealing to recover the global minimum of $C_{\bar\beta}$.
Proposition 6.4. Fix an annealing schedule $\beta_0 \ge \beta_1 \ge \dots \ge \beta_n = \bar\beta$. Suppose that, for every global minimizer $Q_i$ of $C_{\beta_i}$, there exists a global minimizer $Q_{i+1}$ of $C_{\beta_{i+1}}$ with $d(Q_i, Q_{i+1}) \le \epsilon$. Then, an $\epsilon$-local learning algorithm with annealing that starts from a global minimizer $Q_0$ of $C_{\beta_0}$, and performs one $\epsilon$-local step after each annealing step, computes a global minimizer of $C_{\bar\beta}$.
We say that a task that satisfies the conditions of Proposition 6.4 for some annealing schedule is $\epsilon$-connected. In general, we cannot guarantee that an $\epsilon$-local learning algorithm will not get stuck in a local minimum of $C_{\bar\beta}$. For example, this is bound to happen if there is no sequence $(Q_0, \beta_0), (Q_1, \beta_1), \dots, (Q_n, \beta_n) = (\bar Q, \bar\beta)$ such that $\bar Q$ is a global minimum of $C_{\bar\beta}$, $d(Q_i, Q_{i+1}) \le \epsilon$, and $C_{\beta_i}(D; P, Q_i) \ge C_{\beta_{i+1}}(D; P, Q_{i+1})$ for all $i$. In particular, this happens if there is no continuous path of global minima, which is an intrinsic property of the dataset $D$ with respect to the function class.
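The role of the annealing schedule can be illustrated on a one-dimensional toy problem. In the sketch below the statistic is a scalar `q`, the loss is $(q - 2)^2$, and the information term is $q^2$, so the global minimizer at level $\beta$ is $2 / (1 + \beta)$; these choices, the grid search inside each ball, and the names `local_step` and `anneal` are assumptions made only for illustration.

```python
import numpy as np

def lagrangian(q, beta):
    # Toy stand-ins: loss (q - 2)^2 and "information" q^2 for a scalar statistic q
    return (q - 2.0) ** 2 + beta * q ** 2

def local_step(q0, beta, eps, grid=2001):
    """Best candidate within distance eps of q0, found by grid search over the ball."""
    candidates = np.linspace(q0 - eps, q0 + eps, grid)
    return candidates[np.argmin(lagrangian(candidates, beta))]

def anneal(betas, eps):
    q = 2.0 / (1.0 + betas[0])          # start from the global minimizer at beta_0
    for beta in betas[1:]:              # annealing step, followed by one eps-local step
        q = local_step(q, beta, eps)
    return q

betas = np.geomspace(10.0, 0.1, 50)     # slow schedule: consecutive minimizers are closer than eps
q_slow = anneal(betas, eps=0.1)
q_fast = anneal(np.array([10.0, 0.1]), eps=0.1)   # a single abrupt drop in beta
q_star = 2.0 / (1.0 + betas[-1])        # global minimizer at the final beta
```

With the slow schedule each new global minimizer lies within `eps` of the previous one, so `q_slow` tracks the minimizers down to `q_star`; with the abrupt schedule a single local step cannot cover the distance and `q_fast` stays far from `q_star`. The toy Lagrangian is convex, so this illustrates the connectedness condition of the proposition rather than the presence of spurious local minima.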

SGD as a local learning algorithm
So far, we have introduced an abstract notion of distance between tasks. However, we have not yet shown that this notion is useful for DNNs, or that indeed SGD is a local algorithm in the sense of Definition 6.2.
In [2], a step in this direction is taken. It is shown that, to a first approximation, the probability of SGD converging to a configuration $w_f$ solving task $D$ in a given time $t_f - t_0$, starting from a configuration $w_0$, is given by eq. (10) when using the prior $P(w) \sim N(0, \lambda^2 I)$, the optimal Gaussian posterior $Q(w|D) \sim N(w^*, \Sigma^*)$, and $\beta = 2\lambda^2\gamma T$. Here $\Delta f(w) := f(w_f) - f(w_0)$, $\gamma$ is the weight decay coefficient used to train the network, and $T \propto \eta/B$ is a temperature parameter that depends on the learning rate $\eta$ and the batch size $B$. From eq. (10) we see that, with high probability, SGD takes steps that minimize the effective potential $C_\beta(D; P, Q) = L_D(w) + \beta\,\mathrm{KL}(Q \,\|\, P)$ (static part), while trying to minimize the distance traveled in the given time (dynamic part). Hence, SGD can be seen as a stochastic implementation of a local learning algorithm in the sense of Definition 6.2.
In particular, this has the counterintuitive implication that SGD does not greedily optimize the loss function, as it may seem from its update equation. Rather, on average, the complexity of the recovered solution affects the dynamics of the optimization process, and changes the objective. Therefore, the complexity we have introduced in Section 5 is not simply an abstract means to study different trade-offs, but rather plays a concrete role in the learning dynamics.
Since SGD can be seen as a local learning algorithm, annealing the learning rate during training (i.e., annealing the parameter β ∝ T ) can be interpreted as a way of learning the structure of a task by slowly sweeping the Structure Function. Hence, SGD with annealing adaptively changes the complexity of the model, even if its dimensionality is fixed at the outset. This creates non-trivial dynamics that turn out to be beneficial to avoid memorization. It also points to the importance of the initial transient of learning during the annealing, a prime area for investigation beyond the asymptotics of the particular value of the weights, or the local structure of the residual around them, at convergence.
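As a rough illustration of how a learning-rate schedule translates into an annealing schedule for beta, the sketch below evaluates beta = 2 * lambda^2 * gamma * T with T = c * eta / B. The proportionality constant c is not specified in the text and is set to 1 here purely as an assumption; the hyperparameter values and the step-decay schedule are hypothetical.

```python
def effective_beta(lr, batch_size, weight_decay, prior_std, c=1.0):
    """beta = 2 * lambda^2 * gamma * T, with temperature T = c * lr / batch_size.

    `c` is an assumed proportionality constant: the text only states that
    T is proportional to lr / batch_size.
    """
    temperature = c * lr / batch_size
    return 2.0 * prior_std ** 2 * weight_decay * temperature


def step_decay(initial_lr, factor, every, num_epochs):
    """Hypothetical step-decay schedule: multiply lr by `factor` every `every` epochs."""
    return [initial_lr * factor ** (epoch // every) for epoch in range(num_epochs)]


lrs = step_decay(initial_lr=0.1, factor=0.1, every=30, num_epochs=90)
betas = [effective_beta(lr, batch_size=128, weight_decay=5e-4, prior_std=1.0)
         for lr in lrs]

# The learning-rate decay induces a non-increasing schedule beta_0 >= ... >= beta_n.
print(betas[0], betas[30], betas[60])
```

Under these assumptions, each tenfold drop in the learning rate lowers beta tenfold, so a standard step-decay schedule is an instance of the annealing schedules considered in Proposition 6.4.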
Another consequence is that, when the task is not $\epsilon$-connected, SGD may fail to recover the optimal structure of the data. One may wonder if there are examples of simple tasks which are severely non-$\epsilon$-connected, and if SGD actually fails to solve them.
A particularly interesting example comes from a biological inspiration. In [3], an example is reported of two tasks (a dataset of blurred images and one of the same high-resolution images) which are apparently close to each other, but such that a network trained on the first one cannot properly be fine-tuned to solve the second one. This peculiar phenomenon has analogues in biology, and it is shown in [3] to be indeed closely correlated with changes of the Fisher Information Matrix. Within our framework, this can be interpreted as biasing the initial optimization process toward a minimum of $C_\beta(D', P, Q)$ for the blurred task $D'$, which, while also being a local minimum of the Lagrangian $C_\beta(D, P, Q)$ of the original task, is not close to any of the global minima. Hence, the local learning algorithm, rather than learning the correct structure, starts performing sub-optimal greedy choices. Even though the static term of eq. (10) is high (i.e., the distance is small), this example indicates that the dynamic term can give a non-trivial contribution. This phenomenon is observed across different architectures and local optimization schemes. It opens the door to further investigations into the dynamics of differential learning.
Figure 1: (Left) Estimated distance matrix between several tasks. Each entry shows the distance $d_\beta(D_1 \to D_2)$ going from the column task $D_1$ to the row task $D_2$. Going from a complex task like CIFAR-100 to a simpler task (like MNIST) is always easier than the converse. Subtasks are close to the full tasks (e.g., the subsets of "artificial" and "natural" objects of CIFAR-100 are both close to CIFAR-100). Similar tasks on the domain of small black and white images (Fashion MNIST, MNIST, Letters) are also closer together than to natural images. Inverting the colors on Fashion images leads to a very similar task (I-Fashion), as expected. (Right) t-SNE embedding of several organism species classification tasks and clothing attributes classification tasks based on their distance, reproduced from [1], which uses a similar definition of task distance based on the Fisher Information. Intuitively, similar tasks cluster together. In the case of species classification, this largely follows the taxonomical structure.

Empirical validation
The theoretical framework developed in this paper has tangible ramifications. A robust version of the asymmetric distance between tasks described here has been used in [1] to create a metric embedding of hundreds of real-world datasets (Figure 1, right). The structure of the embedding shows good accordance with the complexity of the tasks (Figure 2), as well as with intuitive notions of similarity and with other metrics, such as taxonomical distance, when those are available (Figure 1). The metric embedding allows tackling important meta-tasks, such as predicting the performance of a pre-trained model on a new dataset, given its performance on past similar datasets. This greatly simplifies the process of recommending a pre-trained model as the starting point for fine-tuning a new task, which otherwise would require running a large experiment to determine the best model. The results are also shown to improve compared to pre-training a model on the most complex available task, i.e., full ImageNet classification.
Figure 2: Simpler datasets can still be fit for a high $\beta$, since models of low complexity can already correctly classify the data. On the other hand, more complex datasets need a very low $\beta$ to fit the data: In particular, random labels have the worst trade-off. The position of the phase transition depends on the complexity of the data distribution, and is mostly independent of the dataset size: the randomly labeled datasets all transition at a similar point, despite the difference in size.
In [3], it is shown that there are tasks close to each other (with respect to our asymmetric distance) which cannot be easily fine-tuned. Indeed, there are two nearly identical tasks (classifying CIFAR images, or a slightly blurred version of them) which turn out to be unreachable from one another: Once pre-trained on the blurred data, no matter how much additional training occurs, performance on the non-blurred data remains sub-optimal.
Besides these independent validations, we also report some additional experiments that illustrate the concepts introduced in this paper. In Figure 2 (right), we plot the loss as a function of $\beta$. As predicted by Proposition 3.6, the Lagrangian $C_\beta(D; P, Q)$ exhibits sharp phase transitions at a critical value $\beta^*$, when the model transitions from fitting the dataset with a trivial uniform distribution to actually fitting the labels. Regardless of their size, datasets of random labels transition around the same value $\beta^*$. The other datasets transition at a higher value, which depends on their complexity. For example, a complex dataset such as CIFAR-100 transitions at a much lower $\beta^*$ than a simple dataset such as MNIST. Notice that, in this experiment, the critical value for random labels is not $\beta^* = 1$: This is because the complexity is computed using an uninformative prior (see Section 5.1), and not the universal prior as in Proposition 3.6.
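The phase transition can be reproduced in a deliberately simplified setting, sketched below. Assume that only two candidate models are available: a trivial one that predicts the uniform distribution over k labels (loss N log k, zero complexity) and one that fits the labels exactly (zero loss, complexity K nats). The Lagrangian then selects the fitting model exactly when beta is below beta* = N log k / K. The complexity values are hypothetical and are not taken from the experiments of Figure 2; for random labels, the fitting model is assumed to cost N log k nats, i.e., pure memorization.

```python
import math


def critical_beta(num_samples, num_classes, fit_complexity):
    """beta* at which 'fit the labels' and 'predict uniform' have equal Lagrangian.

    Uniform predictor:  loss = N * log(k), complexity = 0.
    Fitting predictor:  loss = 0,          complexity = fit_complexity (nats).
    """
    return num_samples * math.log(num_classes) / fit_complexity


def selected_model(beta, num_samples, num_classes, fit_complexity):
    """Return which of the two candidate models minimizes loss + beta * complexity."""
    uniform_cost = num_samples * math.log(num_classes)
    fit_cost = beta * fit_complexity
    return "fit" if fit_cost < uniform_cost else "uniform"


N, k = 1000, 10
# Random labels: no structure, so fitting costs about N * log(k) nats (memorization).
random_complexity = N * math.log(k)
# Structured labels: a hypothetical short description of 50 nats suffices.
structured_complexity = 50.0

print(critical_beta(N, k, random_complexity))      # 1.0
print(critical_beta(N, k, structured_complexity))  # about 46.05
```

In this toy model the critical value for random labels does not depend on N, while a dataset with a short description keeps being fit at much larger values of beta, which mirrors the qualitative behaviour described above.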

Discussion and Open Problems
The modern practice of Deep Learning is rich with success stories where high-dimensional parametric models are trained on large datasets, and then adapted (fine-tuned) to specific tasks. This process requires considerable effort and expertise, and few tools are available to predict whether a given pre-trained model will succeed when fine-tuned for a different task. We have started developing a language to reason about transfer learning in the abstract, and analytical tools that allow one to predict the success of transfer learning.
Figure 3: Correlation between test error and the trace of the Fisher Information Matrix on several tasks (reproduced from [1]). Horizontal axis: L1 norm of task embedding; vertical axis: test error on task (%). The Fisher Information Matrix emerges as a complexity measure when using an uninformative prior (Section 5.1). This plot shows that the FIM trace correlates with the error obtained on the task, and so the FIM is indeed a sensible measure of the complexity of a task.
The first step is to properly define the tasks and the space they live in. We take a minimalistic approach, and identify a task with a finite dataset. The second step is to endow the space of tasks with a metric. This is non-trivial, since different datasets can have a different cardinality and dimension, and one needs to capture the fact that a simple task is usually "closer" to a complex one than vice-versa (in the sense that it is easier to fine-tune a model from a complex task to a simpler one). Thus, a notion of complexity of a learning task needs to be defined first.
We introduce a notion of complexity and a notion of distance for learning tasks that are very general, and encompass well-known analogues in Kolmogorov complexity theory and Information Theory as special cases. They combine characteristics of Kolmogorov's framework (focusing on finite datasets rather than asymptotics) with the intuitive notions and ease of computation of Information Theory. We use deep neural networks to compute information quantities. On one hand, this provides a convenient way of instantiating our general theory. On the other hand, it allows one to measure the complexity of a Deep Network, and to reason about generalization (learning vs. memorization) using a non-asymptotic language, in terms of quantities that are measurable from finite datasets.
Our theory exposes interesting connections between Deep Learning, Complexity Theory, Information Theory, and PAC-Bayes theory. These connections represent a fertile ground for further theoretical investigation.
Much work is yet to be done. The (static) distance we introduce only gives a lower bound to the feasibility of transfer learning with deep neural networks, and dynamics plays an important role as well. In particular, there are tasks that are very close, yet there is no likely path between them, so fine-tuning typically fails. This is observed broadly across different architectures and optimization schemes, but also across different species in biology, pointing to fundamental complexity and information phenomena yet to be fully unraveled.

A Proofs
Proof of Proposition 3.2. (1) Let $p$ be such that $C_K(D) = L_D(p) + K(p)$. We can compress $y$ using $p(y|x)$, for example with an arithmetic code of length $L_D(p) = -\log p(y|x)$. A program that outputs $y$ given $x$ then only needs to encode the distribution $p(y|x)$ and the code for $y$, requiring $L_D(p) + K(p) + O(1)$ nats, so $K(y|x) \leq C(D)$. For the opposite inequality, let $h$ be the program that witnesses $K(y|x)$ and let $p_h(y|x) = \delta_{h(x),y}$. Then $K(p_h) = |h| = K(y|x)$ and $L_D(p_h) = 0$, which gives the opposite inequality.
(2) Clearly, $C_K(D) \leq C(D)$, as $C(D)$ minimizes over a smaller subset of distributions. Since $C(D)$ is permutation invariant, we have $C(\pi(D)) = C(D)$. Hence, $C_K(\pi(D)) \leq C(\pi(D)) = C(D)$.
(3) Fix a function $f$ such that $K(f) \geq C$, and let $h$ be a program for $f$. Now, consider the dataset $D$ that consists of $N$ data points labeled by $f$ together with one "special" data point whose input contains an encoding of $h$. Here, the first bit of each input is added in order to recognize the "special" data point. We have $C_K(\pi(D)) = O(1)$ for any permutation $\pi$, because the concatenation of the input data of $D$ contains an encoding of $h$. On the other hand, let $D'$ be the dataset obtained from $D$ by removing the special data point. Then $C(D') = K(f) \geq C$ for $N$ sufficiently large.
(4) Let $f : X \to Y$ be a function such that $f(x_i) = y_i$ for every $(x_i, y_i) \in D$. Consider the probability distribution $p(y|x)$ defined by $p(f(x)|x) = 1$ for every $x \in X$. Then we have that

To prove the equality, let $h : \{x_1, \dots, x_N\} \to \mathbb{N}$ be the bijective function provided by the oracle. Now, create a list $A$ of codes so that $A[h(x_i)]$ contains the encoding of $y_i$ constructed using the distribution $p(y|x_i)$. The length of the prefix code is $-\log p(y|x_i) + 1$ (we need a prefix code so that we can concatenate all the codes). Now, given the distribution $p$, we can construct a function $f$ such that $y_i = f(x_i)$ as follows: Given $x_i$, compute $h(x_i)$, read the code $A[h(x_i)]$, and decode it using the distribution $p(y|x_i)$ to obtain the correct $y_i$.
Lemma A.1. Fix a probability distribution $p(x, y)$ on $X \times Y$, and assume that $p(y|x)$ is computable. Suppose that $D$ is a collection of $N$ i.i.d. samples $(x_i, y_i) \sim p(x, y)$. Let $A_x = \{(x_i, y_i) \in D \mid x_i = x\}$, and let $\hat p(y|x) = \frac{1}{|A_x|} \sum_{(x_i, y_i) \in A_x} \delta_{y_i, y}$ be the maximum likelihood estimate (MLE) of $p(y|x)$. Then, for every $\epsilon > 0$ there exists $c > 0$ such that, in the limit $N \to \infty$, we have

Proof. Consider $p(y|x)$ and $\hat p(y|x)$ as vectors of size $|X \times Y|$. Expand $L_D(p)$ at $\hat p$:
$$L_D(p) - L_D(\hat p) = \nabla_p L_D(\hat p) \cdot (p - \hat p) + \tfrac{1}{2} (p - \hat p)^T \, \nabla_p^2 L_D(p^*) \, (p - \hat p),$$
where $p^*$ is on the line connecting $p$ and $\hat p$. The first term is zero, since the MLE $\hat p$ is by definition a minimum of $L_D$. Now, recall that the MLE converges to the real distribution as $\sqrt{N}(\hat p - p) \to N(0, F(p)^{-1})$, where $F(p)$ is the Fisher Information Matrix computed in $p$, and ∇ 2 p L D (p) < N (∇ 2 p L D (p) + c ). Let $a > 0$ be such that, with probability $1 - \epsilon$, p − p < a. Then,

Proof of Proposition 3.3. (i) By Shannon's coding theorem, the expected value of $K(y|x)$ is at least $N \cdot H_p(y|x)$. Then, the first inequality follows from part (1) of Proposition 3.2. If we use $p(y|x)$ itself in the definition of $C(D)$, we obtain $C(D) \leq L_D(p) + K(p(y|x))$. The expected value of $L_D(p)$ is $N \cdot H_p(y|x)$, so we obtain the second inequality.
(ii) We need to prove that, for any other distribution $p'$, we have: $L_D(p') + K(p') > L_D(p) + K(p)$.
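In fact, suppose that $L_D(p') + K(p') \leq L_D(p) + K(p)$. Then we have
$$L_D(p') - L_D(p) \leq K(p) - K(p').$$
Notice that we can lower-bound the LHS using the MLE estimator $\hat p$, which by definition minimizes $L_D$, so that $L_D(p') - L_D(p) \geq L_D(\hat p) - L_D(p)$. By Lemma A.1, we have $L_D(\hat p) - L_D(p) \geq -c$, hence $K(p') \leq K(p) + c$. By the central limit theorem, the LHS grows as $N \cdot E_x[\mathrm{KL}(p(y|x) \,\|\, p'(y|x))] + O(\sqrt{N})$, so we must have:
$$N \cdot E_x[\mathrm{KL}(p(y|x) \,\|\, p'(y|x))] + O(\sqrt{N}) \leq K(p) - K(p').$$
Therefore $E_x[\mathrm{KL}(p(y|x) \,\|\, p'(y|x))] \to 0$ as $N \to \infty$. On the other hand, we have that $E_x[\mathrm{KL}(p(y|x) \,\|\, p'(y|x))] \geq k > 0$ for some $k$, since there are only finitely many distributions $p' \neq p$ such that $K(p') \leq K(p) + c$. This is a contradiction.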
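The maximum likelihood estimate of p(y|x) used in Lemma A.1 and in the proof above is the table of empirical label frequencies for each input. The following sketch, on a small made-up dataset, computes it and checks that it attains a cross-entropy loss L_D no larger than that of another candidate distribution. The dataset, the alternative distribution, and the function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def mle_conditional(dataset):
    """Empirical conditional distribution: p_hat[x][y] = #{(x, y)} / #{x}."""
    counts = defaultdict(Counter)
    for x, y in dataset:
        counts[x][y] += 1
    return {
        x: {y: c / sum(ys.values()) for y, c in ys.items()}
        for x, ys in counts.items()
    }


def cross_entropy_loss(dataset, p):
    """L_D(p) = -sum_i log p(y_i | x_i), in nats (not normalized by N)."""
    return -sum(math.log(p[x][y]) for x, y in dataset)


# Toy dataset with two inputs and noisy binary labels (illustrative only).
D = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 1), ("b", 0)]

p_hat = mle_conditional(D)
p_other = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.5, 1: 0.5}}

print(p_hat)
print(cross_entropy_loss(D, p_hat), cross_entropy_loss(D, p_other))
```

Any other choice of conditional distribution gives a loss at least as large on D, which is the property the proof uses when it lower-bounds the left-hand side by the loss of the MLE.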
Proof of Lemma 4.2. Positivity follows trivially from the definition. The second property is also easy: $d_\beta(D \to D) \leq \max_p K(p|p) = O(1)$. We now prove the triangle inequality. In what follows, $p_i$ and $\hat p_i$ are always $\beta$-minimal sufficient statistics of $D_i$. Fix any $\hat p_1$. Choose $\hat p_2$ such that $K(\hat p_2|\hat p_1)$ is minimized. Then choose $\hat p_3$ such that $K(\hat p_3|\hat p_2)$ is minimized. Then

This holds for every $\hat p_1$. Taking the maximum over $\hat p_1$, we obtain the desired result.
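The chain of inequalities behind the triangle inequality can be sketched as follows. This is a reconstruction under two assumptions that are not restated in this appendix: that Definition 4.1 has the max-min form $d_\beta(D_1 \to D_2) = \max_{p_1} \min_{p_2} K(p_2|p_1)$ over $\beta$-minimal sufficient statistics, and that conditional Kolmogorov complexity composes as $K(c|a) \leq K(c|b) + K(b|a) + O(1)$. Here $q_1$ denotes the statistic of $D_1$ that is fixed in the proof, and $q_2$, $q_3$ the statistics of $D_2$ and $D_3$ chosen after it.

```latex
\begin{align*}
\min_{p_3} K(p_3 \mid q_1)
  &\leq K(q_3 \mid q_1) \\
  &\leq K(q_3 \mid q_2) + K(q_2 \mid q_1) + O(1) \\
  &\leq d_\beta(D_2 \to D_3) + d_\beta(D_1 \to D_2) + O(1).
\end{align*}
```

The last step uses that $q_2$ minimizes $K(\cdot \mid q_1)$ and $q_3$ minimizes $K(\cdot \mid q_2)$, so each term is bounded by the corresponding max-min distance. Taking the maximum over $q_1$ on the left-hand side then yields $d_\beta(D_1 \to D_3) \leq d_\beta(D_1 \to D_2) + d_\beta(D_2 \to D_3) + O(1)$.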
Proof of Theorem 4.4. For $(x, y) \in X \times Y$ and $i \in \{1, 2\}$, define $p_{12}(y|x, i) = p_i(y|x)$. Then $D_1 \sqcup D_2$ is effectively obtained by sampling from $p_{12}$. By Proposition 3.3, with high probability and for $|D_1|$ and $|D_2|$ sufficiently large, we have that $p_1$ and $p_{12}$ are the unique minimal sufficient statistics for $D_1$ and $D_1 \sqcup D_2$, respectively. By definition of $p_{12}$, we have that $K(p_1|p_{12}) = O(1)$. Then

Proof of Proposition 5.2. We have $\mathrm{KL}(Q(w|D) \,\|\, P(w)) = -\log(e^{-K(w^*)}/Z) = K(w^*) + \log(Z)$.
Proof of Proposition 5.3. For a fixed training algorithm $A : D \mapsto Q(w|D)$, we want to find the prior $P^*(w)$ that minimizes the expected complexity of the data:
$$E_{D \sim \pi(D)}[\mathrm{KL}(Q(w|D) \,\|\, P(w))] = E_{D \sim \pi(D)}[\mathrm{KL}(Q(w|D) \,\|\, Q(w))] + \mathrm{KL}(Q(w) \,\|\, P(w)).$$
Since the KL divergence is always non-negative, the optimal "adapted" prior is given by $P^*(w) = Q(w)$, i.e., the marginal distribution of $w$ over all datasets. Finally, by definition of Shannon's mutual information, we get $I(w; D) = \mathrm{KL}(Q(w|D)\,\pi(D) \,\|\, Q(w)\,\pi(D)) = E_{D \sim \pi(D)}[\mathrm{KL}(Q(w|D) \,\|\, Q(w))]$.
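The statement that the marginal Q(w) is the optimal prior can be checked numerically on a small discrete example. The sketch below is only an illustration: the two datasets, their probabilities, and the three-valued weight variable are hypothetical. It verifies that the expected KL divergence under any other prior exceeds the mutual information by exactly KL(Q(w) || P(w)).

```python
import math


def kl(q, p):
    """KL(q || p) in nats for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def expected_complexity(posteriors, dataset_probs, prior):
    """E_{D ~ pi}[ KL( Q(w|D) || P(w) ) ]."""
    return sum(pi_d * kl(q, prior) for q, pi_d in zip(posteriors, dataset_probs))


# Hypothetical setup: 2 equally likely datasets, weights w take 3 discrete values.
dataset_probs = [0.5, 0.5]
posteriors = [[0.8, 0.1, 0.1],   # Q(w | D_1)
              [0.1, 0.1, 0.8]]   # Q(w | D_2)

# Marginal Q(w) = sum_D pi(D) Q(w|D).
marginal = [sum(pi_d * q[j] for q, pi_d in zip(posteriors, dataset_probs))
            for j in range(3)]

# With the adapted prior P* = Q(w), the expected complexity is I(w; D).
mutual_information = expected_complexity(posteriors, dataset_probs, marginal)
other_prior = [1 / 3, 1 / 3, 1 / 3]
gap = expected_complexity(posteriors, dataset_probs, other_prior) - mutual_information

print(mutual_information)
print(gap, kl(marginal, other_prior))  # the excess equals KL( Q(w) || P(w) )
```

The excess over the mutual information is itself a KL divergence, hence non-negative, and it vanishes only when the prior equals the marginal.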
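Proof of Theorem 5.4. Since both $P(w)$ and $Q(w|D)$ are Gaussian distributions, the KL divergence can be written as
$$\mathrm{KL}(Q(w|D) \,\|\, P(w)) = \frac{1}{2} \Big[ \frac{\|\mu\|^2}{\lambda^2} + \frac{1}{\lambda^2} \mathrm{tr}(\Sigma) + k \log \lambda^2 - \log|\Sigma| - k \Big],$$
where $\mu$ and $\Sigma$ are the mean and covariance of $Q(w|D)$, and $k$ is the number of components of $w$. Let $w^*$ be a local minimum of the cross-entropy loss $L_D(p_w(y|x))$, and let $H$ be the Hessian of $L_D(p_w(y|x))$ in $w^*$. Set $\mu = w^*$. Assuming that a quadratic approximation holds in a sufficiently large neighborhood, we obtain
$$C_\beta(D; P, Q) = L_D(p_{w^*}(y|x)) + \mathrm{tr}(H \cdot \Sigma) + \frac{\beta}{2} \Big[ \frac{\|w^*\|^2}{\lambda^2} + \frac{1}{\lambda^2} \mathrm{tr}(\Sigma) + k \log \lambda^2 - \log|\Sigma| - k \Big].$$
The gradient with respect to $\Sigma$ is
$$\frac{\partial C_\beta}{\partial \Sigma} = H + \frac{\beta}{2} \Big[ \frac{1}{\lambda^2} I - \Sigma^{-1} \Big].$$
Setting it to zero, we obtain the minimizer $\Sigma^* = \frac{\beta}{2} \big( H + \frac{\beta}{2\lambda^2} I \big)^{-1}$. Recall that the Hessian of the cross-entropy loss coincides with the Fisher information matrix $F$ at $w^*$, because $w^*$ is a critical point [10]. Since $L_D(p_w(y|x))$, and hence $H$, is not normalized by the number of samples $N$, the exact relation is $H = N \cdot F$. Taking the limit for $\lambda \to \infty$, we obtain the desired result.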
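As a sanity check on the closed form for the optimal covariance, the sketch below works in the eigenbasis of H and restricts to covariances that share the eigenvectors of H, as the closed-form minimizer does. Under this assumption the part of C_beta that depends on the covariance separates into one-dimensional problems f(s) = h * s + (beta / 2) * (s / lambda^2 - log s), one per eigenvalue h of H. The numerical values and function names are arbitrary illustrations.

```python
import math


def sigma_star(h, beta, lam):
    """Eigenvalue of Sigma* = (beta/2) * (H + beta/(2*lam^2) * I)^{-1}
    along an eigen-direction of H with eigenvalue h."""
    return (beta / 2.0) / (h + beta / (2.0 * lam ** 2))


def objective(sigma, h, beta, lam):
    """Sigma-dependent part of C_beta along one eigen-direction."""
    return h * sigma + (beta / 2.0) * (sigma / lam ** 2 - math.log(sigma))


def gradient(sigma, h, beta, lam):
    """Derivative of `objective` in sigma: h + (beta/2) * (1/lam^2 - 1/sigma)."""
    return h + (beta / 2.0) * (1.0 / lam ** 2 - 1.0 / sigma)


beta, lam = 0.1, 2.0
for h in [0.5, 3.0, 40.0]:  # arbitrary positive Hessian eigenvalues
    s = sigma_star(h, beta, lam)
    # The gradient vanishes at sigma*, and nearby points have a higher objective.
    print(h, s, gradient(s, h, beta, lam),
          objective(s, h, beta, lam) < objective(1.1 * s, h, beta, lam),
          objective(s, h, beta, lam) < objective(0.9 * s, h, beta, lam))

# For a very broad prior (lam -> infinity), sigma* approaches beta / (2 * h).
print(sigma_star(3.0, beta, 1e6), beta / (2 * 3.0))
```

In the broad-prior limit the optimal variance along each direction is inversely proportional to the corresponding Hessian eigenvalue, so flat directions of the loss are assigned large posterior variance.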
Proof of Proposition 6.4. If $Q_i$ is a global minimizer of $C_{\beta_i}$, then it is at distance at most $\epsilon$ from a global minimizer $Q_{i+1}$ of $C_{\beta_{i+1}}$. Therefore, for $\beta = \beta_{i+1}$, an $\epsilon$-local step from $Q_i$ reaches a global minimizer $Q_{i+1}$ of $C_{\beta_{i+1}}$. By induction, the algorithm terminates at a global minimizer $\bar Q$ of $C_{\bar\beta}$.