Distributional robustness of K-class estimators and the PULSE

Summary: While causal models are robust in that they are prediction optimal under arbitrarily strong interventions, they may not be optimal when the interventions are bounded. We prove that the classical K-class estimator satisfies such optimality by establishing a connection between K-class estimators and anchor regression. This connection further motivates a novel estimator in instrumental variable settings that minimizes the mean squared prediction error subject to the constraint that the estimator lies in an asymptotically valid confidence region of the causal coefficient. We call this estimator PULSE (p-uncorrelated least squares estimator), relate it to work on invariance, show that it can be computed efficiently, as a data-driven K-class estimator, even though the underlying optimization problem is nonconvex, and prove consistency. We evaluate the estimators on real data and perform simulation experiments illustrating that PULSE suffers from less variability. There are several settings, including weak instrument settings, where it outperforms other estimators.


INTRODUCTION
Learning causal parameters from data has been a key challenge in many scientific fields and has been a long-studied problem in econometrics (e.g., Simon, 1953; Wold, 1954; Goldberger, 1972). Many years after the groundbreaking work by Peirce (1883) and Fisher (1935), causality again plays an increasingly important role in machine learning and statistics, two research areas that are most often considered part of mathematics or computer science (e.g., Spirtes et al., 2000; Pearl, 2009; Imbens and Rubin, 2015; Peters et al., 2017). Even though current research in mathematics and computer science on the one hand and econometrics on the other does not develop independently, we believe that there is a lot of potential for more fruitful interaction between these two fields. Differences in language have emerged, which can make communication difficult, but the target of inference, the underlying principles, and the methodology in both fields are closely related. This paper establishes a link between two developments in these fields: K-class estimation, which aims at estimation of causal parameters with good statistical properties, and invariance principles, which are used to build methods that are robust with respect to distributional shifts. This connection allows us to prove distributional robustness guarantees for K-class estimators and motivates a new estimator, PULSE. We summarize our main results in Section 1.2.

Several methods aim, under additional assumptions, to infer the causal structure, e.g., represented by a graph, from observational (or observational and interventional) data. This problem is sometimes referred to as causal discovery. Constraint-based methods assume that the underlying distribution is Markov and faithful with respect to the causal graph and perform conditional independence tests to infer (parts of) the graph; see, e.g., Spirtes et al. (2000).
Score-based methods assume a certain statistical model and optimize (penalized) likelihood scores; see, e.g., Chickering (2002). Some methods exploit a simple form of causal assignments, such as additive noise (e.g., Shimizu et al., 2006; Peters et al., 2014), and others are based on exploiting invariance statements (e.g., Meinshausen et al., 2016; Peters et al., 2016). Many such methods assume causal sufficiency, i.e., that all causally relevant variables have been observed, but some versions exist that allow for hidden variables; see, e.g., Claassen et al. (2013) and Spirtes et al. (1995).
Recent works in the fields of machine learning and computational statistics (e.g., Schölkopf et al., 2012; Heinze-Deml and Meinshausen, 2017; Pfister et al., 2019) investigate whether causal ideas can help to make machine learning methods more robust. The reasoning is that causal models are robust against any intervention in the following sense. Consider a target or response variable Y and covariates X_1, ..., X_p. If we regress Y on the set X_S, S ⊆ {1, ..., p}, of direct causes, then this regression function x ↦ E[Y | X_S = x] does not change when intervening on any of the covariates (which is sometimes referred to as 'invariance'). This statement can be proved using the local Markov property (Lauritzen, 1996), for example, but the underlying fundamental principle has been discussed already several decades ago; most prominently using the terms 'autonomy' or 'modularity' (Haavelmo, 1944; Aldrich, 1989). As a result, causal models of the form x ↦ E[Y | X_S = x] may perform well in prediction tasks, where, in the test distribution, the covariates have been intervened on. If, however, training and test distributions coincide, a model focusing only on prediction and the estimand x ↦ E[Y | X = x] may outperform a causal approach.
The two models described above (OLS and the causal model) formally solve a minimax problem on distributional robustness. Consider therefore an acyclic linear SEM over (Y, X) with observational distribution F. Details on SEMs and interventions can be found in Online Appendix S1. Assume that the assignment for Y equals Y := γ_0^T X + ε_Y for some γ_0 ∈ R^d. The variables corresponding to nonzero entries of γ_0 are called the parents of Y, and ε_Y is assumed to be independent of these parents. Then, the mean squared prediction error when considering the observational distribution is not necessarily minimized by γ_0; that is, in general, we have γ_0 ≠ γ_OLS := argmin_γ E_F[(Y − γ^T X)²]. Intuitively, we may improve the prediction of Y by including other variables than the parents of Y, such as its descendants. When considering distributional robustness, we are interested in finding a γ that minimizes the worst case expected squared prediction error over a class of distributions, ℱ, that is,

argmin_γ sup_{F' ∈ ℱ} E_{F'}[(Y − γ^T X)²].   (1.1)

If we observe data from all different distributions in ℱ (and know which data point comes from which distribution), we can tackle this optimization directly (Meinshausen and Bühlmann, 2015). But estimators of equation (1.1) may be available even if we do not observe data from each distribution in ℱ. The true causal coefficient γ_0, for example, minimizes equation (1.1) when ℱ is the set of all possible (hard) interventions on X (e.g., Rojas-Carulla et al., 2018). The OLS solution is optimal when ℱ only contains the training distribution. In this sense, the OLS solution and the true causal coefficient constitute the end points of a spectrum of estimators that are prediction optimal under a certain class of distributions. Intuitively, models trading off causality and predictability may perform well in situations where the test distribution is only moderately different from the training distribution. Anchor regression by Rothenhäusler et al.
(2021) is one approach formalizing this intuition in a linear setup; see Section 2.2 for details. Similarly to an instrumental variable setting, one assumes the existence of exogenous variables called A (for anchor) that may or may not act directly on the target Y . The proposed estimator minimizes a convex combination of the residual sum of squares and the TSLS loss function and is shown to be prediction optimal in the sense of equation (1.1) for a class F containing interventions on the covariates up to a certain strength; this strength depends on a regularization parameter: the weight that is used in the convex combination of anchor regression. Other approaches (Magliacane et al., 2018;Rojas-Carulla et al., 2018;Pfister et al., 2021) search over different subsets S and aim to choose sets that are both invariant and predictive.
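The anchor regression estimator described above has a simple closed form obtained from the normal equation of its penalized objective (see Section 2.2). The following minimal numpy sketch illustrates the trade-off; the data-generating coefficients and the helper name `anchor_regression` are ours, chosen purely for illustration.

```python
import numpy as np

def anchor_regression(Y, X, A, lam):
    """argmin_g ||Y - X g||^2 + lam * ||P_A (Y - X g)||^2 for lam > -1.
    The closed form follows from the normal equation of the penalized objective."""
    P_A = A @ np.linalg.solve(A.T @ A, A.T)     # projection onto col(A)
    W = np.eye(len(Y)) + lam * P_A              # I + lam * P_A
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Toy linear SEM with a hidden confounder H (coefficients illustrative only).
rng = np.random.default_rng(0)
n = 500
A = rng.normal(size=(n, 2))                     # anchors / exogenous variables
H = rng.normal(size=n)                          # hidden variable
X = (A @ np.array([1.0, -0.5]) + H + rng.normal(size=n))[:, None]
Y = 2.0 * X[:, 0] + H + rng.normal(size=n)

g_ols    = anchor_regression(Y, X, A, 0.0)      # lam = 0 recovers OLS
g_anchor = anchor_regression(Y, X, A, 10.0)     # larger lam: more invariance
```

Increasing lam moves the solution from the purely predictive OLS coefficient towards a coefficient whose residuals are less correlated with the anchors.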

Summary and contributions
This paper contains two main contributions: a distributional robustness property of K-class estimators with fixed κ-parameter and a novel estimator for causal coefficients called the p-uncorrelated least squares estimator (PULSE). The following two sections summarize our contributions.

Distributional robustness of K-class estimators.
In Section 2 we show that anchor regression is closely related to K-class estimators. In particular, we prove that, for a restricted subclass of models, K-class estimators can be written as anchor regression estimators. For this subclass, this directly implies a distributional robustness property of K-class estimators. We then prove a similar robustness property for general K-class estimators with a fixed penalty parameter, and show that these properties hold even if the model is misspecified.
Consider a possibly cyclic linear SEM over the variables (Y, X, H, A) of the form

(Y, X^T, H^T) := (Y, X^T, H^T)B + A^T M + ε^T,

subject to regularity conditions that ensure that the distribution of (Y, X, H, A) is well-defined.
Here, B and M are constant matrices, the random vectors A and ε are defined on a common probability space (Ω, F, P), Y is the endogenous target for the single equation inference, X are the observed endogenous variables, H are hidden endogenous variables, and A are exogenous variables independent of the unobserved noise innovations ε. SEMs allow for the notion of interventions, i.e., modelling external manipulations of the system. In this work, we are only concerned with interventions on the exogenous variables A of the form do(A := v). Because A is exogenous, these interventions can be defined as follows: they change the distribution of A to that of a random vector v. The interventional distribution of the variables (Y, X, H, A) under the intervention do(A := v) is given by the simultaneous distribution of (X_v, Y_v, H_v, v) generated by the SEM in which A has been replaced by v. Thus, the intervention does not change any of the original structural assignments of the endogenous variables. Instead, the change in the distribution of the exogenous variable propagates through the system. We henceforth let E_{do(A:=v)} denote the expectation with respect to the interventional distribution of the system under the intervention do(A := v). More details on interventions can be found in Online Appendix S1. Let (Y, X, H, A) consist of n row-wise independent and identically distributed copies of the random vector (Y, X, H, A) and consider the single equation of interest

Y = Xγ_0 + Aβ_0 + Hη_0 + ε_Y.

The K-class estimator with parameter κ, using the nonsample information that only Z_* ⊂ [X A] have nonzero coefficients in the target equation of interest, is given by

α̂_K^n(κ) := (Z_*^T(I − κP_A^⊥)Z_*)^{-1} Z_*^T(I − κP_A^⊥)Y,

where P_A^⊥ := I − A(A^T A)^{-1}A^T is the projection onto the orthogonal complement of the column space of A. For a fixed κ ∈ [0, 1), K-class estimators can be represented by a penalized regression problem,

α̂_K^n(κ) = argmin_α l_OLS^n(α) + κ/(1 − κ) · l_IV^n(α),

where l_OLS^n and l_IV^n are the empirical OLS and TSLS loss functions, respectively. This representation and the ideas of Rothenhäusler et al.
(2021) allow us to prove that the K-class estimator converges to a coefficient that is minimax optimal when considering all distributions induced by a certain set of interventions on A. More specifically, we show that for a fixed κ and regardless of identifiability,

α̂_K^n(κ) →_P argmin_α sup_{v ∈ C(κ)} E_{do(A:=v)}[(Y − α^T Z_*)²].

The argmin on the right-hand side minimizes the worst case prediction error when considering interventions up to a certain strength (measured by the set C(κ)). This objective becomes relevant when we consider a response variable with several covariates and aim to minimize the mean squared prediction error of future realizations of the system of interest that do not follow the training distribution. The above result says that if the new realizations correspond to (unknown) interventions on the exogenous variables that are of bounded strength, K-class estimators with fixed κ ∈ (0, 1) minimize the worst case prediction performance and, in particular, outperform the true causal parameter and the least squares solution (see also Figure S8.1 in Online Appendix S8.1). For κ approaching one, we recover the guarantee of the causal solution, and for κ approaching zero, the set of distributions contains the training distribution. The above minimax property therefore adds to the discussion of whether nonconsistent K-class estimators with penalty parameter not converging to one can be useful; see, e.g., Dhrymes (1974).
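The K-class estimator is cheap to compute and interpolates between OLS and TSLS. The following numpy sketch, on an illustrative confounded toy model (the helper `k_class` and all coefficients are ours, not from the paper's code), shows the closed form and its equivalence with the penalized-regression representation.

```python
import numpy as np

def k_class(Y, Z, A, kappa):
    """K-class estimate (Z'(I - kappa*M_A) Z)^{-1} Z'(I - kappa*M_A) Y,
    with M_A = I - P_A: kappa = 0 gives OLS and kappa = 1 gives TSLS."""
    n = len(Y)
    P_A = A @ np.linalg.solve(A.T @ A, A.T)
    W = np.eye(n) - kappa * (np.eye(n) - P_A)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ Y)

# Confounded toy model; the true causal coefficient is 0.5.
rng = np.random.default_rng(1)
n = 500
A = rng.normal(size=(n, 2))
H = rng.normal(size=n)
Z = (A @ np.array([1.0, 1.0]) + H + rng.normal(size=n))[:, None]
Y = 0.5 * Z[:, 0] + 2.0 * H + rng.normal(size=n)

a_ols  = k_class(Y, Z, A, 0.0)   # biased towards the predictive solution
a_tsls = k_class(Y, Z, A, 1.0)   # consistent for the causal coefficient
a_half = k_class(Y, Z, A, 0.5)   # interpolates between the two
```

For kappa = 0.5 the weighting matrix equals 0.5(I + P_A), so the estimate coincides with the penalized form using penalty κ/(1 − κ) = 1; the common scalar cancels in the normal equation.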

The PULSE estimator.
Section 3 contains the second main contribution of this work. We propose a novel data-driven K-class estimator for causal coefficients, which, as mentioned above, we call PULSE. As above, we consider a single endogenous target in an SEM (or simultaneous equation model) and aim to predict it from observed predictors that are known, from a priori (nonsample) information, to be either endogenous or exogenous. The PULSE estimator can be written in several equivalent forms. It can, first, be seen as a data-driven K-class estimator

α̂_K^n(λ_n/(1 + λ_n)) = argmin_α l_OLS^n(α) + λ_n · l_IV^n(α), where
λ_n := inf{λ > 0 : testing Corr(A, Y − Zα̂_K^n(λ/(1 + λ))) = 0 yields a p-value ≥ p_min},

for some pre-specified level p_min ∈ (0, 1) of the hypothesis test. In other words, the PULSE estimator outputs the K-class estimator closest to the OLS while maintaining a nonrejected test of uncorrelatedness. In principle, PULSE can be used with any testing procedure. The choice of test, however, may influence the difficulty of the resulting optimization problem. In this paper, we investigate PULSE in connection with a specific class of hypothesis tests that, for example, contains the test of Anderson and Rubin (1949). For these hypothesis tests we develop an efficient and provably correct optimization method that is based on binary line search and quadratic programming.
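The data-driven choice of λ_n can be approximated by a binary line search along the K-class path, provided the p-value behaves monotonically in λ (a property the paper's algorithm exploits for its class of tests). The following numpy sketch uses a simple score-type test as an illustrative stand-in for the tests analysed in the paper; the helpers `k_class_lam`, `p_value`, and `pulse`, and the toy coefficients, are our own assumptions.

```python
import numpy as np

def k_class_lam(Y, Z, A, lam):
    """K-class estimate parametrised as argmin l_OLS + lam * l_IV,
    i.e., kappa = lam / (1 + lam)."""
    P_A = A @ np.linalg.solve(A.T @ A, A.T)
    W = np.eye(len(Y)) + lam * P_A
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ Y)

def p_value(Y, Z, A, alpha):
    """Illustrative score-type test of Corr(A, Y - Z alpha) = 0:
    T = n * |P_A r|^2 / |r|^2 is compared with chi^2(2) (two anchors here);
    for df = 2 the chi-square survival function is exp(-T / 2)."""
    r = Y - Z @ alpha
    P_A = A @ np.linalg.solve(A.T @ A, A.T)
    T = len(Y) * (r @ P_A @ r) / (r @ r)
    return np.exp(-T / 2.0)

def pulse(Y, Z, A, p_min=0.05, lam_max=1e6, tol=1e-4):
    """Binary search for a small lam whose K-class estimate is not rejected,
    assuming the p-value increases along the K-class path."""
    if p_value(Y, Z, A, k_class_lam(Y, Z, A, 0.0)) >= p_min:
        return k_class_lam(Y, Z, A, 0.0)      # OLS is already accepted
    lo, hi = 0.0, lam_max
    while hi - lo > tol * (1.0 + lo):
        mid = 0.5 * (lo + hi)
        if p_value(Y, Z, A, k_class_lam(Y, Z, A, mid)) >= p_min:
            hi = mid
        else:
            lo = mid
    return k_class_lam(Y, Z, A, hi)

# Confounded toy data: OLS residuals are correlated with the anchors.
rng = np.random.default_rng(2)
n = 300
A = rng.normal(size=(n, 2))
H = rng.normal(size=n)
Z = (A @ np.array([1.0, 1.0]) + H + rng.normal(size=n))[:, None]
Y = 0.5 * Z[:, 0] + 2.0 * H + rng.normal(size=n)

a_ols   = k_class_lam(Y, Z, A, 0.0)
a_pulse = pulse(Y, Z, A)
```

Here the search increases λ just far enough that the uncorrelatedness test is no longer rejected, yielding the accepted coefficient closest to OLS along the path.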
We show that our estimator can, second, be written as the solution to a constrained optimization problem. To that end, define the primal problems

α̂_Pr^n(t) := argmin_α l_OLS^n(α) subject to l_IV^n(α) ≤ t.
For the choice t_n := sup{t : testing Corr(A, Y − Zα̂_Pr^n(t)) = 0 yields a p-value ≥ p_min}, we provide a detailed analysis proving that α̂_K^n(λ_n/(1 + λ_n)) = α̂_Pr^n(t_n). For the testing procedure proposed in this paper, we show that, third, PULSE can be written as

α̂_PULSE^n = argmin_α l_OLS^n(α) subject to α ∈ A_n(1 − p_min),

where A_n(1 − p_min) is the nonconvex acceptance region of our test of uncorrelatedness.
This third formulation allows for a simple interpretation of our estimator: among all coefficients (not restricted to K-class estimators) that do not yield a rejection of uncorrelatedness, we choose the one that yields the best prediction. If the acceptance region is empty, PULSE outputs a warning indicating a possible model misspecification or an assumption violation to the user (in that case, one can formally output another estimator, such as TSLS or Fuller, which makes PULSE well-defined).
In the just-identified setup, the TSLS estimator solves a normal equation which is equivalent to setting a sample covariance between the instruments and the resulting prediction residuals to zero; it then corresponds to t = 0. For this (and the over-identified) setting, we prove that PULSE is a consistent estimator for the causal coefficient.
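In the just-identified case, the normal-equation view can be checked directly: with as many instruments as endogenous regressors, A^T(Y − Zα) = 0 has an exact solution, so the TSLS loss at the estimate is exactly zero (the constraint level t = 0). A small numpy illustration on toy data of our own choosing:

```python
import numpy as np

# Just-identified setting: one endogenous regressor, one instrument.
rng = np.random.default_rng(3)
n = 400
A = rng.normal(size=(n, 1))
H = rng.normal(size=n)
Z = (A[:, 0] + H + rng.normal(size=n))[:, None]
Y = 1.5 * Z[:, 0] + H + rng.normal(size=n)

# TSLS solves the empirical normal equation A'(Y - Z a) = 0 exactly.
a_tsls = np.linalg.solve(A.T @ Z, A.T @ Y)

r = Y - Z @ a_tsls
P_A = A @ np.linalg.solve(A.T @ A, A.T)
l_iv = (r @ P_A @ r) / n          # TSLS loss at the estimate: exactly zero
```

This illustrates why the constraint l_IV^n(α) ≤ 0 is only feasible at the TSLS solution in this setting.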
The TSLS does not have a finite variance if, for example, there is an insufficient degree of over-identification. For weak instruments in particular, this usually comes with poor finite sample performance. In such cases, however, the acceptance region of uncorrelatedness is usually large. This yields a weak constraint in the optimization problem, and PULSE will be closer to the OLS, which in certain settings suffers from less variability (see, e.g., Hahn et al., 2004; Hahn and Hausman, 2005). In simulations we indeed see that, similarly to other data-driven K-class estimators that are pulled towards the OLS, such as Fuller estimators, PULSE comes with beneficial finite sample properties compared to TSLS and LIML.
Unlike other estimators, such as LIML or the classical TSLS, PULSE is well-defined in under-identified settings, too. Here, its objective is still to find the best predictive solution among all parameters that do not reject uncorrelatedness. Uncorrelatedness with the exogenous variables is sometimes referred to as invariance. The idea of choosing the best predictive model among all invariant models has been investigated in several works (e.g., Magliacane et al., 2018; Rojas-Carulla et al., 2018; Pfister et al., 2021) with the motivation to find models that generalize well (in particular, with respect to interventions on the exogenous variables). Existing methods, however, focus on selecting subsets of variables and then consider least squares regression of the response variable onto the full subset. PULSE can recover this type of solution if it is indeed optimal, but it also allows us to search over coefficients that differ from least squares regressions on sets of variables. Consequently, PULSE can find solutions in situations where the above methods would not find any invariant subsets, which may often be the case if there are hidden variables (see Online Appendix S8.3 for an example).
We show in a simulation study that there are several settings in which PULSE outperforms existing estimators both in terms of MSE ordering and several one-dimensional scalarizations of the MSE. More specifically, we show that PULSE can outperform the TSLS and Fuller estimators in weak instrument situations, for example, where Fuller estimators are known to have good MSE properties; see, e.g., Stock et al. (2002) and Hahn et al. (2004).
Implementation of PULSE and code for the experiments (in R) are available on GitHub.

ROBUSTNESS PROPERTIES OF K-CLASS ESTIMATORS
In this section we consider K-class estimators (Theil, 1958;Nagar, 1959) and show a connection with anchor regression of Rothenhäusler et al. (2021). In Section 2.3.1 we establish the connection in models where we use a priori information that there are no included exogenous variables in the target equation of interest. In Section 2.3.2 we then show that general K-class estimators can be written as the solution to a penalized regression problem. In Section 2.3.3 we utilize this representation and the ideas of Rothenhäusler et al. (2021) to prove a distributional robustness guarantee of general K-class estimators with fixed κ ∈ [0, 1), even under model misspecification and nonidentifiability. Proofs of results in this section can be found in Online Appendix S3.

Setup and assumptions
Let the random vectors Y ∈ R, X ∈ R^d, A ∈ R^q, H ∈ R^r and ε ∈ R^{d+1+r} denote the target, endogenous regressors, anchors, hidden variables and noise variables, respectively. Let further (Y, X, H) be generated by the possibly cyclic structural equation model (SEM)

(Y, X^T, H^T) := (Y, X^T, H^T)B + A^T M + ε^T,   (2.1)

for some random vector ε ⫫ A and constant matrices B and M. Let (Y, X, H, A) consist of n ≥ min{d, q} row-wise independent and identically distributed copies of the random vector (Y, X, H, A). Solving for the endogenous variables, we get the structural and reduced form equations [Y X H]Γ = AM + ε and [Y X H] = AΠ + εΓ^{-1}, where Γ := I − B and Π := MΓ^{-1}. Assume without loss of generality that Γ has a unit diagonal, such that the target equation of interest is given by

Y = Xγ_0 + Hη_0 + Aβ_0 + ε_Y,   (2.2)

where (1, −γ_0^T, −η_0^T)^T ∈ R^{1+d+r}, β_0 ∈ R^q and ε_Y are the first columns of Γ, M and ε, respectively, and where Z := [X A] and Ũ_Y := Hη_0 + ε_Y. The possible dependence between the noise Ũ_Y and the endogenous variables, i.e., the influence of hidden variables, generally renders the standard OLS approach for estimating α_0 inconsistent. Instead, one can make use of the components in A that have vanishing coefficients in equation (2.2) for consistent estimation. In the remainder of this work, we disregard any a priori (nonsample) information not concerning the target equation. The question of identifiability of α_0 has been studied extensively (Frisch, 1938; Haavelmo, 1944; Koopmans et al., 1950) and more recent overviews can be found in, e.g., Fisher (1966), Greene (2003), and Didelez et al. (2010).
We will use the following assumptions concerning the structure of the SEM (Assumptions 2.1 to 2.3). We will henceforth assume that Assumption 2.1 always holds. This assumption ensures that the SEM and the TSLS objectives are well-defined. In the above assumptions, Z_* and its data-matrix counterpart are generic placeholders for a subset of endogenous and exogenous variables from [X A], which should be clear from the context in which they are used. Both Assumption 2.1 (i) and Assumption 2.2 (c) hold if X and A have a density with respect to the Lebesgue measure, which in turn is guaranteed by Assumption 2.1 (d) if A and ε have a density with respect to the Lebesgue measure. Assumptions 2.1 (h) and 2.1 (i) imply that the instrumental variable objective functions introduced below are almost surely well-defined, and Assumption 2.2 (c) yields that the OLS solution is almost surely well-defined. Assumption 2.1 (f) implies that Y, X and H all have finite second moments. For Assumptions 2.3 (b) and 2.2 (b), it is necessary that q ≥ dim(Z_*), i.e., the setup must be just- or over-identified; see Section 3.1.

Distributional robustness of anchor regression

Rothenhäusler et al. (2021) propose a method, called anchor regression, for predicting the endogenous target variable Y from the endogenous variables X. The collection of exogenous variables A, called anchors, is not included in that prediction model. Anchor regression trades off predictability and invariance by considering a convex combination of the OLS loss function and the TSLS (IV) loss function using the anchors as instruments. More formally, we define

l_OLS(γ) := E[(Y − γ^T X)²],  l_OLS^n(γ) := n^{-1}‖Y − Xγ‖₂²,   (2.3)
l_IV(γ) := E[A(Y − γ^T X)]^T E[AA^T]^{-1} E[A(Y − γ^T X)],  l_IV^n(γ) := n^{-1}‖P_A(Y − Xγ)‖₂²,   (2.4)

as the population and finite sample versions of the loss functions. Here, P_A = A(A^T A)^{-1}A^T is the orthogonal projection onto the column space of A. To simplify notation, we omit the dependence on Y, X or A when it is clear from a given context. For a penalty parameter λ > −1, anchor regression is given by

γ_AR(λ) := argmin_γ l_OLS(γ) + λ · l_IV(γ),   (2.5)
γ̂_AR^n(λ) := argmin_γ l_OLS^n(γ) + λ · l_IV^n(γ).   (2.6)

The estimator γ̂_AR^n(λ) consistently estimates the population estimand γ_AR(λ) and minimizes prediction error while simultaneously penalizing a transformed sample covariance between the anchors and the resulting prediction residuals. Unlike the TSLS estimator, for example, the anchor regression estimator is almost surely well-defined under the rank condition of Assumption 2.2 (c), even if the model is under-identified, that is, if there are fewer exogenous than endogenous variables. The solution to the empirical minimization problem of anchor regression is given by

γ̂_AR^n(λ) = (X^T(I + λP_A)X)^{-1} X^T(I + λP_A)Y,   (2.7)

which follows from solving the normal equation of equation (2.6). The motivation of anchor regression is not to infer a causal parameter. Instead, for a fixed penalty parameter λ, the estimator is shown to possess a distributional or interventional robustness property: the estimator is optimal when predicting under interventions on the exogenous variables that are below a certain intervention strength. By theorem 1 of Rothenhäusler et al. (2021), it holds that

γ̂_AR^n(λ) →_P argmin_γ sup_{v ∈ C(λ)} E_{do(A:=v)}[(Y − γ^T X)²],

where C(λ) := {v : E[vv^T] ⪯ (1 + λ)E[AA^T]}.

Distributional robustness of K-class estimators
We now introduce the limited information estimators known as K-class estimators (Theil, 1958; Nagar, 1959) used for single equation inference. Suppose that we are given nonsample information about which components of γ_0 and β_0 of equation (2.2) are zero. We can then partition [X A] = [X_* X_{−*} A_* A_{−*}], where X_{−*} and A_{−*} correspond to the variables for which our nonsample information states that the components of γ_0 and β_0 are zero, respectively. We call the variables corresponding to A_* included exogenous variables. Similarly, we write γ_0 = (γ_0,*, γ_0,−*), β_0 = (β_0,*, β_0,−*) and α_0,* := (γ_0,*, β_0,*). In the case that the nonsample information is indeed correct, we have that U_Y = Ũ_Y = Hη_0 + ε_Y. When well-defined, the K-class estimator with parameter κ ∈ R for a simultaneous estimation of α_0,* is given by

α̂_K^n(κ) := (Z_*^T(I − κP_A^⊥)Z_*)^{-1} Z_*^T(I − κP_A^⊥)Y.   (2.8)

Comparing equations (2.7) and (2.8) suggests a close connection between anchor regression and K-class estimators for inference of structural equations with no included exogenous variables. In the following subsections, we establish this connection and subsequently extend the distributional robustness property to general K-class estimators. © The Author(s) 2022.

K-class estimators in models with no included exogenous variables.
Assume that, in addition to Assumption 2.1, we have the nonsample information that β_0 = 0, that is, no exogenous variable in A directly affects the target variable Y. By direct comparison we see that the K-class estimator for κ < 1 coincides with the anchor regression estimator with penalty parameter λ = κ/(1 − κ). Equivalently, we have γ̂_AR^n(λ) = γ̂_K^n(λ/(1 + λ)) for any λ > −1. As such, the K-class estimator, for a fixed κ < 1, inherits the following distributional robustness property:

γ̂_K^n(κ) →_P argmin_γ sup_{v ∈ C(κ/(1−κ))} E_{do(A:=v)}[(Y − γ^T X)²],   (2.9)

where C(λ) := {v : E[vv^T] ⪯ (1 + λ)E[AA^T]}. This statement holds by theorem 1 of Rothenhäusler et al. (2021).
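The equivalence γ̂_AR^n(λ) = γ̂_K^n(λ/(1+λ)) follows because the K-class weighting matrix satisfies I − κ(I − P_A) = (1 − κ)(I + λP_A) for κ = λ/(1+λ), and the scalar factor cancels in the normal equation. A quick numerical check on toy data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
A = rng.normal(size=(n, 2))
H = rng.normal(size=n)
X = (A @ np.array([1.0, 0.5]) + H + rng.normal(size=n))[:, None]
Y = X[:, 0] + H + rng.normal(size=n)

P = A @ np.linalg.solve(A.T @ A, A.T)   # projection onto col(A)
I = np.eye(n)
lam = 3.0
kappa = lam / (1.0 + lam)

# anchor regression: normal equation of l_OLS + lam * l_IV
g_ar = np.linalg.solve(X.T @ (I + lam * P) @ X, X.T @ (I + lam * P) @ Y)
# K-class with kappa = lam / (1 + lam): weighting matrix I - kappa * (I - P)
g_k = np.linalg.solve(X.T @ (I - kappa * (I - P)) @ X,
                      X.T @ (I - kappa * (I - P)) @ Y)
```

The two solves use matrices that are scalar multiples of one another, so the estimates agree up to floating-point error.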
In an identifiable model with plim_{n→∞} κ = 1, the estimator γ̂_K^n(κ) consistently estimates the causal parameter; see, e.g., Mariano (2001). For such a choice of κ, the robustness above is just a weaker version of what the causal coefficient can guarantee. However, the above result in equation (2.9) establishes a robustness property for fixed κ < 1, even in cases where the model is not identifiable. Furthermore, since the derivation did not require the nonsample information β_0 = 0 to be true, the robustness property is resilient to model misspecification in terms of excluding included exogenous variables from the target equation, which generally also breaks identifiability.

The K-class estimators as penalized regression estimators.
We now show that general K-class estimators can be written as solutions to penalized regression problems. The first appearance of such a representation is, to the best of our knowledge, due to McDonald (1977), building upon previous work of Basmann (1960a, b). Their representation, however, concerns only the endogenous part γ. We require a slightly different statement and will show that the entire K-class estimator of α_0,*, i.e., the simultaneous estimation of γ_0,* and β_0,*, can be written as a penalized regression problem. Let therefore l_IV(α; Y, Z_*, A), l_IV^n(α; Y, Z_*, A) and l_OLS(α; Y, Z_*), l_OLS^n(α; Y, Z_*) denote the population and empirical TSLS and OLS loss functions as defined in equations (2.3) to (2.4); that is, the TSLS loss function for regressing Y on the included endogenous and exogenous variables Z_*, using the exogeneity of A_* and A_{−*} as instruments, and the OLS loss function for regressing Y on Z_*. We define the K-class population and finite sample loss functions as affine combinations of the two loss functions above. That is,

l_K(α; κ) := (1 − κ) · l_OLS(α) + κ · l_IV(α),  l_K^n(α; κ) := (1 − κ) · l_OLS^n(α) + κ · l_IV^n(α),   (2.11)

so that, when well-defined, α̂_K^n(κ) = argmin_α l_K^n(α; κ).   (2.12)

Assuming κ < 1, we can rewrite equation (2.12) as

α̂_K^n(κ) = argmin_α l_OLS^n(α) + κ/(1 − κ) · l_IV^n(α).

Thus, K-class estimators seek to minimize the OLS loss for regressing Y on Z_*, while simultaneously penalizing a transformed sample covariance between the prediction residuals and the collection of exogenous variables A.
In the following section, we consider a population version of the above quantity. If we replace the finite sample Assumption 2.2 with the corresponding population Assumption 2.3, we get that the minimization estimator of the empirical loss function of equation (2.11) is asymptotically well-defined. Furthermore, we prove that, whenever the population assumptions are satisfied, for any fixed κ ∈ [0, 1], α̂_K^n(κ; Y, Z_*, A) converges in probability to the population K-class estimand.

Distributional robustness of general K-class estimators.
We are now able to prove that the general K-class estimator possesses a robustness property similar to the statements above. It is prediction optimal under a set of interventions on all exogenous A up to a certain strength.
THEOREM 2.1. Let Assumption 2.1 hold. For any fixed κ ∈ [0, 1) and Z_* = (X_*, A_*) with X_* ⊆ X and A_* ⊆ A, we have, whenever the population K-class estimand is well-defined, that

α̂_K^n(κ) →_P argmin_α sup_{v ∈ C(κ)} E_{do(A:=v)}[(Y − α^T Z_*)²], where C(κ) := {v : E[vv^T] ⪯ (1 − κ)^{-1} E[AA^T]}.

Here, E_{do(A:=v)} denotes the expectation with respect to the distribution entailed under the intervention do(A := v) (see Section 1.2.1 and Online Appendix S1) and (Ω, F, P) is the common background probability space on which A and ε are defined.
In other words, among all linear prediction methods of Y using Z_* as predictors, the K-class estimator with parameter κ has the lowest possible worst case mean squared prediction error when considering all interventions on the exogenous variables A contained in C(κ). As κ approaches one, the estimator is prediction optimal under a class of arbitrarily strong interventions in the direction of the variance of A. (Here, κ is arbitrary but fixed; the statement does not cover data-driven choices of κ, such as LIML or Fuller.) The above result is a consequence of the relation between anchor regression and K-class estimators. The special case A_* = ∅ is a consequence of theorem 1 by Rothenhäusler et al. (2021). Our proof follows similar arguments, but additionally allows for A_* ≠ ∅.
The property in Theorem 2.1 has a decision-theoretic interpretation (see Chamberlain, 2007, for an application of decision theory in IV models based on another loss function). Consider a response Y, covariates Z_*, a distribution (specified by θ) over (Y, Z_*), and the squared loss ℓ(Y, Z_*, α) := (Y − α^T Z_*)². Then, assuming finite variances, for each distribution the risk E_θ[(Y − α^T Z_*)²] is minimized by the (population) OLS solution α = α_θ := Cov_θ(Z_*)^{-1} Cov_θ(Z_*, Y). In the setting of Theorem 2.1, we are given a distribution over (Y, Z_*), specified by θ, but we are interested in minimizing the risk E_{θ,v}[(Y − α^T Z_*)²] for another distribution that is induced by an intervention and specified by (θ, v). The above result states that the K-class estimator minimizes a worst case risk when considering all v ∈ C(κ).
Theorem 2.1 makes use of the language of SEMs in that it yields the notion of interventions.² As such, the result can be rephrased using other causal frameworks. The crucial assumptions are the exogeneity of A and the linearity of the system. Furthermore, the result is robust with respect to several types of model misspecification that break the identifiability of α_0, such as excluding included endogenous or exogenous predictors or the existence of latent mediators between the exogenous variables and the target; see Remark S7.1 in Online Appendix S7.

THE P-UNCORRELATED LEAST SQUARE ESTIMATOR
We now introduce the PULSE. As discussed in Section 1.2, PULSE allows for different representations. In this section we start with the third representation and show the equivalence of the other representations afterwards.
Consider predicting the target Y from the endogenous and possibly exogenous regressors Z. Let therefore H_0(α) denote the hypothesis that the prediction residuals, using α as a regression coefficient, are simultaneously uncorrelated with every exogenous variable, that is, H_0(α): Corr(A, Y − α^T Z) = 0. In some models, under certain conditions, this hypothesis is equivalent to the hypothesis that α is the true causal coefficient. One of these conditions is the rank condition of Assumption 3.5 introduced below, also known as the rank condition for identification; see Wooldridge (2010).
The two-stage least squares (TSLS) estimator exploits the equivalence between the causal coefficient and the zero correlation between the instruments and the regression residuals. Here, one minimizes a sample covariance between the instruments and the regression residuals: we can write l_IV^n(α; Y, Z, A) = ‖Cov_n(A, Y − α^T Z)‖²_{(n^{-1}A^T A)^{-1}} when A is mean zero.³ In the just-identified setup, the TSLS estimator yields a sample covariance that is exactly zero and is known to be unstable, in that it has no moments of any order. Intuitively, the constraint of vanishing sample covariance may be too strong.
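The identity l_IV^n(α) = ‖Cov_n(A, Y − α^T Z)‖² in the norm induced by (n^{-1}A^T A)^{-1} can be verified numerically for centred A, where the sample cross-moment A^T r / n equals the sample covariance. A short numpy check (toy data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 250
A = rng.normal(size=(n, 2))
A = A - A.mean(axis=0)            # centre so that A has sample mean zero
Z = rng.normal(size=(n, 1))
Y = rng.normal(size=n)
alpha = np.array([0.7])           # an arbitrary candidate coefficient

r = Y - Z @ alpha
P_A = A @ np.linalg.solve(A.T @ A, A.T)
l_iv = (r @ P_A @ r) / n          # l_IV^n(alpha) = (1/n) ||P_A r||^2

c = (A.T @ r) / n                 # sample covariance Cov_n(A, r) (A centred)
G_inv = np.linalg.inv(A.T @ A / n)
norm_sq = c @ G_inv @ c           # ||Cov_n(A, r)||^2 in the induced norm
```

Both quantities compute r^T A (A^T A)^{-1} A^T r / n, so they agree exactly up to floating-point error.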
Let T(α; Y, Z, A) be a finite sample test statistic for testing the hypothesis H_0(α) and let p-value(T(α; Y, Z, A)) denote the p-value associated with the test of H_0(α). We then define the PULSE as

α̂_PULSE^n(p_min) := argmin_α l_OLS^n(α; Y, Z) subject to p-value(T(α; Y, Z, A)) ≥ p_min,   (3.14)

where p_min is a pre-specified level of the hypothesis test. In other words, we aim to minimize the mean squared prediction error among all coefficients that yield a p-value for testing H_0(α) that does not fall below some pre-specified level-threshold p_min ∈ (0, 1), such as p_min = 0.05. That is, the minimization is constrained to the acceptance region of the test, i.e., a confidence region for the causal coefficient in the identified setup. Among these coefficients, we choose the solution that is 'closest' to the OLS solution.⁴ Thus, PULSE allows for an intuitive interpretation. We will see in the experimental section that it has good finite sample performance, in particular for weak instruments. Unlike other estimators, such as LIML, the above estimator is well-defined in the under-identified setup, too. In such cases, PULSE extends existing literature that aims to trade off predictability and invariance, but that so far has been restricted to searching over subsets of variables (see Section 1.2.2 and Online Appendix S8.3). To maintain consistency of the estimator, the chosen test must have asymptotic power of one.

² In particular, we have not considered the SEM as a model for counterfactual statements.
³ ‖·‖_{(n^{-1}A^T A)^{-1}} is the norm induced by the inner product ⟨x, y⟩ = x^T(n^{-1}A^T A)^{-1}y.
⁴ Here, closeness is measured in the OLS distance: we define the OLS norm via ‖α‖²_OLS := l_OLS^n(α + α̂_OLS^n) − l_OLS^n(α̂_OLS^n) = α^T Z^T Zα, where α̂_OLS^n is the OLS estimator. This defines a norm (rather than a semi-norm) if Z^T Z is nondegenerate. Minimizing l_OLS^n(α) is then equivalent to minimizing ‖α − α̂_OLS^n‖_OLS.
In this paper, we propose a class of significance tests that contains, e.g., the Anderson-Rubin test (Anderson and Rubin, 1949). While the objective function in equation (3.14) is quadratic in α, the resulting constraint is, in general, nonconvex. In Section 3.5, we develop a procedure that provably solves the optimization problem at low computational cost. Other choices of tests are possible, too, but may result in even harder optimization problems.
In Section 3.1 we briefly introduce the setup and assumptions. In Section 3.2 we specify a class of asymptotically consistent tests for H₀(α). In Section 3.3 we formally define the PULSE estimator. In Section 3.4 we show that the PULSE estimator is well-defined by proving that it is equivalent to a solvable convex quadratically constrained quadratic program, which we denote by the primal PULSE. In Section 3.5 we utilize duality theory and derive an alternative representation, which we denote by the dual PULSE. This representation yields a computationally feasible algorithm and shows that the PULSE estimator is a K-class estimator with a data-driven κ. Proofs of results in this section can be found in Online Appendix S5 unless stated otherwise.

Setup and assumptions
In the following sections we again let (Y, X, H, A) consist of n ≥ min{d, q} row-wise independent and identically distributed copies of (Y, X, H, A) generated in accordance with the SEM in equation (2.1). The structural equation of interest is Y = γ₀ᵀX + η₀ᵀH + β₀ᵀA + ε_Y. Assume that we have some nonsample information about which d₂ = d − d₁ and q₂ = q − q₁ coefficients of γ₀ and β₀, respectively, are zero. As in Section 2, we let the subscript * denote the variables and coefficients that are nonzero according to the nonsample information but, to simplify notation, we drop the * subscript from Z, Z and α₀; so, we write Z = [X_*ᵀ A_*ᵀ]ᵀ ∈ R^{d₁+q₁}, Z = [X_* A_*] ∈ R^{n×(d₁+q₁)} and α₀ := (γ₀,*ᵀ, β₀,*ᵀ)ᵀ ∈ R^{d₁+q₁}. We define a setup as being under-, just-, or over-identified according to the degree of over-identification q₂ − d₁ being negative, zero, or positive, respectively; that is, according to the number of excluded exogenous variables A_{−*} being smaller than, equal to, or larger than the number of included endogenous variables X_* in the target equation.
We assume that the global assumptions of Assumption 2.1 from Section 2.1 still hold. Furthermore, we will make use of the following situational assumptions.
ASSUMPTION 3.2. ε has nondegenerate marginals.

⁵ The PULSE estimator is defined for finite samples, but the following deliberation may help to build intuition: in an under-identified IV setting, minimizing l_OLS(γ) under the constraint that l_IV(γ) = 0 can be seen as choosing, among all causal models compatible with the distribution, the model with the least amount of confounding, when using E(Y − γᵀX)² − E(Y − γ_OLSᵀX)² as a measure of confounding.

© The Author(s) 2022.

ASSUMPTION 3.5. E[AZᵀ] is of full rank.
Assumption 3.1 (a) holds if our nonsample information is true and the instrument set A is independent of all unobserved endogenous variables H_i that directly affect the target Y. This holds, for example, if the latent variables are source nodes; that is, they have no parents in the causal graph of the corresponding SEM. Assumption 3.1 (b) can be achieved by centring the data. Strictly speaking, this introduces a weak dependence structure in the observations, which is commonly ignored. Alternatively, one can perform sample splitting. For more details on this assumption and the possibility of relaxing it, see Remark 3.1. Assumption 3.3 (a) ensures that K-class estimators for κ < 1 are well-defined, regardless of the degree of over-identification. In the under-identified setup, Assumption 3.3 (b) yields that there exists a subspace of solutions minimizing l_IV^n(α). In the just- and over-identified setups, this assumption ensures that l_IV^n(α) has a unique minimizer given by the TSLS estimator α̂_TSLS^n := (Zᵀ P_A Z)⁻¹ Zᵀ P_A Y. Assumption 3.4 is used to ensure that the OLS objective function α ↦ l_OLS^n(α; Y, Z) is strictly positive, such that division by this function is always well-defined. Assumptions 3.2 and 3.5 ensure that various limiting arguments are valid. In the just- and over-identified setups, Assumption 3.5 is known as the rank condition for identification.

Testing for vanishing correlation
We now introduce a class of tests for the null hypothesis H₀(α): Corr(A, Y − Zα) = 0 that have point-wise asymptotic level and point-wise asymptotic power. These tests will allow us to define the corresponding PULSE estimator. When Assumption 3.4 holds, we can define T_n^c : R^{d₁+q₁} → R by

T_n^c(α) := c(n) · (Y − Zα)ᵀ P_A (Y − Zα) / ((Y − Zα)ᵀ(Y − Zα)),

where P_A := A(AᵀA)⁻¹Aᵀ denotes the projection onto the column space of A and c(n) is a function that will typically scale linearly in n. Let Q_{χ²_q}(1 − p) denote the 1 − p quantile of the central chi-squared distribution with q degrees of freedom. By standard limiting theory we can test H₀(α) in the following manner.

LEMMA 3.1 (LEVEL AND POWER OF THE TEST). Let Assumptions 3.1, 3.2, and 3.4 hold and assume that c(n) ∼ n as n → ∞. For any p ∈ (0, 1) and any fixed α, the statistical test rejecting the null hypothesis H₀(α) if T_n^c(α) > Q_{χ²_q}(1 − p) has point-wise asymptotic level p and point-wise asymptotic power of 1 against all alternatives as n → ∞.

Depending on the choice of c(n), this class contains several tests, some of which are well known. With c(n) = n − q + Q_{χ²_q}(1 − p_min), for example, one recovers a test that is equivalent to the asymptotic version of the Anderson-Rubin test (Anderson and Rubin, 1949). We make this connection precise in Remark S7.2 in Online Appendix S7. The Anderson-Rubin test is robust to weak instruments in the sense that the limiting distribution of the test statistic under the null hypothesis is not affected by weak instrument asymptotics; see, e.g., Staiger and Stock (1997) and Stock et al. (2002).⁶ For weak instruments, the confidence region may be unbounded with large probability; see Dufour (1997). Moreira (2009) shows that the test suffers from loss of power in the over-identified setting.

To simplify notation, we will from now on work with the choice c(n) = n and define the acceptance region with level p_min ∈ (0, 1) as A_n(1 − p_min) := {α ∈ R^{d₁+q₁} : T_n(α) ≤ Q_{χ²_q}(1 − p_min)}, where T_n corresponds to the choice c(n) = n.
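The following sketch (simulated data, hypothetical variable names) evaluates the statistic with c(n) = n, assuming the form T_n(α) = n · rᵀP_A r / rᵀr for residuals r = Y − Zα, and compares it to the chi-squared rejection threshold:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, q = 500, 2
A = rng.normal(size=(n, q))                       # instruments
H = rng.normal(size=n)                            # hidden confounder
X = A @ np.array([1.0, 0.5]) + H + rng.normal(size=n)
Y = 1.0 * X + H + rng.normal(size=n)              # true coefficient alpha_0 = 1
Z = X[:, None]

def T_n(alpha, Y, Z, A):
    """Test statistic with c(n) = n: n * r'P_A r / r'r for residuals r = Y - Z alpha."""
    r = Y - Z @ alpha
    r_proj = A @ np.linalg.solve(A.T @ A, A.T @ r)   # P_A r
    return len(r) * (r @ r_proj) / (r @ r)

threshold = chi2.ppf(0.95, df=q)                  # Q_{chi^2_q}(1 - p_min) with p_min = 0.05
print(T_n(np.array([1.0]), Y, Z, A))              # near the truth: moderate value
print(T_n(np.array([3.0]), Y, Z, A) > threshold)  # a clearly wrong alpha is rejected
```

At a far-off coefficient the residuals retain a strong component of X, which is correlated with A, so the statistic is large and the hypothesis is rejected.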

The PULSE estimator
For any level p_min ∈ (0, 1), we formally define the PULSE estimator of equation (3.14) by letting the feasible set be given by the acceptance region A_n(1 − p_min) of H₀(α) using the test of Lemma 3.1. That is, we consider

α̂_PULSE^n(p_min) := arg min_α l_OLS^n(α) subject to T_n(α) ≤ Q_{χ²_q}(1 − p_min).   (3.15)

In general, this is a nonconvex optimization problem (Boyd and Vandenberghe, 2004), as the constraint function is nonconvex; see the blue contours in Figure 1(left). From Figure 1(right) we see that in the given example the problem nevertheless has a unique and well-defined solution: the smallest level set of l_OLS^n with a nonempty intersection with the acceptance region {α : T_n(α) ≤ Q_{χ²_q}(1 − p_min)} intersects with the latter region in a unique point. In Section 3.4, we prove that this is not a coincidence: equation (3.15) has a unique solution that coincides with the solution of a strictly convex, quadratically constrained quadratic program (QCQP) with a data-dependent constraint bound. In Section 3.5, we further derive an equivalent Lagrangian dual problem. This has two important implications: (1) it allows us to construct a computationally efficient procedure to compute a solution of the nonconvex problem above; and (2) it shows that the PULSE estimator can be written as a K-class estimator.
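For intuition, in a one-dimensional just-identified setting the definition in (3.15) can be approximated by brute force: scan a grid of coefficients, keep those inside the acceptance region, and return the one with the smallest OLS loss. This is only an illustration on simulated data (with the statistic assumed to be T_n(α) = n·rᵀP_A r/rᵀr); the efficient procedure of Section 3.5 is the method actually proposed.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, q = 500, 1
A = rng.normal(size=(n, q))                    # single instrument: just-identified
H = rng.normal(size=n)
X = A[:, 0] + H + rng.normal(size=n)
Y = X + H + rng.normal(size=n)                 # causal coefficient 1, confounded via H
Z = X[:, None]

def l_ols(alpha):
    r = Y - Z @ alpha
    return r @ r / n

def T_n(alpha):
    r = Y - Z @ alpha
    return n * (r @ (A @ np.linalg.solve(A.T @ A, A.T @ r))) / (r @ r)

thr = chi2.ppf(0.95, df=q)                     # acceptance threshold for p_min = 0.05
grid = np.linspace(0.0, 2.0, 2001)
feasible = [a for a in grid if T_n(np.array([a])) <= thr]
pulse_grid = min(feasible, key=lambda a: l_ols(np.array([a])))
ols = np.linalg.solve(Z.T @ Z, Z.T @ Y)[0]
print(pulse_grid, ols)                         # grid approximation of PULSE vs. the OLS estimate
```

In the just-identified case the acceptance region is nonempty, since the TSLS coefficient yields T_n = 0.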
Estimators with similar constraints, albeit different optimization objectives, have been studied by Gautier et al. (2018). In Remark S7.3 in Online Appendix S7 we briefly discuss the connection to pre-test estimators. Furthermore, any method for inverting the test (see, e.g., Davidson and MacKinnon, 2014), yields a valid confidence set including the proposed point estimator (given that the method outputs the point estimator when the acceptance region is empty).

Primal representation of PULSE
We now derive a QCQP representation of the PULSE problem, which we call the primal PULSE. For all t ≥ 0 define the empirical primal minimization problem (Primal.t.n) by

minimize_α l_OLS^n(α; Y, Z) subject to l_IV^n(α; Y, Z, A) ≤ t.   (3.16)

⁶ Weak instrument asymptotics is a modelling scheme where the instrument strength tends to zero at a rate of n^{−1/2}; that is, the reduced form structural equation for the endogenous variables is given by X = n^{−1/2}ξᵀA + ε_X, with ξ the first-stage coefficient.

Figure 1. Illustrations of the level sets of l_OLS^n (red contours), the proposed test statistic T_n (blue contours), and l_IV^n (green contours) in a just-identified setup. The example is generated with a two-dimensional anchor A = (A₁, A₂), one component of which is included, and one included endogenous variable X, i.e., Y = α₁X + α₂A₁ + H + ε_Y with (α₁, α₂) = (1, 1). Both illustrations show level sets from the same setup, but they use different scales. The black text denotes the level of the test-statistic contours. In this setup, the PULSE constraint bound, i.e., the rejection threshold of the test with p_min = 0.05, is Q_{χ²₂}(0.95) ≈ 5.99. The blue level sets of T_n are nonconvex. The sublevel set of the test, corresponding to the acceptance region, is illustrated by the blue area. In the right plot, we see that the smallest level set of l_OLS^n that has a nonempty intersection with the Q_{χ²_q}(1 − p_min)-sublevel set of T_n is a singleton (black dot). This shows that in this example the PULSE problem is solvable and has a unique solution. The l_IV^n level set that intersects this singleton is exactly the t_n(p_min)-level set of l_IV^n, illustrating the statement of Theorem 3.1 in that the primal PULSE with that choice of t solves the PULSE problem. The black line visualizes the solutions {α̂_Pr^n(t) : t ∈ D_Pr}. The black points and corresponding text labels indicate which constraint bound t yields the specific point. In general, the class of primal solutions does not coincide with the class of convex combinations of the OLS and the TSLS estimators.
We drop the dependence on Y, Z, and A and refer to the objective and constraint functions as l_OLS^n(α) and l_IV^n(α). The following lemma shows that, under suitable assumptions, these problems are solvable, strictly convex QCQP problems satisfying Slater's condition.
LEMMA 3.2 (UNIQUE SOLVABILITY OF THE PRIMAL). Let Assumption 3.3 hold. Then α ↦ l_OLS^n(α) and α ↦ l_IV^n(α) are strictly convex and convex, respectively. Furthermore, for any t > inf_α l_IV^n(α), the constrained minimization problem (Primal.t.n) has a unique solution and satisfies Slater's condition. In the under- and just-identified setups the constraint bound requirement is equivalent to t > 0, and in the over-identified setup to t > l_IV^n(α̂_TSLS^n), where α̂_TSLS^n = (Zᵀ P_A Z)⁻¹ Zᵀ P_A Y.
We restrict the constraint bounds to D_Pr := (inf_α l_IV^n(α), l_IV^n(α̂_OLS^n)]. Considering t larger than inf_α l_IV^n(α) ensures that the problem (Primal.t.n) is uniquely solvable and, furthermore, that Slater's condition is satisfied (see Lemma 3.2 above). Slater's condition will play a role in Section 3.5 when establishing a sufficiently strong connection with the corresponding dual problem, for which we can derive a (semi-)closed form solution. Constraint bounds greater than or equal to l_IV^n(α̂_OLS^n) yield identical solutions. Whenever well-defined, let α̂_Pr^n : D_Pr → R^{d₁+q₁} denote the constrained minimization estimator given by the solution to the (Primal.t.n) problem

α̂_Pr^n(t) := arg min_α l_OLS^n(α) subject to l_IV^n(α) ≤ t.   (3.17)

We now prove that for a specific choice of t, the PULSE and the primal PULSE yield the same solutions. Define t_n(p_min) as the data-dependent constraint bound given by

t_n(p_min) := sup{t ∈ D_Pr : T_n(α̂_Pr^n(t)) ≤ Q_{χ²_q}(1 − p_min)},

with the convention that sup ∅ = −∞. If t_n(p_min) > −∞, or equivalently t_n(p_min) ∈ D_Pr, we define the primal PULSE problem and its solution by (Primal.t_n(p_min).n) and α̂_Pr^n(t_n(p_min)). The following theorem yields conditions under which the solutions to the primal PULSE and PULSE problems coincide.

THEOREM 3.1 (PRIMAL REPRESENTATION OF PULSE). Let p_min ∈ (0, 1), let Assumptions 3.3 and 3.4 hold, and assume that t_n(p_min) > −∞. If T_n(α̂_Pr^n(t_n(p_min))) ≤ Q_{χ²_q}(1 − p_min), then the PULSE problem has a unique solution given by the primal PULSE solution. That is, α̂_PULSE^n(p_min) = α̂_Pr^n(t_n(p_min)).

We show that t_n(p_min) > −∞ is a sufficient condition for T_n(α̂_Pr^n(t_n(p_min))) ≤ Q_{χ²_q}(1 − p_min) in the proof of Theorem 3.2. This step is postponed to the latter proof, as it follows easily from the dual representation.
Hence, we have shown that finding the PULSE estimator, i.e., finding a solution to the nonconvex PULSE problem, is equivalent to solving the convex QCQP primal PULSE for a data-dependent choice of t_n(p_min).⁷ However, t_n(p_min) is still unknown. Figure 1 shows an example of the equivalence in Theorem 3.1. Figure 1(right) shows that the level set of l_IV^n(α) = t_n(p_min) intersects the optimal level curve of l_OLS^n(α) in the same point that is obtained by minimizing over the constraint T_n(α) ≤ Q_{χ²_q}(1 − p_min). The set of solutions to the primal problem {α̂_Pr^n(t) : t ∈ D_Pr} can, in the just- and over-identified setups, be visualized as an (in general) nonlinear path in R^{d₁+q₁} between the TSLS estimator (t = l_IV^n(α̂_TSLS^n)) and the OLS estimator (t = l_IV^n(α̂_OLS^n)); see also Rothenhäusler et al. (2021). Theorem 3.1 yields that the PULSE estimator (t = t_n(p_min)) then seeks the estimator 'closest' to the OLS estimator along this path that does not yield a rejected test of simultaneously vanishing correlation between the resulting prediction residuals and the exogenous variables A; see Figure 1. The path of possible solutions is not necessarily a straight line (see the black line); thus, in general, the PULSE estimator is different from the affine combination of the OLS and TSLS estimators studied by, e.g., Judge and Mittelhammer (2012).
In the under-identified setup, the TSLS end point, corresponding to t = min_α l_IV^n(α), is instead given by the point in the IV solution space {α ∈ R^{d₁+q₁} : l_IV^n(α) = 0} with the smallest mean squared prediction residuals.

Dual representation of PULSE
In this section, we derive a dual representation of the primal PULSE problem, which we will denote the dual PULSE problem. This specific dual representation allows for the construction of a binary search algorithm for the PULSE estimator and yields that PULSE is a member of the K-class estimators with stochastic κ-parameter.
Under Assumption 3.3 (b), the infimum of l_IV^n(α) is attained (see the proof of Lemma 3.2). Hence, let the solution space of the minimization problem min_α l_IV^n(α) be given by

M_IV := arg min_α l_IV^n(α) = {α ∈ R^{d₁+q₁} : l_IV^n(α) = inf_{α'} l_IV^n(α')}.   (3.20)

In the under-identified setup (q₂ < d₁), M_IV is a (d₁ − q₂)-dimensional affine subspace of R^{d₁+q₁}, and in the just- and over-identified setups it holds that M_IV = {α̂_TSLS^n}. We now prove that, in the generic case, K-class estimators for λ ∈ [0, ∞) are different from the TSLS estimator. This result may not come as a surprise, but we include it as we need it later and have not found it elsewhere.

LEMMA 3.3 (K-CLASS ESTIMATORS AND TSLS DIFFER). Assume that we are in the just- or over-identified setup and that n > q. Furthermore, assume that ε has a density with respect to Lebesgue measure and that the coefficient matrix B of the SEM in equation (2.1) is lower triangular. If the rank conditions of Assumption 3.3 hold almost surely, then it almost surely holds that all K-class estimators with penalty parameter λ ∈ [0, ∞) differ from the TSLS estimator, i.e., α̂_TSLS^n ∉ {α̂_K^n(λ) : λ ≥ 0}.
We conjecture that the corresponding statement also holds in the under-identified setup and without the lower triangularity assumption on B; that is, M_IV ∩ {α̂_K^n(λ) : λ ≥ 0} = ∅ holds almost surely. We therefore introduce this as an assumption (Assumption 3.6).
The above corollary is proven as Corollary S4.1 in Online Appendix S4. We now show that the class of K-class estimators with penalty parameter λ ≥ 0, i.e., κ ∈ [0, 1), coincides with the class of constrained minimization estimators that solve the primal problems with constraint bounds t > min_α l_IV^n(α).

LEMMA 3.4 (CONNECTING THE PRIMAL AND DUAL). If Assumptions 3.3, 3.4, and 3.6 hold, then both of the following statements hold: (a) for any t ∈ D_Pr, there exists a unique λ(t) ≥ 0 such that (Primal.t.n) and (Dual.λ(t).n) have the same unique solution; (b) for any λ ≥ 0, there exists a unique t(λ) ∈ D_Pr such that (Primal.t(λ).n) and (Dual.λ.n) have the same unique solution.

Lemma 3.4 tells us that, under appropriate assumptions, {α̂_Pr^n(t) : t ∈ D_Pr} = {α̂_K^n(λ) : λ ≥ 0}. In other words, we have recast the K-class estimators with κ ∈ [0, 1) as the class of solutions to the primal problems previously introduced. That the minimizers of l_IV^n(α) differ from all K-class estimators with penalty λ ≥ 0 (or κ ∈ [0, 1)) guarantees that, when representing a K-class problem as a constrained optimization problem, it satisfies Slater's condition.
We are now able to show the main result of this section. The PULSE estimator α̂_PULSE^n(p_min) solves a K-class problem (Dual.λ.n) and can therefore be seen as a K-class estimator with a data-dependent parameter. To see this, let us define the dual PULSE penalty parameter, i.e., the dual analogue of the primal PULSE constraint t_n(p_min), as

λ_n(p_min) := inf{λ ≥ 0 : T_n(α̂_K^n(λ)) ≤ Q_{χ²_q}(1 − p_min)}.   (3.21)

If λ_n(p_min) < ∞, we define the dual PULSE problem by (Dual.λ_n(p_min).n) with solution

α̂_K^n(λ_n(p_min)) = arg min_{α ∈ R^{d₁+q₁}} l_OLS^n(α) + λ_n(p_min) l_IV^n(α).

THEOREM 3.2 (DUAL REPRESENTATION OF PULSE). Let p_min ∈ (0, 1) and let Assumptions 3.3, 3.4, and 3.6 hold. If λ_n(p_min) < ∞, then t_n(p_min) > −∞ and α̂_K^n(λ_n(p_min)) = α̂_Pr^n(t_n(p_min)) = α̂_PULSE^n(p_min).

Thus, the PULSE estimator seeks to minimize the K-class penalty λ, i.e., to pull the estimator along the K-class path {α̂_K^n(λ) : λ ≥ 0} as close to the OLS estimator as possible. Furthermore, the statement implies that the PULSE estimator is a K-class estimator with data-driven penalty λ_n(p_min) or, equivalently, parameter κ = λ_n(p_min)/(1 + λ_n(p_min)). Given a finite dual PULSE penalty parameter λ_n(p_min), we can, by utilizing the closed form solution of the K-class problem, represent the PULSE estimator as

α̂_PULSE^n(p_min) = α̂_K^n(λ_n(p_min)) = (Zᵀ(I + λ_n(p_min)P_A)Z)⁻¹ Zᵀ(I + λ_n(p_min)P_A)Y.

However, to the best of our knowledge, λ_n(p_min) has no known closed form, so the above expression cannot be computed in closed form either. In Section 3.5.1, we prove that the PULSE penalty parameter λ_n(p_min) can be approximated with arbitrary precision by a simple binary search procedure.
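The closed form above is straightforward to implement once λ is given. The sketch below (simulated data, hypothetical names) evaluates α̂_K(λ) = (Zᵀ(I + λP_A)Z)⁻¹Zᵀ(I + λP_A)Y and checks the two end points of the path: λ = 0 recovers OLS, while λ → ∞ approaches TSLS (κ = λ/(1 + λ) → 1).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
A = rng.normal(size=(n, 2))                          # instruments
H = rng.normal(size=n)                               # hidden confounder
X = A @ np.array([1.0, 0.5]) + H + rng.normal(size=n)
Y = X + H + rng.normal(size=n)
Z = X[:, None]

def k_class(lam, Y, Z, A):
    """K-class estimator alpha_K(lambda) = (Z'(I + lam P_A)Z)^{-1} Z'(I + lam P_A)Y."""
    PZ = A @ np.linalg.solve(A.T @ A, A.T @ Z)       # P_A Z
    PY = A @ np.linalg.solve(A.T @ A, A.T @ Y)       # P_A Y
    return np.linalg.solve(Z.T @ Z + lam * Z.T @ PZ, Z.T @ Y + lam * Z.T @ PY)

ols = np.linalg.solve(Z.T @ Z, Z.T @ Y)
PZ = A @ np.linalg.solve(A.T @ A, A.T @ Z)
tsls = np.linalg.solve(PZ.T @ Z, PZ.T @ Y)
print(np.allclose(k_class(0.0, Y, Z, A), ols))           # lambda = 0 is OLS
print(np.allclose(k_class(1e9, Y, Z, A), tsls, atol=1e-4))  # large lambda approaches TSLS
```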
The following lemma provides a necessary and sufficient (in practice, checkable) condition for when the PULSE penalty parameter λ n (p min ) is finite.
LEMMA 3.5 (INFEASIBILITY OF THE DUAL REPRESENTATION). Let p_min ∈ (0, 1) and let Assumptions 3.3, 3.4, and 3.6 hold. In the under- and just-identified setups we have that λ_n(p_min) < ∞. In the over-identified setup it holds that λ_n(p_min) < ∞ ⟺ T_n(α̂_TSLS^n) < Q_{χ²_q}(1 − p_min). This is not guaranteed to hold, as the event that A_n(1 − p_min) = ∅ can have positive probability.
Thus, under suitable regularity assumptions, Lemma 3.5 yields that our dual representation of the PULSE estimator always holds in the under- and just-identified setups. It furthermore yields a necessary and sufficient condition for the dual representation to be valid in the over-identified setup, namely that the TSLS estimator lies in the interior of the acceptance region. This condition may be violated in the over-identified setup with nonnegligible probability.
The above lemma is proven as Lemma S4.1 in Online Appendix S4. If the OLS solution is not strictly feasible in the PULSE problem, then λ_n(p_min) is indeed the smallest penalty parameter for which the test statistic attains a p-value of exactly p_min; see Lemma S4.2 in Online Appendix S4.
We propose the binary search algorithm presented as Algorithm 1 in Online Appendix S2, which can approximate a finite λ_n(p_min) with arbitrary precision.⁸ We terminate the binary search (see line 2) if λ_n(p_min) is not finite, in which case we have no computable representation of the PULSE estimator. We now prove that Algorithm 1 achieves the required precision and is asymptotically correct.
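Algorithm 1 itself is given in Online Appendix S2; the following is only a minimal sketch of the idea on simulated data, under the assumption that λ ↦ T_n(α̂_K(λ)) is monotonically decreasing: grow an upper bracket until it is accepted, then bisect for the smallest accepted λ.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, q = 500, 2
A = rng.normal(size=(n, q))
H = rng.normal(size=n)
X = A @ np.array([1.0, 0.5]) + H + rng.normal(size=n)
Y = X + H + rng.normal(size=n)
Z = X[:, None]

def k_class(lam):
    """K-class estimator (Z'(I + lam P_A)Z)^{-1} Z'(I + lam P_A)Y."""
    PZ = A @ np.linalg.solve(A.T @ A, A.T @ Z)
    PY = A @ np.linalg.solve(A.T @ A, A.T @ Y)
    return np.linalg.solve(Z.T @ Z + lam * Z.T @ PZ, Z.T @ Y + lam * Z.T @ PY)

def T_n(alpha):
    r = Y - Z @ alpha
    return n * (r @ (A @ np.linalg.solve(A.T @ A, A.T @ r))) / (r @ r)

def pulse_lambda(p_min=0.05, tol=1e-8, lam_max=1e12):
    """Binary search for the smallest lam >= 0 with T_n(alpha_K(lam)) <= Q_{chi2_q}(1 - p_min)."""
    thr = chi2.ppf(1.0 - p_min, df=q)
    if T_n(k_class(0.0)) <= thr:
        return 0.0                             # OLS already accepted (may indicate weak instruments)
    lo, hi = 0.0, 1.0
    while T_n(k_class(hi)) > thr:              # grow the bracket until hi is accepted
        lo, hi = hi, 2.0 * hi
        if hi > lam_max:
            raise RuntimeError("dual representation infeasible (TSLS rejected)")
    while hi - lo > tol * max(hi, 1.0):        # bisect; invariant: lo rejected, hi accepted
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if T_n(k_class(mid)) > thr else (lo, mid)
    return hi

lam = pulse_lambda()
alpha_pulse = k_class(lam)                     # PULSE as a K-class estimator with data-driven lam
print(lam, alpha_pulse)
```

Each iteration halves the bracket, so the cost is one K-class fit per bisection step.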

Algorithm and consistency
The dual representation of the PULSE estimator is not guaranteed to be well-defined in the over-identified setup. In particular, it is not well-defined if the TSLS estimator lies outside the interior of the acceptance region (which corresponds to a p-value of at most p_min). In this case, we propose to output a warning. This can be helpful information for the user, since it may indicate a model misspecification. For example, if the true relationship is in fact nonlinear and one considers an over-identified case (e.g., by constructing different transformations of the instrument), even the TSLS estimate may be rejected when erroneously considering a linear model; see Keane (2010) and Mogstad and Wiswall (2010). For any p_min ∈ (0, 1) we can still define an always well-defined modified PULSE estimator α̂_PULSE+^n(p_min) as α̂_PULSE^n(p_min) if the dual representation is feasible, and as some other consistent estimator α̂_ALT^n (such as the TSLS, LIML, or Fuller estimator) otherwise. That is, we define

α̂_PULSE+^n(p_min) := α̂_PULSE^n(p_min) if λ_n(p_min) < ∞, and α̂_PULSE+^n(p_min) := α̂_ALT^n otherwise.

Similarly to the case of an empty acceptance region, we also output a warning when the OLS estimate is accepted. This may, but does not have to, indicate weak instruments. Thus, we obtain the algorithm presented as Algorithm 2 in Online Appendix S2 for computing the PULSE+ estimator.
We now prove that the PULSE+ estimator consistently estimates the causal parameter in the just- and over-identified settings. Assume that we choose a consistent estimator α̂_ALT^n (under standard regularity assumptions, this is satisfied for the TSLS estimator).⁹ We can then show that, under mild conditions, the PULSE+ estimator, too, is a consistent estimator of α₀.

THEOREM 3.3 (CONSISTENCY OF PULSE+). Consider the just- or over-identified setup and let p_min ∈ (0, 1). If Assumptions 3.1, 3.3, 3.4, 3.5, and 3.6 hold almost surely for all n ∈ N and α̂_ALT^n consistently estimates α₀, then α̂_PULSE+^n(p_min) →_p α₀ as n → ∞.
We believe that a similar statement also holds in the under-identified setting; see Online Appendix S8.3.

SIMULATION EXPERIMENTS
In Online Appendix S8 we conduct an extensive simulation study investigating the finite sample behaviour of the PULSE estimator. The concept of weak instruments is central to our analysis. An introduction to weak instruments can be found in Online Appendix S10. Here, we give a brief overview of the study and the observations.

Distributional robustness
The theoretical results on distributional robustness proved in Section 2 translate to finite data. The experiments in Online Appendix S8.1 show that, even for small sample sizes, K-class estimators outperform both OLS and TSLS for a certain range of interventions, matching the theoretical predictions with increasing sample size. In Online Appendix S8.3 we also consider an under-identified setting.

Estimating causal effects
When focusing on the estimation of a causal effect in an identified setting, our simulations show that there are several settings where PULSE outperforms the Fuller and TSLS estimators in terms of MSE. In univariate simulation experiments, such settings are characterized by weakness of instruments and weak confounding (endogeneity). The characterization becomes more involved in multivariate settings, but is similar in that PULSE outperforms all other methods for small confounding strengths, an effect amplified by the weakness of instruments. Below we detail the univariate simulation setup and refer the reader to Online Appendix S8 for further details and the multivariate simulation experiments mentioned above.

Univariate model.
We first compared performance measures of the estimators in a univariate instrumental variable model. As in Hahn and Hausman (2002) and Hahn et al. (2004), we consider structural equation models of the form Y = Xγ + U_Y and X = Aᵀξ + U_X, with correlated noise variables U_Y and U_X. Furthermore, we let γ = 1 and ξ = (ξ, ..., ξ) ∈ R^q, where ξ > 0 is chosen according to the theoretical R²-coefficient. We consider a simulation scheme over each q ∈ {1, 2, 3, 4, 5, 10, 20, 30} and a grid of confounding strengths ρ.⁹ Figure 2 contains illustrations of the relative change in square-root mean squared error (RMSE) estimated from 15,000 repetitions. On the horizontal axis we have plotted the average first stage F-test statistic as a measure of weakness of instruments; see Online Appendix S10 for further details. A test for H₀: ξ = 0, i.e., for the relevancy of the instruments, at a significance level of 5% has rejection thresholds in the range [1.55, 4.04], depending on n and q. The vertical dashed line corresponds to the smallest rejection threshold of 1.55 and the dotted line corresponds to the 'rule of thumb' threshold of 10. Note that the lowest possible relative change is −1 and a positive relative change means that PULSE is better.

⁹ This holds as α̂_TSLS^n = α₀ + (n⁻¹ZᵀA(n⁻¹AᵀA)⁻¹n⁻¹AᵀZ)⁻¹ n⁻¹ZᵀA(n⁻¹AᵀA)⁻¹ n⁻¹AᵀU_Y.
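A minimal version of this simulation scheme can be sketched as follows; the parametrization is our reconstruction (the exact grids of ρ, ξ, and n are in Online Appendix S8), and the first-stage F-statistic used on the horizontal axis is computed from the regression of X on A with an intercept.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate(n=100, q=3, gamma=1.0, xi=0.3, rho=0.2):
    """One draw from a univariate IV model Y = X*gamma + U_Y, X = A'xi_vec + U_X,
    with Corr(U_X, U_Y) = rho (our reconstruction of the setup)."""
    A = rng.normal(size=(n, q))
    U = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)  # (U_X, U_Y)
    X = A @ np.full(q, xi) + U[:, 0]
    Y = gamma * X + U[:, 1]
    return Y, X, A

def first_stage_F(X, A):
    """F-statistic for H_0: xi = 0 in the first-stage regression of X on A (with intercept)."""
    n, q = A.shape
    A1 = np.column_stack([np.ones(n), A])
    beta, *_ = np.linalg.lstsq(A1, X, rcond=None)
    rss1 = np.sum((X - A1 @ beta) ** 2)        # residual sum of squares, full first stage
    rss0 = np.sum((X - X.mean()) ** 2)         # residual sum of squares, intercept only
    return ((rss0 - rss1) / q) / (rss1 / (n - q - 1))

Y, X, A = simulate()
print(first_stage_F(X, A))                     # instrument-strength measure (horizontal axis)
```

Averaging this F-statistic over repetitions, for varying ξ, reproduces the instrument-strength axis of Figure 2.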
In Online Appendix S11, further illustrations of, e.g., the relative change in mean bias and variance of the estimators are presented. We also conducted the simulations for setups with combinations of γ ∈ {−1, 0}, components of ξ chosen negative or with randomly flipped signs in each coordinate, and negative ρ (not shown but available in the folder 'Plots' in the code repository). The results with respect to MSE are similar to those shown in Figure 2, while the bias comparison changes depending on the setup.
We observe that there are settings in which the PULSE is superior to TSLS, Fuller(1), and Fuller(4) in terms of MSE. This is particularly often the case in weak instrument settings (Ê_N(G_n) < 10) with low confounding strength (ρ ≤ 0.2). Furthermore, as we tend towards the weakest instrument setting considered, we also see a gradual shift in favour of PULSE for higher confounding strengths. In these settings with weak instruments and low confounding we also see that OLS is superior to the PULSE in terms of MSE. However, for large confounding setups PULSE is superior to OLS in terms of both bias and MSE, and this superiority increases as the instrument strength increases. The PULSE is generally more biased than the Fuller and TSLS estimators, but less biased than OLS. However, in the settings with weak instruments and low confounding, the bias of PULSE and OLS is comparable. In summary, the PULSE is, in these settings, more biased but its variance is so small that it is MSE superior to the Fuller and TSLS estimators.

Downloaded from https://academic.oup.com/ectj/article/25/2/404/6380481 by Det Kongelige Bibliotek user on 06 July 2022

EMPIRICAL APPLICATIONS
We now consider three classical instrumental variable applications; see Albouy (2012) and Buckles and Hungerman (2013) for discussions on the underlying assumptions.
(i) 'Does compulsory school attendance affect schooling and earnings?' by Angrist and Krueger (1991). This paper investigates the effects of education on wages. The endogeneity of education is remedied by instrumenting education on quarter-of-birth indicators.
(ii) 'Using geographic variation in college proximity to estimate the return to schooling' by Card (1993). This paper also investigates the effects of education on wages; here, education is instrumented by a college proximity indicator.
(iii) 'The colonial origins of comparative development: An empirical investigation' by Acemoglu et al. (2001). This paper investigates the effects of extractive institutions (proxied by protection against expropriation) on gross domestic product (GDP) per capita. The endogeneity of the explanatory variable is remedied by instrumenting protection against expropriation on early European settler mortality rates.
We have applied the estimators OLS, TSLS, PULSE, and Fuller to the classical data sets of Acemoglu et al. (2001), Angrist and Krueger (1991), and Card (1993). In all models considered in Angrist and Krueger (1991) and Card (1993), where we estimate the effect of years of education on wages using quarter of birth and proximity to college as instruments, respectively, the OLS estimates are not rejected by our test statistic, and PULSE outputs the OLS estimates; see Online Appendix S9 for further details. This may be due either to weak endogeneity (weak confounding) or to the test having insufficient power to reject the OLS estimates because of weak instruments or severe over-identification.

Acemoglu et al. (2001)

The dataset of Acemoglu et al. (2001) consists of 64 observations, each corresponding to a different country for which estimates of the mortality rates encountered by the first European settlers are available. The target of interest is log GDP per capita (in 1995). The main endogenous regressor in the dataset is an index of expropriation protection (averaged over 1985–1995), i.e., protection against expropriation of private investment by the respective governments. The average expropriation protection is instrumented by the settler mortality rates. We consider eight models, M1–M8, which correspond to the models presented in columns (1)–(8) of table 4 in Acemoglu et al. (2001). Model M1 is given by the reduced form structural equations

log GDP = avexpr · γ + μ₁ + U₁,    avexpr = log em4 · δ + μ₂ + U₂,

where avexpr is the average expropriation protection, em4 is the settler mortality rate, μ₁ and μ₂ are intercepts, and U₁ and U₂ are possibly correlated, unobserved noise variables. In model M2 we additionally introduce an included exogenous regressor describing the country's latitude.

In models M3 and M4 we fit models M1 and M2, respectively, on a dataset from which we have removed the Neo-European countries Australia, Canada, New Zealand, and the United States. In models M5 and M6 we fit models M1 and M2, respectively, on a dataset from which we have removed the observations from the continent of Africa. In models M7 and M8 we again fit models M1 and M2, respectively, but now also include three exogenous indicators for the continents Africa, Asia, and other. Table 1 shows the OLS and TSLS estimates (which replicate the values from the study), as well as the Fuller(4) and PULSE estimates, for the linear effect of average expropriation protection on log GDP. In model M1, for example, the PULSE estimate of the linear effect of average expropriation protection on log GDP is 0.6583, which is 26% larger than the OLS estimate but 34% smaller than the TSLS estimate. In models M5–M8, the OLS estimates are not rejected by the Anderson-Rubin test, so the PULSE estimates coincide with the OLS estimates.
We can also use this example to illustrate the robustness property of K-class estimators; see Theorem 2.1. Even though interventional data are not available, we can consider the mean squared prediction error when holding out the observations with the most extreme values of the instrument. Depending on the degree of generalization, we indeed see that the PULSE and Fuller tend to outperform OLS or TSLS in terms of mean squared prediction error on the held out data; see Online Appendix S9.3 for further details.

SUMMARY AND FUTURE WORK
We have proven that a distributional robustness property similar to the one shown for anchor regression (Rothenhäusler et al., 2021) fully extends to general K-class estimators of possibly nonidentifiable structural parameters in a general linear structural equation model that allows for latent endogenous variables. We have further proposed a novel estimator for structural parameters in linear structural equation models. This estimator, called PULSE, is derived as the solution to a minimization problem in which we minimize the mean squared prediction error under the constraint that the estimator lies in a confidence region for the causal parameter. Even though this region is nonconvex, we have shown that the corresponding optimization problem admits a computationally efficient algorithm that approximates the estimator with arbitrary precision using a simple binary search procedure. In the under-identified setting, this estimator extends existing work in the machine learning literature that considers invariant subsets or the best predictive sets among them: PULSE is applicable even in situations where no invariant subset exists. We have proven that this estimator can also be written as a K-class estimator with a data-driven κ-parameter that lies between zero and one. Simulation experiments show that in various settings with weak instruments and weak confounding, PULSE outperforms other estimators, such as the Fuller(4) estimator. We thus regard PULSE as an interesting alternative for estimating causal effects in instrumental variable settings. It is easy to interpret and automatically provides the user with feedback when the OLS estimate is accepted (which may indicate that the instruments are too weak) or when the TSLS estimate lies outside the acceptance region (which may indicate model misspecification). We have applied the different estimators to classical data sets and have seen that, indeed, K-class estimators tend to be more distributionally robust than OLS and TSLS.
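The binary-search idea can be sketched as follows on simulated data. The test below, an n·R² statistic for correlation between residuals and the instrument, is a simplified stand-in for the test inversion that defines PULSE, and all coefficients are hypothetical; the search returns the smallest κ whose K-class estimate is not rejected.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
U = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
z = rng.normal(size=n)                        # instrument
x = -0.5 * z + U[:, 1]                        # endogenous regressor
y = 0.9 * x + U[:, 0]                         # outcome; true coefficient 0.9
X, Z = x[:, None], z[:, None]

Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)  # first-stage fitted values

def k_class(kappa):
    W = (1 - kappa) * X.T @ X + kappa * Xhat.T @ X
    b = (1 - kappa) * X.T @ y + kappa * Xhat.T @ y
    return np.linalg.solve(W, b)

def stat(beta):
    # n * R^2 from regressing the residuals on the instrument; roughly
    # chi-squared(1) when the residuals are uncorrelated with Z
    r = y - X @ beta
    rhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ r)
    return n * (r @ rhat) / (r @ r)

CRIT = 3.841                                  # chi-squared(1) critical value, level 0.05

if stat(k_class(0.0)) <= CRIT:
    kappa_star = 0.0                          # OLS itself is accepted
else:
    lo, hi = 0.0, 1.0                         # bisect; the statistic vanishes at kappa = 1
    for _ in range(50):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if stat(k_class(mid)) > CRIT else (lo, mid)
    kappa_star = hi

print("data-driven kappa:", kappa_star)
```

In this just-identified example the TSLS residuals are exactly orthogonal to the instrument, so the statistic is zero at κ = 1 and the search always terminates at an accepted κ.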
There are several further directions that we consider worthwhile investigating. These include a better understanding of finite-sample properties and, for identified setups, the study of loss functions other than the MSE. It would be helpful, in particular with respect to real-world applications, to understand to what extent similar principles can be applied to models allowing for a time structure of the error terms. We believe that the simple primal form of PULSE could make it applicable to model classes that are more complex than linear models (see also Christiansen et al., 2020). Our procedure can be combined with other tests, and it could be interesting to find efficient optimization procedures for tests that are robust with respect to weak instruments, such as Kleibergen's K-statistic (Kleibergen, 2002). In an under-identified setting, the causal parameters are not identified, but the solutions obtained by optimizing predictability under invariance might be promising candidates for models that generalize well to distributional shifts.