The Twin Instrument: Fertility and Human Capital Investment

Twin births are often used to instrument fertility to address (negative) selection of women into fertility. However, recent work shows positive selection of women into twin birth. Thus, while OLS estimates will tend to be downward biased, twin-IV estimates will tend to be upward biased. This is pertinent given the emerging consensus that fertility has limited impacts on women's labour supply, or on investments in children. Using data for developing countries and the United States, we demonstrate the nature and size of the bias in the twin-IV estimator of the quantity-quality trade-off and estimate bounds on the true parameter.


Introduction
Following Becker (1960), fertility has been modeled jointly with investments in children and with women's labour force participation. In line with the average tendency for negative selection into fertility, linear least squares estimates of the associations of fertility with children's human capital and with women's employment tend to be downward biased. Since the pioneering work of Rosenzweig and Wolpin (1980a,b), a considerable literature has attempted to address selection by using twins to instrument fertility (see Appendix Table A1). The premise is that twin births are quasi-random, so that the event of a twin birth constitutes a "natural" natural experiment (Rosenzweig and Wolpin, 2000).
In a recent paper (Bhalotra and Clarke, forthcoming), we presented new population-level evidence that challenges this premise. Using individual data for 17 million births in 72 countries, we demonstrated that indicators of the mother's health, her health-related behaviours, and the prenatal health environment are systematically positively associated with the probability of a twin birth. The estimated associations are large, evident in richer and poorer countries, evident even among women who do not use IVF, and hold for sixteen different measures of health. We provided evidence that selective miscarriage is the likely mechanism.
The upshot of our findings is that women who have twin births are positively selected on unobservables related to health. If, as is plausible (and we will demonstrate), those unobservables are correlated with child human capital or with women's labour force participation, then twin-instrumented estimates of the relationship between fertility and child outcomes, or women's labour supply, will tend to be upward biased, moving towards a null estimate. This is pertinent as it could resolve the ambiguity of the available evidence on these relationships. Recent studies using the twin instrument reject the presence of a quantity-quality (QQ) fertility trade-off, challenging a longstanding theoretical prior of Becker (1960), Becker and Lewis (1973) and Becker and Tomes (1976).
Similarly, research using the twin instrument finds that additional children have relatively little influence on women's labour market participation, at least after the first few years (Rosenzweig and Wolpin, 1980a; Bronars and Grogger, 1994; Jacobsen et al., 1999; Vere, 2011). In principle, addressing the omission of maternal health related variables could adjust for the upward bias in these studies, and provide a true estimate of the trade-offs. In practice, maternal health is multi-dimensional and almost impossible to fully measure and adjust for. To take a few examples, foetal health is potentially a function of whether pregnant women skip breakfast (Mazumder and Seeskin, 2015), whether they suffer bereavement in pregnancy (Black et al., 2016), and foetal exposure to air pollution (Chay and Greenstone, 2003).
In this paper we investigate how inference in a literature concerned with causal effects of fertility on human capital can proceed with partial adjustment and bounding. We first illustrate the hypothesized direction of the bias of the twin-IV estimator, by introducing available controls for maternal health in the estimation. Since this adjustment is necessarily partial, we proceed to estimate bounds on the IV estimates. Given that the first stage (twins predicting fertility) is powerful, we follow Conley et al. (2012) in estimating bounds on the premise that twin births are plausibly if not strictly exogenous. In a sensitivity check, we also estimate bounds under the different assumptions of Nevo and Rosen (2012), again using twin births as an "imperfect instrumental variable".
We provide estimates for the US using about 225,000 births, drawn from the US National Health Interview Surveys (NHIS) for 2004-2014, and for a pooled sample of developing countries, containing more than 1 million births in 68 countries over 20 years, available from the Demographic and Health Surveys, or DHS. These data are chosen because they contain information on child outcomes and maternal health. Consistently using these two samples allows us to assess the generality of our findings, and it allows that the relationship of interest, as well as the violation of the exclusion restriction that concerns us, are different in richer vs. poorer countries.
We start by briefly demonstrating, on the particular data samples used in this analysis, our earlier result that the probability of twin birth is significantly positively associated with indicators of maternal health. We then set the stage by showing the routine OLS and twin-IV estimates on our data samples. The OLS estimates suggest a fertility-human capital trade-off and, following Altonji et al. (2005) to gauge the importance of unobservables, we conclude that accounting for unobservables is unlikely to dissolve the trade-off. The twin-IV estimates replicate, in our samples, the finding in recent studies that there is no discernible trade-off.
However, adjusting for available maternal health related characteristics, even though these are only a small subset of the range of relevant indicators, leads to emergence of a QQ trade-off.
This finding generalizes to recent non-linear models of the QQ trade-off (Mogstad and Wiswall, 2016), holding even when the impact of fertility is allowed to vary by parity. For instance, in samples with at least three births, an additional child is associated with lower human capital outcomes for the first two births: this is estimated as 0.05 s.d. for years of education in developing countries, and 0.06 s.d. for an index of child health in the US, and in the sample with at least two births it is 0.10 s.d. for grade progression in the US (or 0.22 fewer grades progressed).
The bounds also confirm the presence of a trade-off at certain parities for education or health outcomes. The lower bound is -0.05 to -0.06 s.d. for education in developing countries, -0.13 to -0.24 s.d. for education in the USA and -0.02 to -0.10 s.d. for child health in the USA. 1 Observe that the trade-off is no smaller in the USA than in developing countries. This is important given that the recent studies arguing there is no trade-off are set in richer countries, and a natural reconciliation of these results with earlier studies is that the trade-off may exist, but only in poorer countries where a larger share of families is credit constrained.
This said, the US sample is considerably smaller than the developing country sample and bounds are correspondingly wider. As a result, bounds are uniformly more informative in the developing country sample.
The results indicate that marginal increases in fertility often lead to diminished investments in the human capital of children, and the trade-off is not negligibly small. This is important, especially in view of growing evidence of the long run dynamic benefits of childhood investments (Heckman et al., 2013). These estimates put back on the stage the issue of a potential human capital cost to fertility. Governments actively devise policies to influence fertility: for instance, countries like China have penalized fertility, while many countries including Italy and Canada have incentivized it, often with non-linear rules. 2 Moreover, advocates of policies encouraging smaller families rest their case on larger families investing less in the quality of each child, limiting human capital accumulation and living standards (Galor and Weil, 2000; Moav, 2005).

The Fertility-Investment Trade-off and the Twin Instrument
A long-standing theoretical result in the literature on human capital formation is the existence of a quantity-quality (QQ) trade-off (Becker, 1960; Becker and Lewis, 1973; Willis, 1973; De Tray, 1973; Becker and Tomes, 1976). The essential idea of these studies is that the shadow price of child quality is increasing in child quantity and vice versa. This provides behavioural micro-foundations consistent with an empirical regularity that has been noted in cross-sectional and time series data, which is that children from large families have weaker educational outcomes (Hanushek, 1992; Blake, 1989; Galor, 2012). We replicate this pattern using our data samples from the USA and developing countries (see Appendix Figures A1 and A2).
However, empirical evidence of a QQ trade-off is ambiguous. Early work including Hanushek (1992) and Rosenzweig and Wolpin (1980a) documented significant negative effects of additional births within a family on average child educational outcomes. Using IV or difference-in-differences approaches, recent studies include estimates of a significantly positive relationship (Qian, 2009), a significant negative relationship (Grawe, 2008; Ponczek and Souza, 2012; Lee, 2008; Bougma et al., 2015) and no significant relationship (Fitzsimons and Malde, 2010); see the review in Clarke (2018). It has been argued that where the usual twin-IV approach identifies no significant relationship, allowing for non-linear and non-monotonic effects of family fertility on children's education leads to emergence of a negative relationship (Mogstad and Wiswall, 2016). In this paper, we assess twin-IV estimates on two different data samples, examining sensitivity to adjustment for maternal health in linear and non-linear models, and to (small or signed) violations of the exclusion restriction. 3 In this paper we focus nearly exclusively on the internal validity of twins estimates (IV consistency). In recent work, Bisbee et al. (2017) examine the external validity of the twin instrumented or sex-mix instrumented estimates of the impact of fertility on female labour supply. 4
2 As discussed in Mogstad and Wiswall (2016), families with children receive special treatment under the tax and transfer provisions in 28 of the 30 Organisation for Economic Co-operation and Development countries (OECD (2002)). Many of these policies are designed such that they reduce the cost of having a single child more than the cost of having two or more children, in effect promoting smaller families. For example, welfare benefits or tax credits are, in many cases, reduced or even cut off after reaching a certain number of children.

Methodology

Estimating The Quantity-Quality Trade-off with Twins
Analyses of the QQ trade-off attempt to produce consistent estimates of $\alpha_1$ in the following population-level equation:
$$\text{quality}_{ij} = \alpha_0 + \alpha_1\, \text{quantity}_j + X\alpha_X + \varepsilon_{ij} \qquad (1)$$
Here, quality is a measure of human capital attainment of, or investment in, child $i$ in family $j$, and quantity is fertility or the number of siblings of child $i$. A significant QQ trade-off implies that $\alpha_1 < 0$. Relevant family and child level controls are included, denoted $X$. As has been extensively discussed in the previous literature, estimation of $\alpha_1$ using OLS will result in biased coefficients given that child quality and quantity are jointly determined (Becker and Lewis, 1973; Becker and Tomes, 1976), and unobservable parental behaviours and attributes influence both fertility decisions and investments in children's education (Qian, 2009). The direction of the OLS bias is determined by the sign of the conditional correlation between $\text{quantity}_j$ and the unobserved error term: $E[\text{quantity}_j \cdot \varepsilon_{ij} \mid X]$. If mothers with weaker preferences for child quality have more children, OLS estimates will overstate the true QQ trade-off.
3 The twin instrument has also been used to estimate varying effects of childbearing on women's labour force participation (Rosenzweig and Wolpin, 1980b; Jacobsen et al., 1999; Angrist and Evans, 1998), and the consequences of out of wedlock births on marriage market outcomes, poverty and welfare receipt (Bronars and Grogger, 1994).
4 We note that like , our estimates suggest considerable heterogeneity by country income levels. We also observe heterogeneity by child gender.
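As a sketch of why negative selection into fertility pushes the OLS estimate below the true parameter, the textbook omitted-variable argument can be written out for the population equation. Dropping the controls $X$ (or treating them as already partialled out) is a simplifying assumption made here for illustration only.

```latex
\operatorname{plim}\, \hat{\alpha}_1^{OLS}
  = \alpha_1
  + \frac{\operatorname{Cov}(\text{quantity}_j,\ \varepsilon_{ij})}
         {\operatorname{Var}(\text{quantity}_j)}
```

With negative selection the covariance term is negative, so the OLS limit lies below $\alpha_1$. Since $\alpha_1 < 0$ under a trade-off, the OLS estimate is then larger in magnitude than the true effect.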
Following the seminal work of Rosenzweig and Wolpin (1980a), fertility has been instrumented with the incidence of twin births on the premise that they constitute an exogenous shock to family size. The 2SLS specification can be written as:
$$\text{quantity}_j = \pi_0 + \pi_1\, \text{twin}_j + X\pi_X + \nu_{ij} \qquad (2a)$$
$$\text{quality}_{ij} = \beta_0 + \beta_1\, \widehat{\text{quantity}}_j + X\beta_X + \eta_{ij} \qquad (2b)$$
where $\text{twin}_j$ is an indicator for whether the $n$th birth in family $j$ is a twin birth. As described further in section 4, a series of samples are constructed, referred to as the $n$+ groups, and consisting of children born before birth $n$ in families with at least $n$ births. The idea is that children born prior to birth $n$ (subjects) are randomly assigned either one sibling (and make up the control group) or two siblings (and make up the treatment group) at the $n$th birth, and this allows us to estimate causal impacts of the additional birth on investments in, or outcomes of, these children. The twins themselves are excluded from the estimation sample. 5 If twins are a valid instrument, the 2SLS estimate of $\beta_1$ is consistent for, and hence in the limit equal to, the parameter $\alpha_1$ from the population equation 1.
5 This takes care of the concern that since twins tend to be born with weaker endowments (e.g. birth weight), they will tend to have systematically different quality outcomes. Using data from the US,  document that twins have substantially lower birth weight, lower APGAR scores, higher use of assisted ventilation at birth and lower gestation period than singletons. We document similar endowment differences in our data samples (Appendix Figure A3 and A4).

Bhalotra and Clarke (forthcoming) provide evidence that omitted variables for maternal health may contaminate $\eta_{ij}$, and in Section 5.1 we document this for the data used in this paper. If mothers who invest more in their pregnancies (for instance by avoiding smoking before birth) also invest more in their children after birth, then the twin-IV estimates will be inconsistent. There is some evidence, for instance in Uggla and Mace (2016), that healthier mothers (indicated by health measures such as those used in our earlier work) invest more in children in a range of domains. Positive selection of mothers of twins implies:
$$E[\text{twin}_j \cdot \eta_{ij} \mid X] > 0, \quad \text{so that} \quad \operatorname{plim}\, \hat{\beta}_1 > \beta_1 \qquad (3)$$
We can partition the stochastic error term from equation 2b into a vector of observable measures of mother's health capital ($H$), socioeconomic variables ($S$), and all other unobserved components, as $\eta_{ij} = H + S + \eta^{*}_{ij}$. Assuming a positive (or zero) covariance between the three components of the error term, 6 the step-by-step removal of selection predictors will result in the estimated coefficient becoming continually closer to the true parameter. Thus:
$$\hat{\beta}_1 \ge \hat{\beta}_1^{H} \ge \hat{\beta}_1^{S+H} \ge \beta_1$$
The coefficients $\hat{\beta}_1^{H}$ and $\hat{\beta}_1^{S+H}$ refer to coefficients in a model augmented to control for observable health capital $H$, and then also observable socioeconomic status $S$. Since, as discussed further in section 5.1, it is virtually impossible to account for all determinants of twin birth, twin-IV will under-estimate the magnitude of the QQ trade-off, although addition of predictors of twins as controls will lead to the estimate approaching the true value from above.
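The direction of these biases can be illustrated with a small simulation. The sketch below is not the authors' code: the data-generating process, the 20% twin rate (far above the roughly 2% observed in the data, chosen only to keep simulation noise small), and every coefficient are hypothetical assumptions. It compares OLS, twin-IV, and twin-IV after partialling out the observed part of maternal health.

```python
import random


def cov(a, b):
    """Population covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n


def residualise(v, control):
    """Remove the linear effect of a single control from v (Frisch-Waugh step)."""
    slope = cov(v, control) / cov(control, control)
    mv, mc = sum(v) / len(v), sum(control) / len(control)
    return [x - mv - slope * (c - mc) for x, c in zip(v, control)]


def simulate(n=100_000, beta=-0.10, seed=1):
    """Return OLS, twin-IV and health-adjusted twin-IV slopes (all values assumed)."""
    rng = random.Random(seed)
    health_obs, twin, quantity, quality = [], [], [], []
    for _ in range(n):
        h_obs = rng.gauss(0, 1)    # observed maternal health (H)
        h_unobs = rng.gauss(0, 1)  # unobserved maternal health
        pref = rng.gauss(0, 1)     # preference for child quality
        # Positive twin selection: healthier mothers are more likely to have twins.
        p_twin = min(max(0.20 + 0.05 * (h_obs + h_unobs), 0.01), 0.99)
        t = 1.0 if rng.random() < p_twin else 0.0
        # Negative fertility selection: weaker quality preference, more children.
        q = 2.0 + t - 0.8 * pref + rng.gauss(0, 0.5)
        y = beta * q + 0.5 * pref + 0.2 * (h_obs + h_unobs) + rng.gauss(0, 0.5)
        health_obs.append(h_obs)
        twin.append(t)
        quantity.append(q)
        quality.append(y)
    ols = cov(quantity, quality) / cov(quantity, quantity)
    iv = cov(twin, quality) / cov(twin, quantity)
    # Twin-IV after controlling for observed health only (partial adjustment).
    t_r = residualise(twin, health_obs)
    q_r = residualise(quantity, health_obs)
    y_r = residualise(quality, health_obs)
    iv_h = cov(t_r, y_r) / cov(t_r, q_r)
    return {"beta": beta, "ols": ols, "iv": iv, "iv_h": iv_h}


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions the OLS slope falls below the true parameter, the unadjusted twin-IV slope lies above it, and controlling for observed health moves the IV slope part of the way back towards the truth.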

Estimating IV Bounds with an Imperfect Instrument
Given that we can never fully control for maternal health even with the full set of observable controls, point estimation of the QQ trade-off is not possible. However, under additional assumptions relating to the failure of the IV exclusion restriction, or correlations between the IV, the endogenous variable, and unobservables we can bound the QQ trade-off. In order to proceed based on IV even in the presence of twin selection, we follow two procedures to bound the QQ trade-off using the (potentially imperfect) twin instrument.
The first of these is the Nevo and Rosen (2012) "Imperfect IV" procedure. This procedure is ideally suited to the context examined here, as it suggests that if twins are positively selected, fertility is negatively selected, and twinning and fertility are positively correlated, then the true parameter will be bounded by the OLS and the IV estimates discussed above. 7 If we are willing to additionally assume that the twin instrument is "less endogenous" than fertility (Nevo and Rosen's assumption 4), we can further tighten the bounds by forming a compound instrument based on the endogenous fertility variable and the imperfect twin instrument. This instrument, $V = \sigma_{\text{quantity}}\, \text{twin}_j - \sigma_{\text{twin}}\, \text{quantity}_j$, where $\sigma$ refers to the standard deviation, can provide tighter bounds on the $\beta_1$ parameter, with $\beta^{V}_{IV} \le \beta_1 \le \beta^{twin}_{IV}$, suggesting end points for a series of IV bounds on the parameter $\beta_1$.
The Nevo and Rosen (2012) procedure is straightforward and relies on quite weak assumptions. To produce bounds in this case, the only additional assumptions we require are: that there is negative selection into fertility, a widely accepted stance in the literature (Qian, 2009) and one that is verified in surveys querying fertility preferences, which show that less educated women desire more children (e.g. Bhalotra and Cochrane (2010)); that twins are positively selected, which is shown in Bhalotra and Clarke (forthcoming); that twin births are positively associated with fertility, which we show in the first stage regressions below; and that there is less selection into twin birth than into fertility, which seems reasonable.
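A minimal sketch of the compound-instrument idea follows. It is not the authors' implementation: it returns point estimates only (the bounds in the paper use confidence-interval end points), it omits controls, and the toy data-generating process and all its parameters are hypothetical. The twin indicator is multiplied by -1 before forming $V$, following the sign convention of footnote 7.

```python
import random
from statistics import pstdev


def cov(a, b):
    """Population covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n


def iv_slope(instrument, x, y):
    """Just-identified IV slope: Cov(instrument, y) / Cov(instrument, x)."""
    return cov(instrument, y) / cov(instrument, x)


def imperfect_iv_bounds(quantity, twin, quality):
    """Point-estimate bounds (beta_V, beta_twin) using the compound instrument V."""
    z = [-t for t in twin]  # sign flip so both correlations with the error match
    s_x, s_z = pstdev(quantity), pstdev(z)
    v = [s_x * zi - s_z * xi for zi, xi in zip(z, quantity)]
    return iv_slope(v, quantity, quality), iv_slope(z, quantity, quality)


def toy_data(n=100_000, beta=-0.10, seed=2):
    """Hypothetical data: negative fertility selection, milder positive twin selection."""
    rng = random.Random(seed)
    quantity, twin, quality = [], [], []
    for _ in range(n):
        health = rng.gauss(0, 1)  # unobserved maternal health
        pref = rng.gauss(0, 1)    # unobserved preference for child quality
        p_twin = min(max(0.20 + 0.05 * health, 0.01), 0.99)
        t = 1.0 if rng.random() < p_twin else 0.0
        q = 2.0 + t - 0.8 * pref + rng.gauss(0, 0.5)
        y = beta * q + 0.5 * pref + 0.3 * health + rng.gauss(0, 0.5)
        quantity.append(q)
        twin.append(t)
        quality.append(y)
    return quantity, twin, quality


if __name__ == "__main__":
    x, z, y = toy_data()
    print(imperfect_iv_bounds(x, z, y))
```

In this toy setting the estimate based on $V$ is a weighted average of the OLS and twin-IV slopes, which is why it tightens the lower bound relative to OLS while leaving the twin-IV estimate as the upper end point.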
7 We can follow the notation of Nevo and Rosen (2012) precisely if we multiply twins by -1, as their assumptions and lemmas are based on identically signed correlations between the endogenous variable and unobservables, and the IV and unobservables. In our case, once twins is multiplied by -1, assumption 3 is met assuming negative fertility selection and positive twin selection: $\rho_{xu}\rho_{zu} \ge 0$, where $\rho$ denotes correlation. In the notation of our paper, $x$ refers to quantity in equation 1, $z$ refers to twin in equation 2a, and $u$ refers to the unobservable stochastic term $\varepsilon_{ij}$ in equation 1. Then, under Nevo and Rosen (2012, Lemma 1), $\sigma_{xz} < 0$, or the negative of twins and fertility will be negatively correlated, and as such $\beta_{OLS} \le \beta_1 \le \beta^{twin}_{IV}$.
The upper bound in the case of Nevo and Rosen is the upper end of the 95% confidence interval on the original twin IV estimate $\beta^{twin}_{IV}$. From equation 3 we know that positive selection of twins inflates this IV estimate upwards. As such, to offer a more informative identification region at the upper bound, we also implement an alternative approach to inference for IV models developed by Conley et al. (2012) for cases when the instrument is plausible but fails the exclusion restriction. They provide an operational definition of plausibly (or approximately) exogenous instruments, defining a parameter $\gamma$ that reflects how close the exclusion restriction is to being satisfied in the following model (adapted to the QQ model for this paper):
$$\text{quality}_{ij} = \delta_0 + \delta_1\, \text{quantity}_j + \gamma\, \text{twin}_j + X\delta_X + \vartheta_{ij}$$
Since the parameters $\delta_1$ and $\gamma$ are not jointly identified, prior information or assumptions about $\gamma$ are used to obtain estimates of the parameter of interest, $\delta_1$. The IV exclusion restriction is equivalent to imposing ex-ante that $\gamma$ is precisely equal to zero. Rather than assuming this holds exactly, one can define plausible exogeneity as a situation in which $\gamma$ is nearly, but not precisely, equal to zero.
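To see how $\gamma$ maps into the bias of the conventional twin-IV estimate, one can combine the plausibly exogenous model described above, in which the twin indicator enters the quality equation directly with coefficient $\gamma$, with a linear first stage. This is a sketch under two assumptions made here: the first-stage effect of a twin birth on fertility, written $\pi_1$, is non-zero, and the remaining error is uncorrelated with the twin indicator.

```latex
\underbrace{\delta_1 \pi_1 + \gamma}_{\text{reduced-form effect of twin on quality}}
\quad\Longrightarrow\quad
\operatorname{plim}\, \hat{\beta}_1^{IV}
  = \frac{\delta_1 \pi_1 + \gamma}{\pi_1}
  = \delta_1 + \frac{\gamma}{\pi_1}
```

A positive $\gamma$ therefore shifts the twin-IV estimate upward, and by more the weaker the first stage.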
Estimating or imposing some (weaker) restriction on $\gamma$ buys the identifying information to bound the parameter of interest, even when the IV exclusion restriction does not hold exactly. 8 Conley et al.'s methods are ideally suited to the empirical application of this paper because they show that their bounds are most informative when the instruments are strong, and the twin instrument is strong (evidence below). In section 5.1, we provide evidence that leads us to suspect that $\gamma$ will not equal zero. Specifically, $\gamma$ will reflect the effect of unobserved maternal health on child quality, interacted with the degree to which twin mothers are healthier than non-twin mothers. 9
8 Conley et al. (2012) state that "Manski and Pepper (2000) consider treatment effect bounds with instruments that are assumed to monotonically impact conditional expectations, which is roughly analogous to assuming $\gamma \in [0, \infty]$". The procedure we follow here is hence an extension of the Manski and Pepper procedure.
9 If one or other of these conditional correlations is equal to zero, IV estimates will remain consistent. Section 5.1 only shows that twin mothers are healthier than mothers of singletons. To complement this, we also show below a series of positive associations of maternal health and both investments in children and child outcomes. We also discuss how this can be estimated in reduced form from natural experiments in particular settings.
Conley et al. (2012) show that bounds for the IV parameter $\beta_1$ from equation 2b can be generated under a series of assumptions regarding $\gamma$. These include a simple assumption regarding the support of $\gamma$ (their "Union of Confidence Intervals", or UCI, approach), or a fully specified prior for the distribution of $\gamma$ (their "Local to Zero", or LTZ, approach). In the latter case, a correctly specified prior often leads to tighter bounds. We follow both strategies: the first is agnostic, placing little structure on the violation of the exclusion restriction by simply allowing a large range for $\gamma$, and the second involves estimating $\gamma$ as an auxiliary model parameter.
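The union-of-confidence-intervals idea can be sketched for the simplest case of one endogenous regressor, one instrument and no controls. The code below is an illustrative assumption-laden sketch, not the authors' implementation: it uses homoskedastic standard errors, a user-supplied grid for $\gamma$, and the function names are hypothetical. The LTZ approach is not covered.

```python
import math
import random


def moments(x, z, y):
    """Sample size and the population (co)variances needed for just-identified IV."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n

    def c(a, ma, b, mb):
        return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / n

    return n, {"xx": c(x, mx, x, mx), "zz": c(z, mz, z, mz), "yy": c(y, my, y, my),
               "zx": c(z, mz, x, mx), "zy": c(z, mz, y, my), "xy": c(x, mx, y, my)}


def uci_bounds(x, z, y, gamma_grid, crit=1.96):
    """Union of confidence intervals for delta_1 over assumed values of gamma."""
    n, m = moments(x, z, y)
    lower, upper = math.inf, -math.inf
    for g in gamma_grid:
        # Work with y - g*z, for which z is a valid instrument if gamma equals g.
        zy = m["zy"] - g * m["zz"]
        xy = m["xy"] - g * m["zx"]
        yy = m["yy"] - 2 * g * m["zy"] + g * g * m["zz"]
        b = zy / m["zx"]
        resid_var = yy - 2 * b * xy + b * b * m["xx"]
        se = math.sqrt(resid_var * m["zz"] / (n * m["zx"] ** 2))
        lower, upper = min(lower, b - crit * se), max(upper, b + crit * se)
    return lower, upper


def toy_data(n=100_000, delta1=-0.10, gamma=0.05, seed=3):
    """Hypothetical data in which the twin indicator has a direct effect gamma."""
    rng = random.Random(seed)
    x, z, y = [], [], []
    for _ in range(n):
        pref = rng.gauss(0, 1)
        t = 1.0 if rng.random() < 0.20 else 0.0
        q = 2.0 + t - 0.8 * pref + rng.gauss(0, 0.5)
        x.append(q)
        z.append(t)
        y.append(delta1 * q + gamma * t + 0.5 * pref + rng.gauss(0, 0.5))
    return x, z, y


if __name__ == "__main__":
    x, z, y = toy_data()
    print(uci_bounds(x, z, y, [i * 0.01 for i in range(11)]))
```

Passing the single value `[0.0]` as the grid reproduces the conventional IV confidence interval, so the widening produced by a larger support for $\gamma$ can be read off by comparing the two calls.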
In general, the Conley et al. (2012) procedure relies on additional assumptions, as we must form a prior over the magnitude of the failure of the exclusion restriction, while in Nevo and Rosen (2012) we only need to provide the sign. 10 The advantage of the Conley et al. procedure, which makes it worthwhile despite its stronger assumptions, is that it potentially returns tighter bounds at both the upper and lower ends, while Nevo and Rosen retains the original IV upper bound and only tightens the lower bound using information from the original OLS estimates.

Data and Descriptive Statistics
We shall consistently estimate OLS and twin-IV estimates employing microdata from the US and from a sample of 68 developing countries. In order to estimate the (health and SES augmented) specification 1, we require information on sibling-linked births, measures of child quality and characteristics of the mother that include indicators of her health in addition to the more commonly available age, race and education. The data we use are chosen to satisfy these requirements. These are the US NHIS, which have been fielded in an identical way from 2004-2014, and the DHS for 68 countries, which have been applied over 20 years using a broadly similar design.
In both data sets, children are included in the sample if aged between 6 and 18 years when surveyed. While ideally we would observe completed education, to our knowledge no large datasets are available measuring child's completed education, mother's total fertility, and a wide range of maternal health measures taken before the birth of the child. We would have liked to use the data used in recent prominent studies of the QQ trade-off (Mogstad and Wiswall, 2016), but the Israeli data do not contain indicators of maternal condition or maternal behaviours, and the Norwegian data are not publicly accessible, and additionally contain very few markers of maternal health.
A measure of child 'quality' available in both data sets is educational attainment. Since the children are 6-18 and in the process of acquiring education, we use an age-standardized z-score.
In the DHS, the reference group consists of children in the same country and birth cohort, while in the NHIS, it consists of children with the same month and year of birth. Thus coefficients are expressed in standard deviations. While in the developing country setting relative school progress is an appropriate measure of child human capital given high rates of dropout and/or over-age school entry, this is not the case in the USA. In these data, grade-retention is a relevant measure of educational progress. It is estimated that between 2 and 6% of children are held back at least one grade in primary school (Warren et al., 2014). Grade retention has also been documented to have substantial subsequent impacts on school drop-out and long term attainment (Manacorda, 2012). The NHIS also provides a subjectively assessed binary indicator of child health (excellent or not), which we model as an additional indicator of child quality. 11 Case et al. (2002) have demonstrated that an identical self-reported measure of health predicts mortality and morbidity in the US population. Further details on all variable definitions are provided in Appendix B.
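The age-standardised z-score described above can be sketched as a within-group standardisation. The field names below are hypothetical, and two choices are assumptions made here rather than details given in the text: the use of the population standard deviation, and assigning a score of zero when a reference group has no variation.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def standardise_within_groups(records, group_key, value_key):
    """Return z-scores of value_key computed within each reference group."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_key(rec)].append(rec[value_key])
    stats = {g: (fmean(v), pstdev(v)) for g, v in groups.items()}
    scores = []
    for rec in records:
        mean, sd = stats[group_key(rec)]
        # A group without variation is uninformative; 0.0 is assigned by assumption.
        scores.append((rec[value_key] - mean) / sd if sd > 0 else 0.0)
    return scores


if __name__ == "__main__":
    children = [
        {"country": "A", "birth_year": 2000, "school_years": 4},
        {"country": "A", "birth_year": 2000, "school_years": 6},
        {"country": "B", "birth_year": 2000, "school_years": 1},
        {"country": "B", "birth_year": 2000, "school_years": 3},
    ]
    # DHS-style reference group: same country and birth cohort.
    print(standardise_within_groups(
        children, lambda r: (r["country"], r["birth_year"]), "school_years"))
```

For the NHIS-style reference group, the same function could be called with a key built from month and year of birth.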
Appendix Table A2 provides summary statistics for the DHS and NHIS data. Fertility and maternal characteristics are described at the level of the mother, while child education and health outcomes are described at the level of the child. Twin births make up 1.98% of all births in the DHS sample, and 2.57% in the NHIS sample. As expected, twin families are larger than non-twin families. Figure 1 describes total fertility in twin and non-twin families. The distribution of family size in families where at least one twin birth has occurred dominates the distribution for all-singleton families in both the DHS sample (Figure 1a) and the US sample (Figure 1b). This establishes the relevance (power) of the twin instrument for fertility, which is formally assessed below.
Estimation Samples. Studies that instrument fertility with the occurrence of a twin birth leverage the unexpected additional child to study impacts on outcomes of siblings born before the additional child. Define families with at least two birth events as 2+ families. In this group, we shall compare families in which twins occur at the second birth event (treated group) with families in which a singleton occurs at second order (control group). The subjects, for whom we measure indicators of child quality (proxies for parental investment), are the first-born children. Following , we similarly construct a 3+ sample which consists of families with at least three birth events and then we compare outcomes for the first two births across families that have a twin birth at order three (treated) and families that have a single birth at order three (control). Many existing studies, such as , focus upon the 2+ and 3+ samples. Given higher fertility rates in the developing country sample that we analyse, we also include 4+ families in which twins occur at fourth order and outcomes are studied for the first three births.
Restricting the sample to families with at least n births in this way primarily ensures that we avoid selection on preferences over family size. It also addresses the potential problem that, since the likelihood of a twin birth is increasing in birth order (see Appendix Figures A5 and A6), increasing family size raises the chances of having a twin birth. In the DHS sample, 42% of all children are in one of the 2+, 3+ or 4+ samples. In the US sample, this value is 45%. Children will be in none of these samples if they are either high birth order children, or if they are low birth order children who do not have younger siblings.
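The construction of the n+ samples can be sketched as follows. The data layout is a hypothetical one chosen for illustration: each family is an ordered list of birth events, with 1 for a singleton and 2 for twins. How multiple births occurring before event n are handled is not specified in the text, so the sketch simply treats each earlier birth event as one subject.

```python
def build_n_plus_sample(families, n):
    """Rows (family_id, birth_order, twin_at_n) for children born before birth event n.

    families maps a family id to its ordered birth events (1 = singleton, 2 = twins).
    Only families with at least n birth events enter the n+ sample, and the children
    born at event n itself (including any twins) are excluded from the rows.
    """
    rows = []
    for family_id, events in families.items():
        if len(events) < n:
            continue  # fewer than n birth events: not in the n+ sample
        twin_at_n = 1 if events[n - 1] >= 2 else 0
        for order in range(1, n):
            rows.append((family_id, order, twin_at_n))
    return rows


if __name__ == "__main__":
    families = {"a": [1, 2], "b": [1, 1, 1], "c": [1]}
    print(build_n_plus_sample(families, 2))  # first-borns in 2+ families
    print(build_n_plus_sample(families, 3))  # first two births in 3+ families
```

The `twin_at_n` flag plays the role of the instrument, while the rows themselves are the subjects whose outcomes are compared.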

Twin Births and Maternal Condition
In Bhalotra and Clarke (forthcoming) we document that mothers with greater health stocks prior to conception or those who engage in more healthy behaviours or are in a healthier environment during pregnancy are more likely to take twins to term. In other words, twins are born to selectively healthy mothers. In order for this to invalidate twin-IV estimates, two conditions must be satisfied. First, twins must be non-random conditional on observable controls (non-independence) and second, twins must have an impact on the outcome of interest beyond that mediated by fertility (non-excludability). Here we document that this is the case in the two data samples used in this paper, and direct readers to Bhalotra and Clarke (forthcoming) where additional evidence in other contexts is presented.
Using the two data sets analysed in this paper, we regress the probability of a twin birth on indicators of maternal health, holding constant socioeconomic status and demographic characteristics. In the US sample (which is much smaller, limiting statistical power, see Table 1), twinning is positively associated with mother's education and BMI, and negatively associated with the mother's smoking status prior to the birth. The smoking indicator is statistically significant even in the pre-IVF period. In Bhalotra and Clarke (forthcoming) we use the universe of births in the US, between 2010 and 2013, and after removing births assisted by Artificial Reproductive procedures such as IVF, we document negative associations of twinning with diabetes and hypertension before pregnancy, with smoking before and during pregnancy and with being short or underweight before pregnancy. 12 In the developing country sample (Table 2), we observe that, conditional on maternal age and country and year of birth fixed effects, twin births are positively associated with the mother's education and health, proxied by her height and body mass index (BMI). This result holds even in a period before IVF became available (column 5), and in both low and middle income countries. We also identify a statistically significant positive impact of public health availability on the likelihood of twinning (column 6). 13 We also investigated whether the source of twin non-randomness additionally has a direct effect on the outcome of interest. This seems plausible since mothers with better health stocks and mothers engaging in positive behaviours prior to pregnancy are likely to be healthier themselves and have stronger preferences over health and educational investments in children following pregnancy, with direct impacts on child outcomes. 
Evidence of positive causal effects of maternal health on child health or education is not so easy to find, but evidence of associations for health is in Uggla and Mace (2016) and Kahn et al. (2002). We document similar associations using our analysis samples. The US results are in Table 3. We regress available measures of child investment (whether the child has any type of health coverage) and outcomes (whether the child has any health limits, the child's standardised educational achievement, and whether the child is classified by parents as being in excellent health), on the maternal characteristics documented to predict twinning in this sample. In each case, we observe that positive maternal health measures are correlated with a reduced likelihood of having health limitations or not having insurance (columns 1-2), and correlated with positive measures of human capital outcomes (education and self-reported health status; columns 3-4). The developing country results are in Table 4. Maternal height, BMI and education are all positively associated with the likelihood of making more positive antenatal investments in child outcomes (the number of appointments, and the likelihood of giving birth in a medical centre rather than at home). We also see impacts of the same maternal health indicators on the child's education. 14 In summary, there is compelling evidence that mothers of twins are selectively healthy.
There is also suggestive evidence that healthier women make greater investments in children and that their children have better human capital outcomes. We will test this more formally when progressively introducing controls in IV models in the following section.

The QQ Trade-off
We now turn to estimates of the QQ trade-off. We initially present the routine OLS and twin-IV estimates since, under the assumptions about selection into fertility discussed in section 3.1, these provide bounds on the true parameter. In each case, we show how these estimates are modified upon addition of available controls for the mother's health. So as to ascertain that the indicators of health are not simply proxying for socio-economic status, we also introduce controls for mother's education. Our expectation is that the introduction of controls will tighten the bounds, diminishing the size of the trade-off estimated by OLS and increasing the size of the IV estimated trade-off. The former would confirm the hypothesis of negative selection into fertility and the latter would confirm positive selection into twin birth, affording a direct test of our hypothesis that the twin-IV estimator is biased downward by virtue of twins being born to healthier mothers.
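The bounding logic in the preceding paragraph can be illustrated with a small simulation. The sketch below is not the paper's estimation code: the data-generating process, the parameter values and the names `simulate` and `slope` are all assumptions made for illustration. It builds in negative selection into fertility and positive selection into twinning on a single maternal health variable, and then shows that OLS lies below the true coefficient, the twin-IV estimate lies above it, and both move towards it once health is controlled for.

```python
import numpy as np

def simulate(n=100_000, beta=-0.05, seed=0):
    """Stylised data (all numbers are illustrative assumptions, not estimates)."""
    rng = np.random.default_rng(seed)
    health = rng.normal(size=n)                       # maternal health
    # Positive selection into twinning: healthier mothers twin more often.
    twin = (rng.random(n) < 0.05 + 0.03 * (health > 0)).astype(float)
    # Negative selection into fertility: healthier mothers have fewer births.
    fertility = 2.0 + 0.75 * twin - 0.3 * health + rng.normal(size=n)
    # Child quality depends on fertility (true effect beta) and on health.
    quality = beta * fertility + 0.5 * health + rng.normal(size=n)
    return quality, fertility, twin, health

def slope(y, x, z, control=None):
    """Just-identified IV slope of y on x with instrument z (z = x gives OLS).
    A constant, and optionally one control, is partialled out first."""
    cols = [np.ones_like(y)] if control is None else [np.ones_like(y), control]
    W = np.column_stack(cols)
    resid = lambda v: v - W @ np.linalg.lstsq(W, v, rcond=None)[0]
    y, x, z = resid(y), resid(x), resid(z)
    return (z @ y) / (z @ x)

quality, fertility, twin, health = simulate()
ols_base = slope(quality, fertility, fertility)
iv_base = slope(quality, fertility, twin)
ols_ctrl = slope(quality, fertility, fertility, control=health)
iv_ctrl = slope(quality, fertility, twin, control=health)
print(f"OLS base {ols_base:.3f}  OLS + health {ols_ctrl:.3f}")
print(f"IV  base {iv_base:.3f}  IV  + health {iv_ctrl:.3f}")
```

In this stylised setting the single control removes all of the selection, which is why both estimators converge on the true value; in the paper the available controls are only partial, so the controlled estimates tighten the bounds rather than closing them.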

OLS Estimates
OLS results for both samples are in Table 5. We consistently control for fixed effects for age of the child, age of the mother at birth, and the year of the survey. In the developing country sample we also condition on country fixed effects, and in the US sample on census region and mother's race fixed effects. We additionally show results with birth order controls. The available controls for mother's health are height, BMI and cluster-level health service availability in the developing country sample, and BMI and a self-reported assessment of own health on a Likert scale in the US sample. In both samples, the control for socioeconomic status is years of education of the mother (see Table A2 for summary statistics of these variables) and in the developing country sample we also control for the wealth quintile of the family.
The introduction of observable controls, first for mother's health and then also for her education, progressively reduces the estimated trade-off to nearly half of the initial value in both samples, consistent with negative fertility selection. The adjusted estimates for education in developing countries are between 6.6 and 8.5% of a standard deviation. In the US they are between 1 and 2.5% for education and between 0.3 and 1.7% for health status. The Altonji et al. (2005) statistic for the DHS sample suggests that unobservable characteristics of the mother would need to be about 1 to 1.2 times as important as observables for these estimates of the QQ trade-off to be entirely driven by selection into fertility. The corresponding ratio in the US varies between 1 and 3. In developing countries, the estimated education-fertility trade-off is decreasing in the birth order at which twins (the additional child) occur, i.e. it is largest in the 2+ sample and smallest in the 4+ sample. In the US, the trade-off is similar for the 2+ and 3+ samples and smaller and insignificant in the 4+ sample. However, for health, this "gradient" is reversed and the largest child health-fertility trade-off is in the 4+ sample and the smallest in the 2+ sample. In contrast to the case in , the controls for birth order do not eliminate the trade-off (Appendix Tables A3 and A4).

IV Estimates: Developing Countries
The first stage estimates demonstrate the well-known power of the twin instrument. It consistently passes weak instrument tests (the Kleibergen-Paap rk statistic and its p-value are presented in panel A). The point estimates indicate that the incidence of twins raises total fertility by about 0.7 to 0.8 births. That this estimate is always less than one is in line with other estimates in the twin literature and is evidence of partial reduction of future fertility following twin births (compensating behaviour). Consistent with this, the first stage coefficient is increasing in parity. In panel B, the first column ("Base") for each parity group presents estimates of β1 from equation 2a using the current state-of-the-art twin-IV 2SLS estimator. In each of the three samples, in line with the findings of recent studies (Fitzsimons and Malde, 2014; Åslund and Grönqvist, 2010), we find no significant QQ trade-off. This is not simply because IV estimates are less precise than OLS estimates (as emphasized in ); rather, the coefficients are much smaller.

IV Estimates with the Twin Instrument
Consistent with our hypothesis and the evidence we present in section 5.1 that twin mothers are positively selected on health (and education), we see that upon introducing controls for maternal selectors of twinning, a QQ trade-off emerges in the 3+ and 4+ samples, even though the available controls are almost certainly a partial representation of the range of relevant facets of maternal health stocks, health-related behaviours and environmental influences on foetal health. The bias adjustment is meaningful and statistically significant. In the 3+ sample, the commonly estimated specification produces a point estimate of 2.8% which is not statistically significant, and partial bias adjustment raises this to 4.1% (conditional on maternal health indicators) or 4.6% (if mother's education is also included). In the 4+ sample, the corresponding figures are 2.7% and 3.7%.
While one way to compare the base and full control specifications is to test whether each coefficient differs from zero, an alternative test is to compare the estimated coefficients (and standard errors) to each other. We thus also test each coefficient compared to the "Base" coefficient, and present the p-values of this test as "Coefficient Difference" at the foot of panel B. We can often reject equality of the coefficients in the specifications with and without controls for maternal health. Implementing these tests requires that we take account of the correlations between error terms in each model. In order to do this we replicate IV estimates using GMM, which allows us to estimate models simultaneously and hence compare coefficients across models. Additional details related to this test are provided in Appendix C.
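One way to see what the "Coefficient Difference" test involves is to stack the two just-identified IV models and compute the variance of the difference from their per-observation influence functions, which is the logic of joint (stacked) GMM estimation. The sketch below is an illustration under simplifying assumptions and not the procedure of Appendix C: the function names are hypothetical, standard errors are heteroskedasticity-robust but not clustered by mother, and the simulated data are invented for the example.

```python
import numpy as np
from math import erfc, sqrt

def iv_influence(y, x, z, controls):
    """Just-identified IV of y on x (instrument z) with exogenous controls.
    Returns the coefficient on x and its per-observation influence function."""
    n = len(y)
    ones = np.ones((n, 1))
    X = np.hstack([x[:, None], ones, controls])   # regressors
    Z = np.hstack([z[:, None], ones, controls])   # instruments
    ZX = Z.T @ X
    beta = np.linalg.solve(ZX, Z.T @ y)
    u = y - X @ beta
    # psi_i = (Z'X / n)^{-1} z_i u_i; keep the row for the coefficient on x.
    psi = np.linalg.solve(ZX / n, (Z * u[:, None]).T)[0]
    return beta[0], psi

def coefficient_difference(y, x, z, base_controls, extra_controls):
    """Test H0: the coefficient on x is equal with and without extra controls.
    Differencing the influence functions accounts for the correlation
    between the two estimates (no clustering: an assumption of this sketch)."""
    b_base, psi_base = iv_influence(y, x, z, base_controls)
    full = np.hstack([base_controls, extra_controls])
    b_aug, psi_aug = iv_influence(y, x, z, full)
    diff = b_aug - b_base
    se = sqrt(np.sum((psi_aug - psi_base) ** 2)) / len(y)
    p_value = erfc(abs(diff) / se / sqrt(2))      # two-sided normal p-value
    return diff, se, p_value

# Illustrative data: twins positively selected on maternal health.
rng = np.random.default_rng(0)
n = 50_000
health = rng.normal(size=n)
twin = (rng.random(n) < 0.05 + 0.03 * (health > 0)).astype(float)
fertility = 2.0 + 0.75 * twin - 0.3 * health + rng.normal(size=n)
quality = -0.05 * fertility + 0.5 * health + rng.normal(size=n)

diff, se, p = coefficient_difference(
    quality, fertility, twin, np.empty((n, 0)), health[:, None])
print(f"difference {diff:.3f} (se {se:.3f}), p-value {p:.4f}")
```

Treating the two models as independent would overstate the standard error of the difference, because both estimates are computed from the same observations and their sampling errors are strongly positively correlated.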
IV Estimates: United States The first stage estimates for the US sample (Table 7) are similar to those for the developing country sample, with a twin birth at parity 2, 3 or 4 leading to an additional 0.7 to 0.8 total births. The second stage estimates also follow a similar pattern insofar as the baseline specification indicates no significant relationship between twin-mediated increases in fertility and either the indicator of school progression, or the indicator of child health. However, upon the introduction of controls for maternal health and education, the coefficient describing the QQ trade-off tends to increase in magnitude. In the case of education, it grows more negative in each sample and is statistically significant in the 2+ sample, with a point estimate of 10.2%. When child quality is indicated by health, the point estimate in the 2+ sample remains insignificant but in the 3+ and 4+ samples it grows more negative and in the 3+ sample it is statistically significant at 5.9%. Notice that the USA samples range between about 21,000 and 61,000 individuals while the developing country data samples range between about 260,000 and 400,000, so we have more limited statistical power with the US data. As discussed earlier in this section, it is well recognised that twin-IV estimates are often not precise. So it is quite striking that we find a significant trade-off for education and health.
Overall, partial bias adjustment reveals a statistically significant QQ trade-off for education in the 2+ sample (comprising about 50% of the total sample) and for health in the 3+ sample (comprising about a third of the total sample).
Recent work suggests that focusing on monozygotic (MZ) rather than dizygotic (DZ) twins may resolve issues related to the heritability of twinning and relationships between twinning and some maternal characteristics (Farbmacher et al., 2016). While we cannot observe whether a twin pair are MZ or DZ in either of our data sources, when we use only same sex twins to construct the twin instrument, as they are considerably more likely to be MZ, we observe a similar pattern, where once again estimates diverge from zero and become significant when controls for maternal health are included. Results for the DHS are in Table A8 and for the NHIS in Table A9.

Non-Linear Models
Theoretical statements of the QQ model tend to assume, for simplicity, that all children in a family have the same endowments and receive the same parental investments. More recent work, for example the theoretical work of , and empirical papers by Rosenzweig and Zhang (2009) (2013) relax this assumption. Among other things, this allows for reinforcing or compensating behaviours in parental investment choices . This implies that we should allow the coefficient β 1 to vary across children in the family.
Using DHS data for which we have a sufficiently large sample to split instruments, we re-estimate our regressions following the non-linear marginal fertility models of ; Mogstad and Wiswall (2016). We provide a full discussion of the methodology in Appendix D, and in the analysis below we follow the procedure laid out by Mogstad and Wiswall (2016) precisely. Models of this type loosen the linear marginal effects estimated on fertility, and allow for a one-unit shift in fertility at different birth orders to have potentially varied impacts on existing children.
We report the restricted (linear) and non-restricted (non-linear) IV models in Table 8, and the corresponding first stage results in Appendix D (Appendix Table A14). We report results by the same parity samples as the main IV results presented in Table 6.
In Table 8 we observe, firstly, that as described in Table 6, the linear specifications are universally lower, and often become statistically distinguishable from zero when partially controlling for the selection of twins as compared to the baseline estimate not controlling for twin selection. These results only differ from those reported earlier in that we now restrict the sample to families with 6 children or fewer in line with results reported in Mogstad and Wiswall (2016), which involves a loss of between 5 and 18 percent of the sample depending on the parity sample used. For full descriptives on family size in each parity group refer to Appendix Figure A7. Turning to panel B, we observe a similar non-linear dynamic as that reported in Mogstad and Wiswall (2016). For example, in the two-plus sample, we observe that the twin instrumented estimate of the effect of moving from one to two siblings is large and positive, while the impact of moving from two to three siblings is large and negative. However, most interestingly for the present analysis, the non-linear impacts are nearly universally larger in absolute terms when partially controlling for twin selection. As was the case with the linear model, we observe that the marginal fertility effects become nearly everywhere more negative, and in certain cases become statistically different from zero. Thus, our finding that the twin-IV estimator tends to under-estimate the causal effect of fertility on child human capital holds in the linear and non-linear specifications.

IV effect sizes in perspective
Since the QQ trade-off has been called into question, it is important to consider the size of the partially-bias-adjusted estimates and not just their sign and statistical significance. Our results (in the linear model) imply that an additional birth in a family is associated with 0.17 fewer years of completed education (developing countries) or 0.22 fewer grades progressed (USA). In a widely cited study, Jensen (2010) shows that providing students with information on the returns to secondary school in their area led, on average, to their completing 0.20-0.35 more years of school over the next four years. In a similarly high-profile experiment, Baird et al. (2016) find that de-worming in school led to an increase of 0.26 years of schooling and Bhalotra and Venkataramani (2013) find that a 1 s.d. decrease in under-5 diarrheal mortality (11 deaths per 1000 live births) is associated with girls growing up to achieve an additional 0.38 years of schooling, while both studies find no increase in school years for boys. Almond (2006) finds that foetal exposure to influenza in 1918 was associated with 0.126 years (1.5 months) less schooling at the cohort-level and  show that exposure to antibiotic-led reductions in pneumonia in infancy resulted in individuals completing 0.7 additional years of education in adulthood relative to unexposed cohorts. The PROGRESA cash transfer in Mexico is estimated to have generated a 0.66 increase in years of schooling (Schultz, 2004).
If we consider grade retention in the US, our estimates suggest that an additional birth results in 0.22 fewer years completed. This is of similar magnitude to estimates of the effect of an additional 1,000 grams of birthweight over the normal birthweight range (a 0.31 increase in years of schooling) in Royer (2009), and estimates of the impact of historical exposure to high rather than low malaria rates (a 0.4 year reduction) in Barreca (2010). Turning to the effects on health, we find that an additional birth (at order 3 or 4) reduces the likelihood that siblings are in excellent health by between 3-6%. Almond and Mazumder (2005) document that in the long-run, the 1918 influenza pandemic increased the likelihood of being in poor or fair health (the inverse of our health measure) by 10%. Overall, the adjusted estimates are of a size that it is not prudent to dismiss. Moreover, our estimates indicate the change in investment (education or health) for one additional birth but, as fertility rates remain high in many developing countries, the total effect can be large.

Generalised Bounds
The adjusted twin-IV results will not provide consistent estimates of β1 as there are almost certainly omitted indicators of maternal health. Although documenting that observable measures of health (which also impact child quality) are correlated with the instrument does not prove instrumental invalidity, it does suggest that it is highly likely that similar non-observable factors will also be correlated, thus resulting in invalidity. A recent study proposes a formal test of instrument invalidity (Kitagawa, 2015). Using the 2+ sample for the DHS data this test rejects the validity of the twin instrument (see Appendix Figure A8 and Table A10); however, this test is sensitive to curse of dimensionality considerations, and so to implement it we had to simplify the specification of controls. 15 We do not report results for the NHIS data because the sample is too small to obtain informative confidence intervals.
Rather than discard the twin-IV estimator altogether, we harness its power in predicting fertility using IV bounds to assess the empirical significance of the omitted variables. As outlined in section 3.2, we begin by estimating Nevo and Rosen (2012) bounds. These are based on the assumptions that twins are positively selected and fertility is negatively selected.
Evidence for both of these assumptions is in Tables 5 and 6-7 where it is observed that controlling for education and health results in the OLS coefficients on fertility growing less negative and the IV coefficients on twins growing more negative. It is further assumed that twins is a less endogenous variable than fertility. The bounds are in Table 9 (columns 2-3; IV point estimates are presented for comparison in column 1). These estimates provide a lower bound on the QQ parameter estimated in Tables 6-7 of approximately 5-8% of a standard deviation across the DHS and NHIS samples. 16 As discussed in section 3.2, the upper bound in Nevo and Rosen's bounding procedure is determined by the upper bound of the 95% confidence interval of the original twin IV estimates. As such, estimates which are not significant at 95% confidence levels in Tables 6-7 will once again be non-informative when using the Nevo and Rosen (2012) procedure.
In order to gain additional precision in bounds estimates at the upper bound, we also estimate Conley et al. (2012) bounds. As discussed, we need to define a prior belief over the sign and magnitude that the coefficient on twin birth (γ) would take in equation 4. To begin, we assume a range of values for γ from 0 to 0.05, or 5% of a standard deviation, in which case instrument validity is violated, and having a twin mother has a positive effect on child quality conditional on fertility. The results are in Figure 2 for developing countries and Figure 3 for the US for the 3+ samples; results for the 2+ and 4+ samples are in Appendix Figures A9 and   A10. We assume γ ∼ U (0, δ) with δ displayed on the x-axis. Thus, when δ = 0, γ is exactly 0, and the bounds collapse to the 95% confidence interval for the traditional IV estimate.
Given that twin IV estimators tend to produce wide confidence intervals , Conley et al. (2012) bounds will also tend to be wide. As δ increases, the violation of the exclusion restriction increases. We observe, firstly, a widening of the estimated bounds as the size of the violation increases, 17 and secondly that the upper bound becomes increasingly negative, moving in the direction of finding a QQ trade-off. 18 In both figures the vertical red line displays our preferred estimate for γ, the estimation of which we discuss further below.
For developing countries and for the US (when the outcome is a measure of child health, but not for education, where the estimates are considerably less precise) we observe baseline IV results with bounds that are not informative of the sign of the trade-off when the exclusion restriction is assumed to hold exactly. However, as γ grows, the bounds do quickly become informative, suggesting that with a γ as low as 0.002 in the US or 0.008 in developing countries, a significant QQ trade-off emerges. While using an interval of values for γ has the advantage of being unrestrictive (0.05 is a very large value for the exclusion restriction), the bounds are quite wide.
With a view to improving the precision and relevance of these bounds, we estimate rather than assume γ, the measure of the extent of the violation of the exclusion restriction. This is (as usual) the product of two relationships which, here, are the relationship between the probability of a twin birth and maternal health, and the relationship between maternal health and investments in children. The data requirements for this are non-trivial: we need data on two generations, with an exogenous shock to maternal health in the first generation, and measures of child quality in the second generation. For this, we exploit natural experiments in the US and Nigeria. This is in line with Conley et al. (2012) who illustrate their estimator with examples involving back of the envelope calculations of γ for each case. In Appendix E we detail how we leverage two historical natural experiments involving a shock to the health of women, namely, the Biafra war in Nigeria and the introduction of the first antibiotics to the US, to estimate γ. 19 We also conduct a number of back of the envelope plausibility tests. In general these suggest that γ is around 0.004-0.006, or that having a (positively-selected) twin mother has a direct effect of around 0.4 to 0.6% of a standard deviation in quality outcomes.

17 As Conley et al. (2012) discuss, the degree of failure of the exclusion restriction is analogous to sampling uncertainty related to the IV parameter β1. As the exclusion restriction is increasingly relaxed, the "exogeneity error" (in Conley et al.'s terminology) related to the instrument inflates the traditional variance-covariance matrix.
As we outline at more length in Appendix E.1 and E.2, the generation of this estimate for γ is based on particular shocks which impact maternal health. We present evidence supporting the assumptions for these estimates in Appendix E.3, and put these in the context of Conley et al.'s methods in Appendix E.4. These reduced form estimates of γ based on exogenous events provide a well founded estimate to use in the Conley et al. (2012) procedure, but one may be concerned about external validity of these estimates, given that they are derived from 1930s America (sulfa drugs) and 1970s Nigeria (Biafra war). We can, however, show that estimates of γ from contemporary DHS data (which are used in the main analysis and hence relevant for estimates of γ) are in fact of the same order of magnitude as our estimates from America and Nigeria. Consider Appendix Table A11, which shows that a one standard deviation change in maternal BMI is associated with a 0.070 s.d. increase in the child's educational Z-score (column 4). We observe that twin mothers in the same data sample have BMI 0.050 s.d. higher than non-twin mothers. Scaling (multiplying) this by the estimated association (0.070×0.050) produces an estimate of γ (a measure of the violation in the exclusion restriction, or the twin-mediated effect of maternal BMI on child outcomes) of 0.0035 s.d. This is of the same order of magnitude as the value of γ that we estimate from the Biafra (0.0040) and Sulfa (0.0062) case studies. We can calculate a range of such estimates using education and height (as well as BMI), and find values of 0.0026 for education (0.215×0.0121) and 0.00196 for height (0.019×0.103). Importantly, all of these values fall within the estimated distributions of γ used to calculate Conley et al. bounds, displayed in Appendix Figure A13.
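The back of the envelope calculation of γ is a product of two numbers, and can be reproduced directly from the pairs of factors quoted above. The short sketch below only multiplies those quoted figures; the helper name `implied_gamma` is an assumption introduced for illustration.

```python
def implied_gamma(factor_a, factor_b):
    """Back of the envelope gamma: the product of the association between a
    maternal characteristic and child quality, and the twin/non-twin gap in
    that characteristic (both in standard deviation units)."""
    return factor_a * factor_b

# Pairs of factors as quoted in the text.
quoted_factors = {
    "BMI": (0.070, 0.050),
    "education": (0.215, 0.0121),
    "height": (0.019, 0.103),
}

for name, (a, b) in quoted_factors.items():
    print(f"{name}: gamma = {implied_gamma(a, b):.5f} s.d.")
```

All three products are of the order of a few thousandths of a standard deviation, which is the comparison with the Biafra (0.0040) and sulfa (0.0062) estimates that the paragraph draws.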
It is important to note that while the degree of the violation of the exclusion restriction is estimated to be relatively small (at 0.4 or 0.6 percent of a standard deviation of the child quality measure, education), γ is obtained after scaling the estimated impact of maternal health shocks/characteristics (that predict twinning) on the final outcomes of interest (child quality indicators). The scaling factor is the difference in the maternal health indicator between mothers of twins and mothers of singletons: this is 0.050 in the BMI example above, 0.125 in the sulfa experiment, and 0.267 in the Biafra experiment (all figures from Appendix Table A15). Thus γ is in fact much smaller than the measure of violation that is of interest.
Using these estimates of γ, we are able to pin down the bounds described in Figures 2-3. See Table 9, columns 4-5, where we present the UCI approach in which we assume that γ ∈ [0, 2γ̂]. This assumption is chosen such that the estimate γ̂ described in Appendix E in each case lies precisely in the middle of this interval, following the empirical example of Conley et al. (2012). For the LTZ approach, we use estimates of both γ and its distribution, which allow for uncertainty in our estimates of γ, and assume that γ is distributed precisely according to the estimated empirical distribution (refer to Appendix E.5).
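The two approaches can be sketched for the simplest case of one endogenous regressor and one instrument. This is a minimal illustration and not the estimation code behind Table 9: the function names and simulated data are assumptions, there are no controls or clustering, and the LTZ step uses a normal approximation with a given prior mean and standard deviation for γ rather than the estimated empirical distribution used in the paper. In this just-identified case, imposing a direct effect γ0 of the instrument shifts the IV estimate by exactly −γ0/π, where π is the first-stage coefficient.

```python
import numpy as np

def iv_just_identified(y, x, z):
    """IV estimate of y = beta*x + u with a single instrument z.
    Returns (beta, heteroskedasticity-robust se, first-stage coefficient)."""
    y, x, z = (v - v.mean() for v in (y, x, z))   # demeaning absorbs the constant
    zx = z @ x
    beta = (z @ y) / zx
    u = y - beta * x
    se = np.sqrt(np.sum((z * u) ** 2)) / abs(zx)
    return beta, se, zx / (z @ z)

def uci_bounds(y, x, z, gamma_max, n_grid=21, crit=1.96):
    """Union of confidence intervals over gamma in [0, gamma_max]."""
    lows, highs = [], []
    for g in np.linspace(0.0, gamma_max, n_grid):
        beta, se, _ = iv_just_identified(y - g * z, x, z)  # net out direct effect
        lows.append(beta - crit * se)
        highs.append(beta + crit * se)
    return min(lows), max(highs)

def ltz_interval(y, x, z, gamma_mean, gamma_sd, crit=1.96):
    """Local-to-zero interval under a normal prior on gamma (an assumption)."""
    beta, se, pi = iv_just_identified(y, x, z)
    centre = beta - gamma_mean / pi                 # shift by the prior mean
    total_se = np.sqrt(se ** 2 + (gamma_sd / pi) ** 2)  # add prior uncertainty
    return centre - crit * total_se, centre + crit * total_se

def simulate(n=20_000, beta=-0.05, gamma=0.005, seed=1):
    """Illustrative data in which the instrument has a small direct effect."""
    rng = np.random.default_rng(seed)
    z = (rng.random(n) < 0.05).astype(float)
    x = 2.0 + 0.75 * z + rng.normal(size=n)
    y = beta * x + gamma * z + rng.normal(size=n)
    return y, x, z

y, x, z = simulate()
gamma_hat = 0.005                                   # hypothetical preferred value
print("UCI:", uci_bounds(y, x, z, 2 * gamma_hat))
print("LTZ:", ltz_interval(y, x, z, gamma_hat, 0.002))
```

Because the UCI interval always contains the conventional confidence interval (the case γ = 0), it can only widen as the assumed support grows, whereas the LTZ interval is re-centred on the prior mean and is typically tighter, which is the sense in which the text describes it as more efficient.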
Our preferred bounds estimates are those in the right-hand columns of Table 9, as these are more efficient, being based on the estimated bootstrap distribution. For the developing country sample, estimates of the QQ trade-off in determining educational attainment, in the 3+ and 4+ samples, are bounded between slightly less than zero and 6% of a standard deviation and the mid-point of these bounds falls at 2.6% and 3.7% of a standard deviation respectively. An additional sibling thus does appear to depress a child's educational attainment.
For the US sample we see that while the mid-point of the bounds is virtually always negative (health in the 2+ group is the only exception), the bounds are most informative for the 2+ (education) and 3+ (health) samples. These indicate that an additional birth reduces the grade progression of an older sibling by 16.6% of a s.d. (upper bound), or 8.3% of a s.d. (mid point), and their likelihood of being reported as being in excellent health by 7% (upper bound) or 3.5% (mid point).
In Figure 4 we plot the estimated coefficients and bounds for the developing country sample altogether so they are readily compared. The corresponding plot of all estimates for the (much smaller) US sample is in Appendix Figure A11. The figures show the OLS and IV estimates, with base controls, health, and health and socioeconomic controls and we show the Nevo-Rosen and Conley et al. bounds for each of the 2+, 3+ and 4+ groups. The informativeness of the bounds is evaluated against the criteria laid out by Hotz et al. (1997): firstly, do the bounds enable us to determine if the effect is negative or positive; secondly, can we reject the point estimates of linear IV; and thirdly, do our bounds allow us to reject the OLS estimate of the causal effect. In general, for the 3+ and 4+ samples in the DHS data, the bounds are informative of the (negative) sign of the trade-off, but not for the 2+ sample. In terms of the second and third criteria, we can never exclude the point estimate of the original IV estimate from our bounds; however, we often can reject the original OLS estimate, which is important given recent evidence that many IV estimates are inaccurate, and frequently include OLS point estimates in their confidence intervals (Young, 2018).
Using summary statistics from Table A2, we can convert standardised estimates from these bounds into years of education. The effect on education of first and second-borns from having a fertility shock at the third birth, or on first to third-borns from a fertility shock at the fourth birth is estimated to be approximately 5% of a standard deviation in the developing country sample. 20 Using the standard deviation in the sample of 3.8 years, this implies an average effect of around 0.19 years of education per additional sibling at the age of 13 years (the average age in the sample). In the case of the US estimates, for the same 2+ and 3+ groups the average estimated effect based on the midpoint of bounds estimates is 8% of a standard deviation in grade retention, which equates to a marginal effect of 0.22 years of education by the age of 11 years. On average the likelihood of being reported as being in excellent health falls by 4.2% according to the midpoint of bounds following an additional birth among the same group. Overall, these are quite large effects relative to the marginal effects of different policy interventions considered in the literature (see section 5.2.4).
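The conversion from standardised effects to years of education is a single multiplication by the sample standard deviation. The sketch below reproduces the developing country figure from the numbers quoted above; the helper name is an assumption, and the US standard deviation is not stated in this section, so the last line merely backs out the value implied by the quoted 8% and 0.22 figures.

```python
def sd_to_years(effect_in_sd, sd_in_years):
    """Convert an effect in standard deviation units into years of education."""
    return effect_in_sd * sd_in_years

# Developing country sample: 5% of a s.d., with a s.d. of 3.8 years.
print(round(sd_to_years(0.05, 3.8), 2))    # about 0.19 years

# Range implied by effects of 4% to 5% of a s.d.
print(round(sd_to_years(0.04, 3.8), 2), round(sd_to_years(0.05, 3.8), 2))

# US: standard deviation implied by 8% of a s.d. equalling 0.22 years
# (a derived quantity, not a figure reported in the text).
print(round(0.22 / 0.08, 2))
```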

Conclusion and Discussion
This paper demonstrates that twin-IV estimates of the fertility-human capital trade-off tend to be biased downward on account of positive selection of women into twin birth, a problem that has not been previously recognized. We show that even partially correcting for twin endogeneity is sufficient to push estimates of the trade-off up by about 3%-5% of a standard deviation. Using partial identification to bound the effect of child quantity on child quality suggests that the true effect size may be as high as 8% of a standard deviation, though it is typically centered around 3%-5% of a standard deviation.
We conclude that additional unexpected births do have quantitatively important effects on their siblings' educational outcomes. The estimated 4%-5% of a standard deviation increase is equivalent to an additional 0.15 to 0.19 years in the classroom in the developing country sample, and estimates of approximately 8% of a standard deviation in the US account for 0.22 more grades progressed on average. As detailed in the Introduction, the implications of these findings are far-reaching, not only in terms of vindication of Beckerian theory but because they guide fertility control policies.

20 This estimate is the average midpoint of the bound estimates from the three-plus and four-plus samples in Table 9 and can be calculated as:
Any human capital costs of fertility are naturally of greater concern not only when fertility is high but also when a large share of it is unwanted. In 2015 the average number of births per woman in low income countries was five and, comparing actual with stated desired fertility, we estimate the share of unwanted births is as high as 60 per cent in some countries, with

Table 2 and include a full set of country and year of birth fixed effects. Home birth and antenatal visits are recorded only for children aged 0-4 at the time of the survey, and the standardised education score is recorded only for children aged 6-18 (of school age). Additional notes are available in Table 2. DHS sample weights are used, and standard errors are clustered by mother. * p<0.1; ** p<0.05; *** p<0.01

OLS regressions described in equation 1 are presented using developing country (DHS) and US (NHIS) data. The 2+, 3+ and 4+ samples are defined in the estimation sample section of the paper (section 4). Base controls consist of fixed effects for child's age and year of birth, child gender, mother's age at birth, and a cubic for mother's age at time of survey. For the USA sample, mother's race fixed effects are included. For DHS data, country fixed effects are also included. Additional socioeconomic controls consist of mother's education and (for DHS data) wealth quintile fixed effects, and health controls include a continuous measure of mother's BMI, and for DHS, mother's height and coverage of prenatal care at the level of the survey cluster. For USA data, we include controls for mother's self-assessed health on a Likert scale. Standard errors are clustered by mother. * p<0.1; ** p<0.05; *** p<0.01

in families with at least two births. 3+ refers to first- and second-borns in families with at least three births, and 4+ refers to first- to third-borns in families with at least four births. Panel A presents the first-stage coefficients of twinning on fertility for each group. Base controls consist of child age and mother's age at birth fixed effects plus country and year-of-birth FEs. Additional socioeconomic controls consist of mother's education and wealth quintile fixed effects, and health controls include a continuous measure of mother's height and BMI and coverage of prenatal care at the level of the survey cluster. In each case the sample is made up of all children aged between 6-18 years from families in the DHS who fulfill 2+ to 4+ requirements. In panel B each cell presents the coefficient of a 2SLS regression where fertility is instrumented by twinning at birth order two, three or four (for 2+, 3+ and 4+ groups respectively). The rk test statistic and corresponding p-value reject that the twin instruments are weak in each case. Coefficient Difference in Panel B refers to a test that the coefficient estimate on Fertility in a given model is identical to the estimate on Fertility in the base case. This test takes account of the correlation between errors in the base and augmented regression model (in the spirit of seemingly unrelated regressions), but is estimated by GMM to house the IV models estimated here. Low p-values are evidence against equality of the two estimates. Standard errors are clustered by mother. * p<0.1; ** p<0.05; *** p<0.01

Table 6 and are described in notes to Table 6. This table presents the same regressions, now using NHIS survey data (2004-2014). Base controls include child age FE (in months), mother's age and mother's race FEs. Additional socioeconomic controls consist of mother's education fixed effects, and health controls include a continuous measure of mother's BMI, and a Likert scale measure of a mother's self-assessed health. In each case the sample is made up of all children aged between 6-18 years from families in the NHIS who fulfill 2+ to 4+ requirements for schooling variables, and of children aged between 1-18 years for health variables. The first stage results and tests of instrument strength are displayed for the regression using the education sample only. Qualitatively similar results are observed for the health sample. A description of the Kleibergen-Paap statistic and Coefficient Difference are provided in notes to

Each column and panel presents a separate regression using DHS data. Siblings ≥ 2 refers to the marginal effect of moving from 1 to 2 siblings, Siblings ≥ 3 refers to moving from 2 to 3 siblings, and so forth. Each model includes maternal age, country, survey year and child age fixed effects as well as child's gender. The regressions in columns 2, 4 and 6 are augmented with all socioeconomic and health controls described in Table 5 of the paper. Standard errors are estimated using a block bootstrap sampling each family with replacement, and for each bootstrap replication both the regression and the constructed instruments are reestimated. First stage regressions are displayed in Table A14.

Nevo and Rosen (2012) bounds are based on the assumption that twinning is positively selected and fertility is negatively selected, and twins are "less endogenous" than fertility. Conley et al. (2012) bounds are estimated as described in section 3.2 under various priors about the direct effect that being from a twin family has on educational outcomes (γ). In the UCI (union of confidence interval) approach, it is assumed the true γ ∈ [0, 2γ̂], while in the LTZ (local to zero) approach it is assumed that γ follows the empirical distribution estimated in each case. The preferred prior for γ (γ̂) and its distribution is discussed in Appendix E, and estimates for γ are provided in Table A15. Comparisons under a range of priors are presented in Figures 2-3. Each estimate is based on the specifications with full controls from Tables 6 and 7.

(c) Four-Plus Group
Note to Figure A7: Histograms display the total family size of families meeting inclusion criteria for each estimation sample (two-plus, three-plus, and four-plus). By definition, the two-plus sample only includes families with at least two births, the three-plus sample only includes families with at least three births, and the four-plus sample only includes families with at least four births.

Notes: Individual sources are discussed further in the body of the text. Estimates reported in each study are presented along with their standard errors in parentheses. Parentheses marked with † contain the t-statistic rather than the standard error.

with standard deviation below in parentheses. Education is reported as total years attained, and Z-score presents educational attainment relative to birth and country cohort for DHS, and birth quarter cohort for NHIS (mean 0, standard deviation 1). Infant mortality refers to the proportion of children who die before 1 year of age. Maternal height is reported in centimetres, and BMI is weight in kilograms over height in metres squared. For a full list of DHS countries and years of survey, see Appendix Table A12.

Notes: Full output is presented from IV regressions displayed in Table 6 on health and socioeconomic controls from models denoted "+H" (adding health controls) and "+S&H" (adding health and socioeconomic controls). Additionally, fixed effects for years of education of the mother are included in regressions though are not displayed in the interests of space. These fixed effects show a positive gradient, with higher education associated with additional child education. Full notes are available in Table 6.

Notes: Full output is presented from IV regressions displayed in Table 7 on health and socioeconomic controls from models denoted "+H" (adding health controls) and "+S&H" (adding health and socioeconomic controls). Additionally, fixed effects for years of education of the mother are included in regressions though are not displayed in the interests of space. These fixed effects show a positive gradient, with higher education associated with additional child education. Full notes are available in Table 7.

Kitagawa (2015), and the second row displays the p-values associated with the Kitagawa test. Baseline controls consist of mother year of birth fixed effects, continent fixed effects, child sex, and decade of birth fixed effects. Socioeconomic controls add indicators for mother's education (0 years, 1-6 years, 7-11 years, or 12+ years), and Health controls add indicators for overweight or underweight mothers, and whether the majority of births in the mother's region were attended by doctors, nurses or unattended. A trimming constant of 0.07 is used for the instrumental validity test (as laid out in Kitagawa (2015)), and 500 bootstrap replications are run to determine the p-value.

The dependent variable in each model is each child's standardized educational attainment compared with his/her cohort. Table headers (Z-Scores and Continuous Variables) refer to the form of the independent variables at the level of the mother. All regressions are estimated by OLS, with standard errors clustered by mother. Each specification includes fixed effects for mother and child age, total fertility, country and year, and family wealth quintile. In columns 5-7 maternal year of education fixed effects are included. Columns 1-4 use standardised variables for education, BMI and height, where Z-scores are constructed comparing each mother to those in her country and survey wave.

B Data Definitions
All outcome and control variables used in principal IV and OLS analyses are described in the following table. As well as variable definitions, units and any functional forms are indicated, which refer to the way variables enter IV or OLS models.

C Testing for Equality of Coefficients Between IV Models
When estimating successive IV models with the progressive inclusion of controls to capture maternal selection, our main point is that column 1 ("Base") is not distinguishable from 0, while column 3 ("+S&H") often is. This is the comparison that matters when considering the literature and when showing that partial bias adjustment recovers the trade-off. We have nevertheless added a formal test of coefficients between IV models in all IV tables. This is added as a row called "Coefficient Difference" at the bottom of Tables 6 and 7. This computation is not entirely trivial, as these tests must take account of correlations between the variance-covariance matrices of each IV regression in the style of seemingly unrelated regression. Thus, we calculate these test statistics by jointly estimating the models with GMM (seemingly unrelated regression is a Feasible Generalised Least Squares technique, and hence not suitable for IV models). To do this we form two equations, one for each of the two models we wish to compare. Our goal is to test the equality of coefficients $b_1 = c_1$. Given that we are using instruments for endogenous quantity (fertility) in each case, we can form population moment conditions, one per model, which hold under the null of instrumental validity in each case (i.e., they replicate the specifications we are estimating in the paper). Using the sample analogues of these moments, we can then estimate the parameters b and c via GMM. Denoting the two moments as the 2-element vector $g(\hat b, \hat c)$, we estimate the parameters b and c using the GMM objective function $J(\hat b, \hat c) = n\, g(\hat b, \hat c)'\, W\, g(\hat b, \hat c)$.
An unadjusted weight matrix is used which assumes that the moment conditions are independent. This replicates all parameters and standard errors from the original IV models, but the estimates can now be formally tested for equality against one another using a $\chi^2$ test which also considers the correlation between the observations in the two models when estimating the eventual variance-covariance matrix.
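The logic of the Coefficient Difference test can be sketched in a few lines of Python. This is a minimal illustration under strong simplifying assumptions, not the estimation code used for the tables: there is one instrument, one endogenous regressor and a single control `h`, the data are simulated, standard errors are not clustered by mother, and all function and variable names are hypothetical. Because both models are just-identified, the stacked GMM estimates coincide with the separate IV estimates, and the joint variance follows from stacking the per-observation moment contributions of the two models.

```python
import math
import random


def _demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def _residualize(v, h):
    """Residuals from an OLS regression of v on a constant and one control h."""
    v_c, h_c = _demean(v), _demean(h)
    slope = sum(a * b for a, b in zip(h_c, v_c)) / sum(a * a for a in h_c)
    return [a - slope * b for a, b in zip(v_c, h_c)]


def _iv(y, x, z):
    """Just-identified IV slope and per-observation influence terms.

    Inputs must already be purged of exogenous controls (demeaned or residualized).
    """
    n = len(y)
    s_zx = sum(a * b for a, b in zip(z, x)) / n
    slope = sum(a * b for a, b in zip(z, y)) / n / s_zx
    psi = [zi * (yi - slope * xi) / s_zx for zi, yi, xi in zip(z, y, x)]
    return slope, psi


def coefficient_difference(y, x, z, h):
    """Test b1 = c1, where b1 is the base IV slope and c1 the IV slope controlling for h."""
    b1, psi_b = _iv(_demean(y), _demean(x), _demean(z))
    c1, psi_c = _iv(_residualize(y, h), _residualize(x, h), _residualize(z, h))
    n = len(y)
    # The variance of (b1 - c1) uses the joint distribution of both moment sets,
    # so the covariance between the two estimators is accounted for.
    var_diff = sum((pb - pc) ** 2 for pb, pc in zip(psi_b, psi_c)) / n ** 2
    wald = (b1 - c1) ** 2 / var_diff
    p_value = math.erfc(math.sqrt(wald / 2.0))  # upper tail of a chi-squared(1)
    return b1, c1, wald, p_value


def simulate(n=20000, seed=1):
    """Simulated data (an assumption): twins are more likely for healthier mothers."""
    rng = random.Random(seed)
    y, x, z, h = [], [], [], []
    for _ in range(n):
        health = rng.gauss(0, 1)
        unobserved = rng.gauss(0, 1)
        twin = 1.0 if rng.random() < (0.3 if health > 0 else 0.2) else 0.0
        fertility = 2 + twin - 0.5 * health - 0.5 * unobserved + rng.gauss(0, 1)
        quality = -0.5 * fertility + health + unobserved + rng.gauss(0, 1)
        y.append(quality); x.append(fertility); z.append(twin); h.append(health)
    return y, x, z, h


if __name__ == "__main__":
    b1, c1, wald, p = coefficient_difference(*simulate())
    print(f"base IV: {b1:.3f}, augmented IV: {c1:.3f}, Wald: {wald:.1f}, p: {p:.3g}")
```

In this simulated design the true effect of fertility is -0.5, and omitting the health control pulls the base IV estimate towards zero, which mirrors the direction of bias discussed in the paper.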

D Loosening the Linear Effect Specification of the Q-Q Trade-off
Theoretical statements of the QQ model tend to assume, for simplicity, that all children in a family have the same endowments and receive the same parental investment. More recent work (for example the theoretical work of  and empirical papers by ; Mogstad and Wiswall (2016);  relax this assumption. Among other things, this allows for reinforcing or compensating behaviours in parental investment choices . This implies allowing the coefficient β 1 to vary across children in the family. 1 Using DHS data for which we have sufficient power to split instruments, we re-estimate our regressions following the non-linear marginal fertility models of ; Mogstad and Wiswall (2016), and find that as is the case with the linear models reported in Tables 6 and 7, the inclusion of twin predictors nearly universally increases the size of the estimated QQ trade-off in non-linear models, and in some specifications, the trade-off is statistically significant. Thus the emergence of a trade-off following partial correction for twin non-randomness is not sensitive to functional form and, in particular, holds when the impact of fertility is allowed to vary by parity.
As laid out in Mogstad and Wiswall (2016), this consists of the following 2SLS procedure (for the two-plus sample):

$$\text{quantity}_{sj} = \lambda_{s2}\,\text{twin}_{2j} + \lambda_{s3}\,\text{twin}^*_{3j} + \lambda_{s4}\,\text{twin}^*_{4j} + \lambda_{s5}\,\text{twin}^*_{5j} + X\lambda_{Xs} + \nu_{sj}, \quad \text{for } s = 2, 3, 4, 5 \qquad \text{(A5)}$$

$$\text{quality}_{ij} = \beta_0 + \beta_{1,2}\,\text{quantity}_{2j} + \beta_{1,3}\,\text{quantity}_{3j} + \beta_{1,4}\,\text{quantity}_{4j} + \beta_{1,5}\,\text{quantity}_{5j} + X\beta_X + \eta_{ij}, \qquad \text{(A6)}$$

where (A5) is a series of first stages for the likelihood of moving from the $s$th to the $(s+1)$th child, and (A6) is the second stage estimate of the effect of an additional child after $s$ births on the human capital of the first born child. The second subscript on $\beta_1$ indexes the margin $s$, so that the effect of fertility is allowed to differ at each parity. As the estimation sample consists of families with at least two births, $\text{twin}_{2j}$, a binary variable for a twin at the second birth, is defined for all families. However, when moving to higher birth orders, $\text{twin}_{3j}$ is not defined for families with only two births. We thus follow Mogstad and Wiswall (2016) in replacing higher-order twin birth indicators with adjusted indicators $\text{twin}^*_{cj}$ constructed using $\hat{E}[\text{twin}_{cj} \mid X_j, c_j \geq c]$, which, as described in Mogstad and Wiswall (2016), is a non-parametric estimate of the conditional mean of the probability of twin birth in the non-missing subsample. We similarly follow Mogstad and Wiswall (2016) in considering family sizes up to 6 children. The above specification (A5 and A6) is estimated for the two-plus sample. We also estimate analogous specifications for the three-plus and four-plus samples, where in each case we only consider the marginal impacts of fertility at birth orders greater than the birth orders of the children included in the estimation sample.
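One plausible reading of the adjusted instruments can be sketched as follows. The exact formula is not reproduced in this appendix, so the construction below is an assumption: for families that reach birth order c, the twin indicator is centred on its mean within a discrete covariate cell (a simple non-parametric estimate of the conditional mean), and for families that never reach birth order c the instrument is set to zero. The data layout and the name `adjusted_twin_instrument` are hypothetical.

```python
from collections import defaultdict


def adjusted_twin_instrument(families, c):
    """Return one instrument value per family for a twin birth at birth order c.

    families: list of dicts with keys
        'births' (number of births), 'cell' (a discrete covariate cell), and
        'twin'   (dict mapping birth order -> 0/1 for births that occurred).
    """
    # Cell means of the twin indicator, using only families observed at order c.
    totals = defaultdict(lambda: [0.0, 0])
    for fam in families:
        if fam["births"] >= c:
            acc = totals[fam["cell"]]
            acc[0] += fam["twin"][c]
            acc[1] += 1

    instrument = []
    for fam in families:
        if fam["births"] >= c:
            total, count = totals[fam["cell"]]
            instrument.append(fam["twin"][c] - total / count)  # centred indicator
        else:
            instrument.append(0.0)  # undefined indicator replaced by zero
    return instrument


example = [
    {"births": 2, "cell": "low", "twin": {2: 0}},
    {"births": 3, "cell": "low", "twin": {2: 0, 3: 1}},
    {"births": 3, "cell": "low", "twin": {2: 1, 3: 0}},
    {"births": 4, "cell": "high", "twin": {2: 0, 3: 1, 4: 0}},
    {"births": 3, "cell": "high", "twin": {2: 0, 3: 1}},
]
twin_star_3 = adjusted_twin_instrument(example, c=3)
```

Under this reading the adjusted instrument has mean zero within each cell among families observed at birth order c, so families with missing indicators contribute no variation to the first stage.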
As our interest is in examining the impact of non-random twin births, we estimate the above specifications in two circumstances: the first, following exactly the procedure laid out in Mogstad and Wiswall (2016) where twins are assumed to be exogenous, and the second where we additionally control for observable health and socioeconomic predictors of twins in A5 and A6. These results are presented and discussed in Section 5.2.3 of the paper.

E Estimating Values for γ
We propose a number of methods of arriving at a non-arbitrary prior regarding γ in the Conley et al. (2012) method, where γ is the violation of the exclusion restriction when using twins as an instrument in the QQ model. From equation 4, γ represents the conditional effect of being born of a twin mother on child quality. In practice, bounds identification based on γ only pushes the identification problem back by one step, as consistent bounds rely on having an unbiased estimate of γ, which is not trivial. In this appendix we first discuss a proposed manner to causally estimate γ, and then present a number of consistency checks based on the data used in the QQ models of the paper which support the estimated values of γ.

So as to obtain a consistent estimate of γ, albeit from different samples, we exploit quasi-experimental changes in maternal health ($\text{health}_j$) and use these to obtain consistent estimates of the impact of maternal health on (a) child quality and (b) the probability of a twin birth. We then 'scale' the first by the second. First, we estimate

$$\left.\frac{\partial\,\text{quality}_{ij}}{\partial\,\text{health}_j}\right|_X = \phi_q .$$

Under the assumption that the change in health is quasi-experimental, this is a causal estimate of the effect of a 1 unit change in $\text{health}_j$ on child quality. Since γ is the effect of maternal health scaled by the difference in health between twin and non-twin mothers, we also estimate this difference,

$$\phi_t = \left.\frac{\partial\,\text{health}_j}{\partial\,\text{twin}_j}\right|_X .$$

With these two parameters in hand, we obtain a causal estimate of γ as $\hat\gamma = \hat\phi_q \times \hat\phi_t$. As it involves the estimated quantities $\hat\phi_q$ and $\hat\phi_t$, $\hat\gamma$ will be subject to sampling uncertainty. Thus, the estimate $\hat\gamma$ will have a distribution. If we can estimate both γ and its distribution, this gives us the consistent prior for the full distribution of γ required in Conley et al.'s LTZ approach. We estimate the distribution using resampling (bootstrap) methods, with which we can compare the estimated distribution with a series of known distributions 2 , or indeed use the estimated distribution of γ directly in the bounds estimate of $\beta_1$. 3 We provide a summary of the assumptions underlying our bounds estimates and evidence in their favour in appendix E.4 and a full description of the resampling process in appendix E.5.

Implementing this approach imposes fairly strong data requirements. We require data that capture differential exposure of women to a quasi-experimental change in their pre-pregnancy health, together with measures of the quality of their children. In addition, we need information on the prevalence of twin births in this sample of women. In the following subsections, we describe two studies, one set in the United States and the other in Nigeria, which offer a large and representative sample of women with birth data and intergenerational linkage, and in which we observe the incidence of a quasi-experimental shock to maternal health. In the United States the shock is the introduction of antibiotics in 1937 and in Nigeria it is the Biafra war of 1967-1970. We show how we exploit these cases to estimate γ and its distribution.
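The two-step estimate and its resampled distribution can be illustrated with a short sketch. It is a simplified stand-in, not the procedure run on the census or DHS data: both samples are simulated, no controls or fixed effects are included, the first quantity is a bivariate OLS slope, the second is a raw difference in mean health between twin and non-twin mothers, and all names are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (stand-in for phi_q)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def mean_gap(health, twin):
    """Mean health of twin mothers minus that of non-twin mothers (stand-in for phi_t)."""
    treated = [h for h, t in zip(health, twin) if t == 1]
    control = [h for h, t in zip(health, twin) if t == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def gamma_hat(quality_sample, twin_sample):
    """Product of the two estimates; the samples may come from different data sets."""
    health_q, quality = zip(*quality_sample)
    health_t, twin = zip(*twin_sample)
    return ols_slope(health_q, quality) * mean_gap(health_t, twin)


def bootstrap_gamma(quality_sample, twin_sample, reps=100, seed=0):
    """Resample each data set with replacement and recompute the product each time."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        qs = rng.choices(quality_sample, k=len(quality_sample))
        ts = rng.choices(twin_sample, k=len(twin_sample))
        draws.append(gamma_hat(qs, ts))
    return draws


def simulate_samples(n=2000, seed=1):
    """Simulated inputs (assumed values): phi_q = 0.05 and phi_t = 0.125."""
    rng = random.Random(seed)
    quality_sample = []
    for _ in range(n):
        health = rng.gauss(0, 1)
        quality_sample.append((health, 0.05 * health + rng.gauss(0, 1)))
    twin_sample = []
    for _ in range(n):
        twin = 1 if rng.random() < 0.1 else 0
        twin_sample.append((rng.gauss(0, 1) + 0.125 * twin, twin))
    return quality_sample, twin_sample


if __name__ == "__main__":
    q, t = simulate_samples()
    draws = bootstrap_gamma(q, t, reps=100, seed=2)
    print(gamma_hat(q, t), statistics.fmean(draws), statistics.stdev(draws))
```

The list of bootstrap draws is the object that would serve as the prior distribution for γ in the LTZ approach; its mean and standard deviation summarise the prior when a parametric approximation is preferred.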
We observe that bounds estimates of this type are necessarily case specific (see, for instance, the examples provided in the Conley et al. paper) so, although our approach is of general interest in suggesting a process for bounding when violation of the exclusion restriction is small, the estimates produced here are only representative of the cases examined.

E.1 Estimating γ: A case from the United States
The first antibiotics, sulfonamide drugs, were introduced across the United States in 1937, following clinical trials in London and New York, and no alternative was available until penicillin was introduced during the Second World War. There was immediate and widespread uptake and the drugs were hailed as a "miracle" (Lesch, 2006). Their arrival was associated with a sharp drop in a range of infectious diseases that were treatable by these drugs (Jayachandran et al., 2010). In particular, pneumonia, the leading cause of death among children after congenital causes, fell sharply and this decline was largest among infants. Although there are no direct measures of the adult health of individuals exposed to the antibiotics at birth, it is plausible that infant health improvements persist and generate improvements in adult health; some evidence of this is in Almond et al. (2011), Butikofer and Salvanes (2015), Hjort et al. (2016) and Bhalotra et al. (2015).
What is pertinent for our purposes is whether any improvements in the adult health of women are such as to influence the quality of their children. 4 We therefore estimate this reduced form using the identification strategy of Bhalotra and Venkataramani (2014), but with outcomes of the children of exposed women, rather than the outcomes of the women themselves, as dependent variables. Identification exploits the timing of this shock to health at birth together with the fact that the largest drops in pneumonia occurred in states with the highest initial burdens of disease. This assumes that states with high versus low burdens of pneumonia did not have different trends in the outcomes before the introduction of antibiotics. To demonstrate that this is the case we estimated an event study (see Figure A12). 5 Let m signify the mother, and m + 1 signify her children. Using the United States micro-census files, we estimate equation (A8), where $\phi^q_1$ is an estimate of the change in child quality associated with the mother's exposure to antibiotics in her infancy. The pre-intervention mean pneumonia mortality rate at the state level, s, is denoted $\text{basePneumonia}^m_s$ and is interacted with $\text{Post}_t$, which indicates birth cohorts 1937 and after. We control for race-specific fixed effects for census year t, mother's birth cohort c, and mother's birth state s, as well as state-specific linear time trends. The coefficient of interest is of similar size and significance conditional upon the state and time varying controls (health and education infrastructure, state income) and upon a vector of rates of mortality from control diseases (diseases not treatable with sulfonamides) interacted with the indicator $\text{Post}_t$.
The second step is to estimate the association of the health shock experienced by women at their birth with the probability that they have a twin birth. This is an experimental analogue of the twin non-randomness associations we present in the paper. We take the conditional average rate of baseline pneumonia in the state of residence for all women who give birth to a twin, and the similar conditional rate for non-twin mothers, using the same controls as in equation A8. In other words, we calculate

$$\phi^t_1 = \overline{\text{basePneumonia}}_{\text{twin}=1} - \overline{\text{basePneumonia}}_{\text{twin}=0} .$$

In view of our findings related to twin selection, our expectation is that women with lower exposure to pneumonia at birth will be more likely to have twins, and hence $\phi^t_1 < 0$.
As discussed, with these two quantities in hand, we can estimate γ by taking their product:

$$\hat\gamma_1 = \hat\phi^q_1 \times \hat\phi^t_1 . \qquad \text{(A9)}$$

We can plug this into our estimates of the bounds on $\beta_1$ following Conley et al. (2012), as described earlier.

E.2 Estimating γ in Nigeria
Since we shall proceed to analyse alternative estimators of the QQ trade-off in developing countries and not only in the US, we obtained an estimate of γ from Nigeria. Here, we exploit the exposure of individuals through their growing years to the Nigerian civil war. This was the first modern war in sub-Saharan Africa after independence and one of the bloodiest. It raged in Biafra, the secessionist region in the South-East of Nigeria, from 6 July 1967 to 15 January 1970, killing between 1 and 3 million people and causing widespread malnutrition and devastation. The war created a virtual famine in the South-East, where it was fought, and the effects of under-nutrition were potentially reinforced by trauma and the increased incidence of infections. Akresh et al. (2012, 2016) investigate long run effects of war exposure, exploiting the differential exposure of the Christian Igbo community resident in Biafra relative to other ethnic groups (in other states), interacted with the timing of the war. They show that women exposed to the war were shorter as adults, and more likely to be overweight. As height and obesity are measures of health, they thus establish that the war was a shock to maternal health. We use their identification strategy to estimate impacts on children's education of the mother being exposed to the war in utero, using a continuous measure for the number of months exposed.
The estimated equation is:

$$\text{quality}^{m+1}_{ites} = \alpha + \phi^q_2\,\text{war}^m_{te} + \alpha_t + \theta_e + \lambda_s + \mu_e t + u_{ites} \qquad \text{(A10)}$$

for woman i of ethnicity e born in year t and state s. The indicator of quality is a z-score (standardized by age and gender) for the years of education of children in generation m + 1, and $\phi^q_2$ is the reduced form effect on this of the maternal health shock created by the war. Analogous to the US case, we then estimate

$$\phi^t_2 = \overline{\text{war}}_{\text{twin}=1} - \overline{\text{war}}_{\text{twin}=0} = \left.\frac{\partial\,\text{war}}{\partial\,\text{twin}}\right|_X ,$$

so that we can estimate γ, the twin-mediated effect of maternal health on child quality, as:

$$\hat\gamma_2 = \hat\phi^q_2 \times \hat\phi^t_2 . \qquad \text{(A11)}$$

E.3 Estimated Values for γ in US and Nigeria
The United States. In panel A, we use quasi-experimental variation in the exposure of women to antibiotics in their birth year in early twentieth century America to estimate impacts of mother's health on children's education, cast as a Z-score, with the standardization using the birth cohort distribution. Following equation A8 (and Bhalotra and Venkataramani (2014)), we estimate that the reduced form effect of the mother's exposure is an increase in the child's completed education of 4.97% of a standard deviation, or approximately 0.15 years of education. 6 This estimate is the quantity φ q 1 in equations A8 and A9. In the second column, we show estimates that imply that, conditional upon health and fertility controls, mothers who produce twin births are, on average, in states with 12.5% lower rates of pneumonia. This augments the evidence presented in twin non-randomness tests of this paper, adding a further case of twin births being a function of health conditions.
Following equation A9, in column 3 we interact φ q and φ t to form a consistent estimate for γ of 0.62% of a standard deviation. Bootstrapping this distribution results in an estimated variance of 0.0027. The empirical distribution estimated from 100 bootstrap replications is displayed in Figure A13a, overlaid with an analytical normal distribution with the same mean and variance. When comparing our estimate of γ to IV estimates discussed in section 5.2, we see that the direct effect of having a (healthier) twin mother on child quality (the violation) is considerably smaller than the point estimates of the effect of fertility on child outcomes (the parameter of interest). While it is reassuring that the violation of the exclusion restriction is estimated as small, in that it implies that the instrument is "close to" being exogenous (in Conley et al. (2012)'s terminology), the evidence we provide shows that it is nevertheless sufficient to generate substantively different conclusions regarding the QQ trade-off.
Nigeria. We repeat the procedure for estimating the violation of the exclusion restriction using quasi-experimental variation in the mother's foetal exposure to the Biafra war that was fought in Nigeria in 1967-1970. Results are in panel B of Table A15. The first column presents an estimate of $\phi^q_2$ from equation A10. Children of mothers exposed to the war in utero have 1.54% of a standard deviation less education, equivalent to 0.052 years (compared to children of mothers unexposed to the war in utero). 7 The second column shows that, on average, twin mothers come from states and cohorts that are 26.7% less likely to have suffered war. Together these estimates imply a positive estimate for γ of 0.4% of a standard deviation in education, not dissimilar to the value estimated using a different shock to maternal health in early twentieth century America. The bootstrapped distribution of γ based on 100 replications is displayed in Figure A13b (bootstrap variance 0.0022).
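As a quick arithmetic check of the two reported values of γ, the magnitudes of the column 1 and column 2 estimates quoted in the text can be multiplied directly. The calculation below uses only the rounded figures reported above and works in absolute values, on the assumption that these percentages are the quantities being multiplied; the signs follow from the definitions of each shock and are not re-derived here.

$$|\hat\gamma_{\text{US}}| \approx 0.0497 \times 0.125 \approx 0.0062 \quad (0.62\% \text{ of a standard deviation}),$$

$$|\hat\gamma_{\text{Nigeria}}| \approx 0.0154 \times 0.267 \approx 0.0041 \quad (0.4\% \text{ of a standard deviation}).$$

Both products match the values of γ reported for the two settings.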

E.4 Assumptions and Evidence Underlying the Calculation of Plausibly Exogenous Bounds
The calculation of bounds using Conley et al. (2012)'s plausibly exogenous methodology relies on a number of assumptions relating to the exclusion restriction. We provide a full list of these assumptions, their precedence (be it from Conley et al. (2012)'s methodology or our extension to estimating γ and its empirical distribution), and supporting evidence for each.
1. There exists prior information that implies γ (the violation of the exclusion restriction) is near 0 but perhaps not exactly 0. Precedence: Conley et al. (2012), p. 262. Evidence: Tables 1-2 of our paper documents that twins occur more frequently to healthy women. This renders the exclusion restriction on which the twin-IV rests invalid if, in addition, earlier children of healthy women are higher quality children. Nevertheless, it is unlikely that the violation of the exclusion restriction is very large given that maternal health is only a small part of the production function of child quality.
Akresh et al. (2012). The estimate of γ is formed by taking the product of the column 1 and column 2 estimates. A full description of this process, along with the non-pivotal bootstrap process to estimate the standard error of γ, is provided in this Appendix.

Note to Figure A12: Graph replicates specification (A8), now interacting basePneumonia with each mother's birth year, rather than with a single Post dummy starting from 1937. Each coefficient and confidence interval displays the differential effect of a child's mother being born in a high- or low-pneumonia state by birth year surrounding the sulfa reform. The year preceding the arrival of the sulfa reform (1936) is omitted, and post-sulfa estimates and confidence intervals represent the differential impact of sulfa drugs on second generation (educational) outcomes of children of affected women. Standard errors are clustered by state.

Notes to Figure A13: The empirical distribution is generated by performing J=100 bootstrap replications to estimate $\phi_t$ and $\phi_q$ for each of Nigeria and the USA (see complete discussion in section 3.2). The overlaid analytical distribution in each figure is a normal distribution $\sim N(\mu_{\hat\gamma}, \sigma_{\hat\gamma})$. The estimates for $\phi_t$, $\phi_q$ and γ are displayed in Table A15.

(a) The average value of γ for a particular context can be estimated using a single maternal health shock as a mediator to examine both elements of the violation of the exclusion restriction (twin non-randomness and the effect of maternal health on child quality). Precedence: This paper, equation A7. Evidence: the particular maternal health shock examined is a common factor in both effects. In the simplest case, if we scale a maternal health shock by a fixed parameter (for example considering the effect of being exposed to a 1% reduction in rates of pneumonia or the effect of being exposed to a 10% reduction in pneumonia) these scale effects will be perfectly canceled out in the numerator and denominator of equation A7. To the degree that a large or small health shock impacts maternal health and rates of twinning by a similarly large or small amount, the particular mediator used will produce an identical value for γ. This assumption would be violated if different health shocks have different relative effects on twinning and on child quality, for example a shock which is particularly important for child quality but not for twinning. We return to this point in the caveat below.
(b) The true distribution of γ around its mean can be approximated by a resampling algorithm. Precedence: Conley et al. (2012), p. 265. Evidence: Conley et al. (2012) demonstrate that a simulation-based estimate for the confidence intervals of β can be generated based on resampling of the underlying distribution of interest. In this paper we propose the use of an analytical distribution. This follows if we view our sample of data as the population, and resample from the population, as is typical in bootstrap methods. In both cases (USA and Nigeria) our resampling is based on a representative sample of the full population of mothers, leading to a valid bootstrapped distribution.
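The scale-invariance argument in assumption (a) can be written out in one line. Suppose, as an illustration, that the health shock is rescaled by a fixed constant $k > 0$, so that $\text{health}' = k \cdot \text{health}$ (the constant $k$ is introduced here only for exposition). The effect of one unit of the rescaled shock on quality shrinks by $k$, while the gap in the rescaled shock between twin and non-twin mothers grows by $k$:

$$\phi_q' = \frac{\phi_q}{k}, \qquad \phi_t' = k\,\phi_t, \qquad \text{so that} \qquad \gamma' = \phi_q' \times \phi_t' = \frac{\phi_q}{k} \times k\,\phi_t = \phi_q \times \phi_t = \gamma .$$

The units in which the mediating shock is measured therefore do not affect γ. What would affect it is a shock whose relative importance for child quality and for twinning differs from that of maternal health in general, which is the case raised in the caveat.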
Caveat: If the above assumptions are not met, particularly assumption 2 or any of its parts, our estimate of the bounds on β will no longer be correct. However, as Conley et al. (2012) point out: "It [this method] will produce valid frequentist inference under the assumption that the prior is correct and will provide robustness relative to the conventional approach (which assumes γ ≡ 0) even when incorrect." In the case that the above assumptions are not correct, we provide a full set of bounds over a wider range of values in Figures 2 and 3, to determine the robustness of bounds estimates to (even non-conservative) changes in assumptions of γ.

E.5 Bootstrap Confidence Intervals
The methodology to estimate γ in equations (A9) and (A11) is described in previous sub-sections of this Appendix. In the case of Conley et al.'s UCI approach, this estimate is then sufficient to produce bounds on $\beta_1$, assuming that $\gamma \in [0, 2\hat\gamma]$. We scale $\hat\gamma$ by a factor of 2 in order for this value to fall precisely in the middle of the range. Conley et al. (2012) provide a similar example to calculate the returns to education using the UCI approach. In the case of the more precise LTZ approach (our preferred method) the logic is similar, but now we must form a prior over the entire distribution of γ. Calculating the variance of $\hat\gamma$ is not as straightforward as using the variance-covariance matrix corresponding to each of the estimates $\hat\phi_t$ and $\hat\phi_q$. In this case, however, we can use bootstrapping to calculate J replications of $\hat\phi_t \times \hat\phi_q$, and from these estimates construct an estimated distribution of $\hat\gamma$, which allows us to determine our prior for the distribution of γ. From this empirical distribution, we observe the estimated mean and standard deviation, and finally test whether the distribution is normal using a Shapiro-Wilk test for normality. We also use Kolmogorov-Smirnov tests for equality of distributions to test whether the distribution is more likely to be log normal, uniform, or one of a number of other known analytical distributions. In order to do this, we first estimate the empirical distribution as described previously. We then observe the mean $\hat\mu$ and the standard deviation $\hat\sigma$, and run a one-sample test to determine whether the observed empirical distribution is significantly different from each analytical distribution $N(\hat\mu, \hat\sigma^2)$, $U(\hat\mu, \hat\sigma^2)$ or $\ln N(\hat\mu, \hat\sigma^2)$.
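A one-sample Kolmogorov-Smirnov comparison of this kind can be sketched with the Python standard library alone. The sketch is illustrative rather than a replication: it covers only the normal and uniform candidates, it matches each candidate to the sample mean and standard deviation, and it uses a common large-sample approximation for the p-value. Because the candidate's parameters are estimated from the same sample, the resulting p-values are conservative. The Shapiro-Wilk test is not implemented here, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of the sample and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.05:  # the series converges slowly here; the p-value is essentially 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def compare_fits(sample):
    """KS statistic and p-value against moment-matched normal and uniform laws."""
    mu, sigma = fmean(sample), stdev(sample)
    half_width = math.sqrt(3.0) * sigma  # a uniform with this half-width has sd sigma

    def uniform_cdf(x):
        return min(1.0, max(0.0, (x - (mu - half_width)) / (2.0 * half_width)))

    candidates = {"normal": NormalDist(mu, sigma).cdf, "uniform": uniform_cdf}
    results = {}
    for name, cdf in candidates.items():
        d = ks_statistic(sample, cdf)
        results[name] = (d, ks_pvalue(d, len(sample)))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    fake_draws = [rng.gauss(0.006, 0.05) for _ in range(100)]  # stand-in for bootstrap draws
    for name, (d, p) in compare_fits(fake_draws).items():
        print(f"{name}: D = {d:.3f}, p = {p:.3f}")
```

With only 100 bootstrap replications such tests have limited power to separate similar shapes, which is one reason to also consider using the empirical distribution directly.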
Estimates of the full distribution of γ are presented in Figures A13a and A13b. These are the estimated $\hat\gamma_j$ from $j \in \{1, \ldots, 100\}$ bootstrap replications for γ in Nigeria and the United States. In all cases, when the underlying empirical distribution is tested for equality against the overlaid analytical distribution (uniform, normal, log normal, $\chi^2$), the normal distribution provides the best fit to the empirical distribution. 8 However, the underlying distribution appears not to be perfectly normal, and it appears doubtful that this would be the case asymptotically. Fortunately, Conley et al. (2012) describe a simulation-based estimation method for inference on β in the case of a non-normal distribution for γ. We have followed this methodology using the empirical distribution for γ calculated by bootstrapping. This code has been publicly released as plausexog for Stata (Clarke, 2014). The simulation-based estimation procedure is described fully in Conley et al. (2012), p. 265, as a five step algorithm. The procedure consists of taking repeated draws from the variance-covariance matrix estimated using IV with the plausibly exogenous instrument, and in each case adding to it a draw from the distribution of γ, scaled by a quantity which depends on the strength of the instrument. Conley et al. refer to the underlying distribution of γ as F, and the scale parameter as A, where $A = (X'Z(Z'Z)^{-1}Z'X)^{-1}(X'Z)$. These repeated draws then lead to a large number of estimates for β, the parameter of interest, and a 95% confidence interval is taken by forming $[\hat\beta - c_{1-\alpha/2}, \hat\beta + c_{\alpha/2}]$, where c are percentiles of the distribution of simulated estimates.
Thus, as well as estimating the LTZ case where we assume that γ is distributed $\sim N(\mu_{\hat\gamma}, \sigma^2_{\hat\gamma})$, we can estimate a version fully utilizing the bootstrapped distribution of $\hat\gamma$ described in the previous sub-section. In this case, we use as F, the distribution of γ, the empirically estimated distribution of γ. The simulation-based algorithm then consists of taking $b \in \{1, \ldots, B\}$ draws from the empirically estimated F, as well as B draws from the variance-covariance matrix, and defining the 95% confidence interval based on the 2.5th and 97.5th percentiles of the resulting simulated values for β.
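The simulation step can be illustrated for the scalar case of one endogenous regressor and one instrument. The sketch below is one reading of the algorithm described above, not the plausexog implementation: it assumes the IV estimate is approximately normal around its mean with the reported standard error, treats the scale parameter A as a known scalar (in the just-identified scalar case it equals the inverse of the first stage coefficient), draws γ from the list of bootstrap estimates, and reports the 2.5th and 97.5th percentiles of the simulated values of β. The function names and all numerical inputs are hypothetical.

```python
import random


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list, with p between 0 and 100."""
    last = len(sorted_values) - 1
    k = max(0, min(last, int(round(p / 100.0 * last))))
    return sorted_values[k]


def simulated_interval(beta_hat, se_beta, a_scale, gamma_draws, reps=20000, seed=0):
    """95% interval for beta when the instrument has a direct effect gamma ~ F.

    beta_hat, se_beta : IV point estimate and its standard error
    a_scale           : scalar version of A (assumed known)
    gamma_draws       : list of bootstrap estimates of gamma, used as the empirical F
    """
    rng = random.Random(seed)
    # Draws reflecting sampling uncertainty in the IV estimate.
    noise = [rng.gauss(0.0, se_beta) for _ in range(reps)]
    # Draws from the empirical distribution of gamma.
    gammas = [rng.choice(gamma_draws) for _ in range(reps)]
    # beta_hat is approximately beta + A * gamma + noise, so solve for beta.
    simulated = sorted(beta_hat - a_scale * g + e for g, e in zip(gammas, noise))
    return percentile(simulated, 2.5), percentile(simulated, 97.5)


if __name__ == "__main__":
    fake_gamma = [0.004 + 0.0001 * i for i in range(100)]  # stand-in for bootstrap draws
    print(simulated_interval(-0.1, 0.02, 5.0, fake_gamma))
    print(simulated_interval(-0.1, 0.02, 5.0, [0.0] * 100))  # exogenous benchmark
```

When every draw of γ is zero the interval collapses to the conventional one, and with positive draws of γ and a positive first stage the whole interval shifts towards more negative values of β, which is the direction of adjustment discussed in the paper.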