Early Stimulation and Nutrition: The Impacts of a Scalable Intervention

Abstract Early childhood development is becoming the focus of policy worldwide. However, the evidence on the effectiveness of scalable models is scant, particularly when it comes to infants in developing countries. In this paper, we describe and evaluate with a cluster-Randomized Controlled Trial an intervention designed to improve the quality of child stimulation within the context of an existing parenting program in Colombia, known as FAMI. The intervention improved children’s development by 0.16 of a standard deviation (SD) and children’s nutritional status, as reflected in a reduction of 5.8 percentage points of children whose height-for-age is below -1 SD.

The intervention we designed included two complementary curricula. Both used similar components, actions, and activities to promote better maternal child rearing practices. These included making the mother the agent of change and empowering her to improve her child's development by demonstrating the use of age-appropriate play materials and activities, by providing opportunities to practice with them, and by providing supportive feedback. The program also aimed at training mothers in sensitive and responsive parenting and appropriate behavior management, and encouraged positive mother-child interactions and child maltreatment prevention.
Most of the program content was delivered through the group meetings because they were held weekly. In addition to being spaces in which to demonstrate and practice the use of age-appropriate play materials and language activities, the groups provided opportunities to discuss and practice effective child rearing skills and positive interactions with children alongside other caregivers, to share experiences, to solve problems as a group, and to receive social support. Group meetings also gave mothers the opportunity to discuss how play activities promoted children's development and to learn how to make simple toys so that each family could set up a toy library for home use. Group meetings were one hour long. An average of 5 mothers attended each session (min=1, max=15, SD=2.6).
The home visits were delivered monthly and provided the opportunity to introduce activities that were more difficult (i.e., more specific to a child's developmental level) in the context of the group (such as puzzles and matching activities), additional language activities, and specific ideas on how to use routine home activities to promote child development and how to identify materials in the home that could be used to promote child development. Home visits were, on average, one hour long.
Separate group meetings were offered for pregnant and lactating women with children up to 6 months, mothers with children 6-11 months, and mothers with children aged 1-2 years, and mothers were asked to attend the meeting that corresponded to their child's age. In practice, however, this did not always occur. Anticipating this, we had designed the curriculum so that it could be delivered to groups with children over the entire age range. To this end, the play and language activities were divided into age bands (birth-5 months, 6-11 months, and 1-2 years) by level of difficulty. We expected mothers of children 6-24 months to attend four meetings per month and pregnant and lactating women with children up to 6 months to attend two meetings per month.
Each group session was structured in six different moments: arrival and free play; feedback from the previous group session (10 minutes); song (5 minutes); demonstration and practice of the age-appropriate play and language activities for the week with materials that will be taken home (30 minutes); discussion around a parenting theme or activity (15 minutes); review of the session to ensure that mothers understand the activities, and commitment to practice with children at home (10 minutes); and in closing, they share a snack. The themes for discussion during the group meetings included issues such as the importance of spending time playing with the child, praising the child, talking to the child, things to do at bath time or mealtimes, learning to trust, understanding the child's feelings, teaching the baby about her environment, and child behavior.
Similarly, each home visit consisted of (i) greeting and discussion of any issues, (ii) feedback from the previous home visiting session, (iii) song, (iv) introduction of new play and language activities (including how to integrate them into everyday routines), (v) nutrition message, and (vi) a review of activities to be conducted over the next month.
The curriculum included discussion topics or key parenting messages, age-appropriate activities to promote child development using the play materials, as well as everyday activities to encourage adult-child interactions. It was specifically designed to increase the focus on language development with respect to the original RU curriculum. For example, (1) the language activities were designed to be more structured and to involve demonstration and practice, in the same way as the other play activities, rather than being based on discussion, and (2) we included the use of materials in the language activities to make them more concrete (e.g., when encouraging mothers to label and talk about things in the home environment, we used relevant objects, such as a cup, comb, or chair, during the session for demonstration and practice). It was very rich in play materials (books, pictures to talk about, home-made toys, puzzles, and building blocks), which were used during home visits and in the group meetings. It also included a set of nutrition cards relevant to the children's ages that were discussed with the mother during each home visit. The complete kit of materials had a cost of $27 US per child per year.
In addition to the set of activities and materials, the upgrade of the FAMI program also included a training and coaching component (pre- and in-service training) to support and maintain the quality of home visits and group meetings. Shifting away from a supervision model, the new approach consisted of a team of tutors with degrees in psychology and social work, who provided the initial pre-service training and then continued to provide in-service training and support during the implementation period. Tutors, trained and supervised by the research team, were in charge of training FAMI mothers. Pre-service training was provided sequentially by town, with all FAMI mothers in a town trained simultaneously.
Average training time was 3.5 weeks and 85 hours, although total training time varied by town depending on the number of FAMI units. Specifically, towns with fewer than 5 FAMI units received 75 hours of training in 3 weeks; towns with 6 to 9 FAMI units received 100-125 hours over 5-6 weeks; and towns with more than 10 FAMI units received 150-175 hours of training over 6-7 weeks. The training involved demonstration, practice, and feedback in running the group sessions and conducting the play and language activities with mothers and children, as well as toy making. The one-time cost of the pre-service training per FAMI provider was about $113 US, or $11 US per child.
Tutors also coached FAMI mothers continuously throughout the duration of the intervention. In each supervisory round, they observed one group session and one home visit, and provided feedback to the FAMI mother. Each tutor was in charge of 5 towns and 19 FAMI mothers, on average, and hence met with a FAMI mother every 6 weeks on average. Whenever possible, they also facilitated a group meeting of FAMI mothers in each town to discuss and share positive experiences and challenges and engage in problem-solving. The tutors were supervised by an intervention supervisor (the fieldwork manager, a member of the research team), who conducted visits with each tutor every 2 months.
Training costs included: tutors' salaries (plus fringe benefits), the fieldwork manager's salary, traveling expenses of tutors and supervisor, the cost of the tutors' training by the research team (accommodation, transportation, food, trainer's traveling expenses and accommodation, materials), and incentives offered to FAMI mothers for their participation ($14 US per FAMI). This was costed for 5 months, which was the total duration of training (for both trainers and FAMI mothers). Intervention fixed costs, such as this one, were amortized over 171 treated FAMI units, assuming an average FAMI size of 10 children. Mentoring/coaching costs were similar to training costs and included: tutors' salaries, the fieldwork manager's salary, and travel expenses for both tutors and supervisor. The difference is that mentoring is a recurrent cost for every month in which it took place. Overall, the cost of coaching was $82 US per month per FAMI provider, or $8 US per child per month. Coaching is provided during 11 of the 12 months of the year, which amounts to a total cost of $88 US per child per year.
In addition to the introduction of the early stimulation curriculum, the intervention also included a nutritional component, which comprised the delivery of a monthly nutritional supplement to FAMI participants, and psychoeducation around feeding and nutrition during group meetings and home visits. The nutritional supplement corresponded to 35% of daily calorie intake requirements for pregnant women, breastfeeding mothers, and children younger than 2 years of age (for 30 days). The cost of the package is $26 US per month including shipping costs. It contains tuna, sardines, canola oil, iron-fortified whole milk, beans, and lentils. The unenhanced version of the program delivers nutritional supplementation in the amount of $7 per month, so the additional cost associated with this intervention is $19 per month. The nutritional supplement is delivered for 11 of the 12 months of the year, which results in a total cost of $209 per child per year. In terms of educational contents, we developed a cookbook that takes into account the socioeconomic characteristics of households in our sample, brochures on food handling and classification, and 19 nutrition cards that were discussed with the mother during each home visit. Mothers received a nutrition card relevant to their child's age at these monthly home visits. The topics covered included breastfeeding, bottle-feeding, breastmilk extraction and storage, weaning, hygiene, finger foods, menu ideas, mealtimes, and chatting while feeding.
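The per-child figures quoted above can be recombined with a few lines of arithmetic. The sketch below is only an illustration: the variable names are ours, and rounding the monthly coaching cost to $8 per child before annualizing is an assumption that mirrors how the text reports it. The rounded components add up to $324 per child per year, slightly above the figure of about $322 given in Section B.1, presumably because of rounding in the underlying accounts.

```python
# Figures quoted in the text (US dollars). Variable names are illustrative only.
CHILDREN_PER_FAMI = 10          # average FAMI size assumed in the text
MONTHS_DELIVERED = 11           # coaching and supplement run 11 of 12 months
UNENHANCED_PROGRAM_YEAR = 327   # per child per year, from Section B.1

kit_per_child_year = 27
preservice_per_fami = 113
coaching_per_fami_month = 82
supplement_enhanced_month = 26
supplement_unenhanced_month = 7

# One-off pre-service training cost per child (reported as about $11).
preservice_per_child = preservice_per_fami / CHILDREN_PER_FAMI

# Assumption: round the monthly per-child coaching cost to $8, as the text does.
coaching_per_child_month = round(coaching_per_fami_month / CHILDREN_PER_FAMI)
coaching_per_child_year = coaching_per_child_month * MONTHS_DELIVERED

# Only the increment over the unenhanced supplement is attributed to the upgrade.
supplement_extra_month = supplement_enhanced_month - supplement_unenhanced_month
supplement_extra_year = supplement_extra_month * MONTHS_DELIVERED

pedagogical_per_child_year = kit_per_child_year + coaching_per_child_year
recurrent_per_child_year = pedagogical_per_child_year + supplement_extra_year
pedagogical_share = pedagogical_per_child_year / UNENHANCED_PROGRAM_YEAR

print(f"pre-service (one-off): ${preservice_per_child:.1f} per child")
print(f"coaching: ${coaching_per_child_year} per child per year")
print(f"extra supplement: ${supplement_extra_year} per child per year")
print(f"recurrent total: ${recurrent_per_child_year} per child per year")
print(f"pedagogical share of unenhanced cost: {pedagogical_share:.0%}")
```

The last line recovers the roughly 35% share that Section B.1 attributes to the pedagogical enhancement (kit plus coaching, excluding the supplement).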

B.1. Costs
The intervention we study costs about $322 per year per child plus $11 one-off cost per child for FAMI pre-service training. The cost of the unenhanced FAMI program is about $327 per child per year. The pedagogical enhancement (excluding the nutritional supplement) corresponds to approximately 35% of the cost of the unenhanced version of the program. This is equivalent to 1.5 monthly minimum wages per child per year, or 2.5 monthly minimum wages per year including the nutritional supplement. For comparison, the cost per child per year in center-based childcare in Colombia is approximately $1,100 or 4.4 monthly minimum wages per child per year.² In Table B.1 we compare the costs and impacts of other interventions recently implemented in Colombia, with FAMI. Bernal (2015) studies the impact of vocational training of the women running the family nurseries considered. She reports a sizeable impact at a low cost per child. Bernal et al. (2019) consider the transfer of children from home-based daycare services offered in the provider's own home to large childcare centers and find virtually no impacts at a very large cost.³ Finally, Andrew et al. (2019) study the impacts of (1) targeted pedagogical improvements to center-based care in large cities and (2) staffing of these centers with nutritionists and psychologists. The impacts are comparable to ours at a slightly higher cost for the pedagogical component. Incidentally, the hiring of professional personnel in centers had no effects on children's cognition.
2. Cost computations are not adjusted to consider the possibility of crowding in different programs since the intervention we evaluate did not affect the probability of leaving a FAMI to join the new program MF (although it did affect the probability of leaving a FAMI to join an alternative ECD service).
3. The cost reported in the table corresponds to the difference in the cost per child/year in a childcare center and the cost in a family nursery.
This summary highlights the importance of enhancements to what is known in the specialized literature as process quality (such as the integration of a structured curriculum and improved interactions between caregivers and children supported by coaching and mentoring) relative to changes and improvements in the so-called structural quality alone (such as infrastructure, as in Bernal et al. (2019); or staffing, as in the second intervention studied in Andrew et al. (2019)). In particular, the former seems to be more cost-effective than the latter. In sum, the evidence we have presented shows that it is possible to gradually improve the quality of nationwide programs at scale in a way that is affordable, while maintaining quality and with a reasonably sized impact on children's developmental outcomes.

B.2. Scalability
When thinking about the scalability of the FAMI improvement, we consider the possibility of using the already existing supervision infrastructure at ICBF. The national ICBF office works with 33 regional offices and 203 local (municipality-level) offices. Each local office has three different teams in charge of the main tasks. One of the teams, called the community team, is composed of 1 nutritionist, 2 pedagogues, 1 sociologist, 1 social worker and 1 educator. This team oversees supervision and monitoring of ECD services. The other two teams are responsible for child protection services and intra-household legal issues. Supervision and monitoring of ECD providers takes place through regular on-site visits structured around a checklist. This checklist is based upon items that we would consider structural quality features, such as the physical characteristics of the center, the cleanliness of bathrooms, furniture, etc. It does not cover process quality aspects related to the quantity and nature of the interactions in the classroom (Bernal 2015). A bad evaluation may result in the closure of a center or the need to improve specific aspects to be re-checked within a certain period.
Based on our findings, we consider it would be extremely useful and effective to replace this type of supervision with one that focuses on process quality and uses a reflective approach that encourages best practices. As shown, this could be done by promoting the use of a structured curriculum and session guidelines that identify key features (both in content and form of delivery) that are relevant for child outcomes. These items could be assessed by supervisors, who could provide constructive feedback rather than penalties on the basis of their observations. In this way, continuous improvement would be supported.
There are close to 14,600 FAMI units countrywide. We think it would be possible to train 3 out of the 6 community team members (1 sociologist or 1 educator plus 2 pedagogues) in each local office. The quality and background of these team members are similar to those of the tutors we used in the intervention we evaluated. Each of these supervisors would be in charge of 24 FAMI mothers, a caseload similar to the one we implemented and tested. Similarly, one person in each of the 33 regional offices could be trained by the national office to become a trainer of trainers/supervisors.
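The staffing arithmetic behind this proposal can be checked directly. The sketch below uses only the figures quoted in the text (about 14,600 FAMI units, 203 local offices, 3 trained team members per office, 24 FAMI mothers per supervisor); the variable names are ours, and treating one FAMI unit as one FAMI mother is an assumption.

```python
import math

FAMI_UNITS = 14_600        # approximate number of FAMI units countrywide
LOCAL_OFFICES = 203        # local (municipality-level) ICBF offices
TRAINED_PER_OFFICE = 3     # community team members trained as supervisors
CASELOAD = 24              # FAMI mothers per supervisor

# Assumption: each FAMI unit is run by exactly one FAMI mother.
supervisors_available = LOCAL_OFFICES * TRAINED_PER_OFFICE
supervisors_needed = math.ceil(FAMI_UNITS / CASELOAD)
coverage_capacity = supervisors_available * CASELOAD

print(f"supervisors available: {supervisors_available}")
print(f"supervisors needed:    {supervisors_needed}")
print(f"FAMI units covered:    {coverage_capacity}")
```

Under these figures, three trained staff per local office supply just enough supervisors to cover every FAMI unit at the proposed caseload, with essentially no slack.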
In our pilot, one senior member of the research team trained a young psychologist (our fieldwork manager) in the curriculum. Both trained the 9 tutors full-time for three weeks. After that, the senior researcher was engaged only sporadically, mostly to supervise and provide feedback to the fieldwork manager. Each tutor trained approximately 20 to 23 FAMI mothers in sessions that were also approximately 3 weeks long.
We think it would be feasible to implement a similar cascaded training and supervision process in a potential scale-up. Remote online training could be used as a more efficient way of providing pre- and in-service training at the first level of the cascade. Practice in pairs could be supervised online, and practice with children could be videotaped so that the trainer can provide feedback on the recordings. If required, part of the training could be in person.
Members of the research team have experience with similar at-scale training strategies for other purposes. In particular, in 2017 we piloted a training scheme that combined in-person and online activities for staff in local offices to routinely collect data on child development (the TVIP test) nationwide. Although connectivity was particularly problematic at the time, this constraint might be easing as digital services expand in response to the COVID-19 pandemic.

Appendix C: Study Design: Power Calculations and Final Sample
Power calculations assumed program effects of 0.25 of a SD relative to the control group on the Bayley-III. These were obtained using an average of 4 FAMI units per town and 4 children aged 0-12 months per FAMI. We assumed an intra-class correlation within towns of 0.04 (in the Bayley-III scale and conditional on observables), based on the data in Attanasio et al. (2014) for a similar study in Colombia. This sample design provided 95% power at the 5% significance level, allowing for an attrition rate of 10%.
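To see how these inputs interact, the sketch below applies the textbook design-effect formula for a two-arm cluster-randomized trial. It is not the calculation used for the study, whose exact formula is not reported here. It assumes a standardized outcome (SD of 1), a two-sided test, equal-sized arms, towns as clusters, no covariate adjustment, and attrition that shrinks every cluster proportionally. The function name and signature are hypothetical.

```python
import math
from statistics import NormalDist


def towns_per_arm(effect_size, cluster_size, icc,
                  power=0.95, alpha=0.05, attrition=0.10):
    """Clusters needed per arm under the standard design-effect approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    retained = cluster_size * (1 - attrition)       # children left per town
    design_effect = 1 + (retained - 1) * icc        # variance inflation from clustering
    n_simple = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    n_clustered = n_simple * design_effect          # children needed per arm
    return math.ceil(n_clustered / retained), design_effect


# 4 FAMI units per town x 4 children per FAMI = 16 children per town.
towns, deff = towns_per_arm(effect_size=0.25, cluster_size=4 * 4, icc=0.04)
print(f"design effect: {deff:.3f}, towns per arm: {towns}")
```

With the inputs quoted in the text, this approximation gives a design effect of about 1.54 and 45 towns per arm. Conditioning on baseline covariates, as the text indicates, would typically lower the required sample.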
Towns in the final sample had an average of four FAMI units (SD of 2.3, range between 1 and 13), which translates into a total of 171 FAMI units that received treatment and 169 FAMI units that remained as controls. Figure C.1 presents the c-RCT flow chart, which shows how the final study towns were selected. Figure C.2 depicts the final geographic location of the sample.
Figure C.1. Study's flow chart. (a) Once in the field for data collection, we realized some towns did not have any FAMI units as they had made the transition to other public parenting programs (Modalidad Familiar or MF). (b) Towns in the list of 39 towns excluded initially from the sample were randomly ranked and used as replacements. However, we did not have enough replacement towns in all randomization strata.

Journal of the European Economic Association
Preprint prepared on 13 January 2022 using jeea.cls v1.0.

Appendix D: Construction of Bayley III-factor and Parental Investment
As mentioned in Section 3.1, we used the Bayley-III age-standardized cognitive, receptive and expressive language, and fine and gross motor scales to measure child development. For parental investment, we used the number of magazines, books, or newspapers in the home, the number of toy sources, the number of varieties of play materials in the home, and the number of varieties of play activities the child engaged in with an adult over the three days before the interview. As these are noisy measures of child development and parental investment, we follow Cunha and Heckman (2008) and Heckman et al. (2013) and implement a dedicated measurement system that links each observed measure to one latent factor.
We define $M^k$ as the number of measures of the $k$-th latent factor (i.e., child development or parental investment), and $m_j^k$ as the $j$-th measure of the $k$-th latent factor. Assuming each measure is additively separable in the logarithm of the latent factor, we specify:
$$m_j^k = \mu_j^k + \alpha_j^k \ln \theta^k + \varepsilon_j^k,$$
where the terms $\mu_j^k$ are the intercepts scaled to zero, $\alpha_j^k$ are the loadings, $\theta^k$ is the $k$-th latent factor, and $\varepsilon_j^k$ are the mean-zero measurement error terms, assumed to be independent of the latent factors and of each other. This specification assumes that the measurement system is invariant across treatment status.
As the latent factors are unobserved, they have no natural scale and identification requires normalizations (Anderson and Rubin 1956). We set the scale for parental investment by setting $\mathrm{Var}(\ln \theta^{PI}) = 1$, and for the child development factor by setting the loading of the Bayley-III cognitive scale to one, that is, $\alpha_1^{CD} = 1$. Finally, we set the location by setting $E[\ln \theta^k] = 0$ for $k \in \{CD, PI\}$, where CD and PI refer to child development and parental investment, respectively.
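Because the measurement error is assumed independent of the latent factor, the variance of each measure splits into a signal part and a noise part. The expression below sketches the implied signal share, writing $\alpha$ for the loading, $\theta$ for the latent factor and $\varepsilon$ for the measurement error. The symbol $s_j^k$ is introduced here for illustration only and is not taken from the paper.

```latex
\operatorname{Var}(m_j^k)
  = (\alpha_j^k)^2 \operatorname{Var}(\ln \theta^k) + \operatorname{Var}(\varepsilon_j^k),
\qquad
s_j^k
  = \frac{(\alpha_j^k)^2 \operatorname{Var}(\ln \theta^k)}
         {(\alpha_j^k)^2 \operatorname{Var}(\ln \theta^k) + \operatorname{Var}(\varepsilon_j^k)} .
```

A value of $s_j^k$ well below one indicates that much of the variation in that measure is noise, which is the situation in which combining several measures into a latent factor is most useful.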
In Table D.1 we look at the fraction of the variance in each measure that is explained by the variance in signal. All measures are far from having 100% of their variance accounted for by signal, which illustrates the usefulness of the latent factor approach.

Attrition. In Table F.1, we analyze attrition between baseline and follow-up. The table indicates that children in treated FAMIs were more likely to be lost at follow-up, although this difference is only marginally significant, both with and without controls for a number of baseline characteristics. The table reports estimates obtained with a Linear Probability Model.
Compliance. In Table F
Notes: * p < 0.10, ** p < 0.05, *** p < 0.01. Median = 21 activities (group sessions and home visits). n1 and n0 are the number of observations above and below the median, respectively. Standard errors clustered by town in parentheses. (a) The wealth index was computed as the principal component of a set of dichotomous variables that describe characteristics of the household, ownership of durable goods, and access to public utilities. (b) Factor score of FCI items.

Appendix G: Main Impacts (ITT) Corrected for Attrition
In Table G.1, we report the estimates of the impact we obtain by modeling attrition as a selection process. The identifying assumption is that the quality and motivation of the interviewers (randomly assigned to both treatment and control towns) do not affect the outcome of interest but do determine the probability of attrition. We find that interviewer dummies are a significant determinant of attrition and that the results on the impacts of the program are virtually unaffected by nonrandom attrition.
Notes: * p < 0.10, ** p < 0.05, *** p < 0.01. Confidence intervals in parentheses for two-tailed tests. Standard errors clustered by town. p-values are computed using the Romano-Wolf (RW; Romano and Wolf 2005, 2016) step-down procedure. We consider 3 hypotheses for children's outcomes. Exclusion restrictions: interviewer fixed effects at baseline and assigned interviewer fixed effects at follow-up. First stage F-stat = 11.24. Covariates: child's gender, an indicator of high household wealth index, maternal PPVT score, teenage mother, an indicator of high municipality population, previous attendance to a child care center, department fixed effects, and baseline weight-for-age and height-for-age Z-scores. Bayley-III factor is the factor score of the age-standardized Bayley-III scales. ASQ:SE Total Score is the age-standardized ASQ:SE score.

Appendix H: Evaluation of an Integrated Intervention Targeted at Deprived Pre-School Children in Rural Colombia: Pre-Analysis Plan

H.1. Introduction
This document outlines a pre-analysis plan (study design, hypotheses to be tested, and data and specifications to be used) for evaluating the impact of the integration of a structured curriculum (for both home visits and group meetings) to promote young children's development into the FAMI (Hogares Comunitarios - Modalidad Familia, Mujer e Infancia) parenting program for disadvantaged families in rural Colombia. The program is being implemented by the research team in cooperation with the National Family Welfare Agency (Instituto Colombiano de Bienestar Familiar, ICBF). The intervention and evaluation have been funded by Grand Challenges Canada (GCC) and the Fundación Éxito (FE). Baseline data collection took place between August and November 2014. Follow-up data collection will take place between April and July 2016. The intervention ran from September 2014 through March 2016. This plan has been written up prior to follow-up data processing, serving as a pre-commitment for subsequent analysis. This document is structured as follows. Section 2 reviews the intervention and evaluation design. Section 3 enumerates the hypotheses to be tested as part of the study and the data we will use to test them. Finally, Section 4 outlines the empirical specification(s) to be used in analyzing the data and other data management issues.

H.2. Overview of the Study: Interventions and Evaluation Design
Hogares Comunitarios - Modalidad Familia, Mujer e Infancia (FAMIs) are small community centers located in areas of high social and economic vulnerability in semi-urban and rural areas of Colombia, where pregnant women and parents of children younger than two years of age receive training regarding parental practices, including family relationships, pregnancy, breastfeeding, nutrition, health, and the upbringing of young children. The program targets socioeconomically vulnerable pregnant women, nursing mothers, and parents of children less than 2 years of age. Program eligibility is defined by the national proxy means test (PMT) known as SISBEN, which classifies households into socioeconomic vulnerability levels based on a household survey. A front line worker known as a Community Mother (MC) works 80 hours per month for the program. In particular, she devotes 32 hours to parental training in group sessions; 20 hours to home visits (minimum 1 visit per family per month); 8 hours to training for the MC; and 20 hours to planning activities, documentation, and transportation time. Group sessions are held separately by subgroup according to the child's age: children from 0 to 5 months old, children from 6 to 11 months old, children 1 to 2 years old, and pregnant women. Each FAMI unit has an average of 12 to 15 beneficiaries.
The FAMI program has two main components: (i) the provision of a nutritional supplement that should cover 20 to 25% of the daily nutritional requirements of the child or the pregnant woman; and (ii) training of beneficiary families on parental practices and child development from gestation up to age two, particularly regarding nutrition, socioemotional development, health, maternal health, and early stimulation, through group sessions and one-on-one home visits.
Our study offers a rigorous evaluation, by Randomized Controlled Trial (RCT), of the short-term impacts of the integration of a structured curriculum (for both home visits and group meetings) into the FAMI program and the addition of rigorous training and supervision protocols for front line workers.
H.2.1. The Intervention. Based on a rigorous study of the weaknesses of the program and a small pilot of the improvements that were to be developed, the team designed an integrated upgrade of the FAMI program. The upgrade consists of 3 components. First, the implementation of a structured but feasible, flexible, and effective curriculum (both for home visits and group meetings) focused on encouraging mother-child interactions and maternal self-efficacy, teaching mothers how to promote their children's development, and promoting maternal self-esteem and mental health. This component is complemented by the provision of pedagogical materials such as puzzles and books, materials for home-made toys, and toy making workshops during group sessions. Second, the improvement of MCs' training, provided by professional tutors trained by the research team, in order to guarantee the fidelity of program implementation, and the addition of a supervision and coaching protocol for front line workers to be delivered throughout the duration of the intervention. Third, the delivery of an additional nutritional supplement (increased intake of calories, protein, vitamins, and minerals), complemented with psychoeducation around feeding and nutrition during sessions, and informative materials to promote healthy nutritional habits.

H.2.2. Sample and Evaluation Design. The evaluation sample consists of 1,466 children 0-12 months at baseline and 553 pregnant women in 340 FAMIs located in 87 municipalities of three Colombian departments of the Andean region: Boyacá, Santander and Cundinamarca. These municipalities were selected for their geographical location, their semi-rurality and rurality conditions and the presence (or not) of other similar ECD services such as another parenting program similar to the one included in this study, known as MF or Modalidad Familiar.
From a universe of 151 eligible municipalities (i.e., municipalities with less than 40,000 inhabitants and at most one MF that belong to the Boyacá, Santander, Cundinamarca, and Tolima departments), we selected all municipalities with no MFs and then selected from the remaining municipalities with at most one MF, striving to achieve distributed geographic coverage, until we reached 96 towns. Then we ran a stratified randomization based on (i) MF (Modalidad Familiar) presence, (ii) department, and (iii) population size (less than or more than 10,000 inhabitants), resulting in 49 municipalities in the treatment group and 44 in the control group. We then dropped FAMIs that were transitioning or were going to transition to the new version of the program (MF) and, as a consequence, had to drop all study municipalities located in the department of Tolima since no control municipalities were left for this group. This procedure implies a final sample of 41 municipalities with 169 FAMIs in the control group and 46 municipalities with 171 FAMIs in the treatment group.
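The stratified randomization described above can be sketched as follows. This is a minimal illustration rather than the procedure actually run for the study: the town records are synthetic, the function name and seed are hypothetical, towns with exactly 10,000 inhabitants are placed in the larger group by assumption, and the rule for odd-sized strata (a coin flip for the extra town) is also an assumption.

```python
import random
from collections import defaultdict


def stratified_assignment(towns, seed=2014):
    """Assign towns to arms within strata of (MF presence, department, size)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for town in towns:
        # Assumption: towns with exactly 10,000 inhabitants count as "large".
        key = (town["has_mf"], town["department"], town["population"] >= 10_000)
        strata[key].append(town["name"])

    assignment = {}
    for key in sorted(strata):
        names = sorted(strata[key])
        rng.shuffle(names)
        n_treated = len(names) // 2
        # Assumption: in odd-sized strata the extra town is allocated by coin flip.
        if len(names) % 2 == 1 and rng.random() < 0.5:
            n_treated += 1
        for position, name in enumerate(names):
            assignment[name] = "treatment" if position < n_treated else "control"
    return assignment


# Synthetic towns for illustration only (not the study's sampling frame).
departments = ["Boyacá", "Santander", "Cundinamarca", "Tolima"]
towns = [
    {
        "name": f"town_{i:02d}",
        "department": departments[i % 4],
        "has_mf": i % 3 == 0,
        "population": 4_000 + 1_500 * (i % 7),
    }
    for i in range(24)
]
assignment = stratified_assignment(towns)
```

Randomizing within strata keeps the two arms balanced on the stratifying variables by construction. It also explains why a whole stratum group (here, a department) has to be dropped if later exclusions leave it without control towns.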
Baseline data collection on the children and the pregnant women, their households, the MCs and the centers they attend took place between August and November 2014. Follow-up data collection will take place from April to July 2016. We collected baseline data directly in participants' homes, except in those instances in which it was not possible to interview the mother of the child in her own home, in which case the interview took place in community centers at the town's urban center such as schools, churches or in the FAMI.
The analysis of baseline data shows that the sample is balanced across the evaluation groups. Whilst there are some significant differences in children's nutritional status and a few socio-demographic characteristics (e.g., presence of the father and mother's TVIP and personality), and in some sociodemographic characteristics of MCs, none of these differences occur systematically in one of the study groups or point towards a specific direction of bias.

H.3. Hypotheses to be Tested and Data
We have collected (at baseline) and are collecting (at follow-up) a rich set of data and we will use it to test a series of hypotheses concerning the impacts of the intervention under study. We present the study hypotheses in two groups: impact on children's outcomes and impact on mothers' parenting abilities and on the learning environment at home.

H.3.1. Hypotheses Group A: Impact on Children's Outcomes. The treatment may have positive average impacts on outcomes for children attending the program. We group these outcomes into two areas: children's development and children's nutritional status.
We consider a number of domains within development, namely some aspects of cognition, language, and motor and socio-emotional development. We next list the specific hypotheses by domain of development, and detail the specific tests (and scales) we will use to measure them and how we will process the data. We would, however, like to flag two considerations before proceeding.
On the one hand, we would like to clarify that we plan to use factor analysis (on standardized scores) to determine the most appropriate way of combining the various tests and scales collected into "constructs". The reason for this is that child development is composed of many different dimensions that are interrelated. For example, even if a vocabulary test should be viewed as an achievement test as opposed to a measure of raw ability (given that it measures acquired vocabulary), it nevertheless correlates well with ability. Hence, it is very difficult to establish a priori the most sensible way to organize the data, and we plan to rely on factor analysis to combine data that capture common underlying constructs (data that would be thought to go together on theoretical grounds). To do this, we will follow standard protocols in the use of factor analysis. First, we will construct as many factors ("constructs") as there are eigenvalues larger than 1. Next, we will only use outcomes (scales or tests) with factor loadings larger than 0.4 in the construction of these factors. Hence, the following categorization of scales and tests in domains is for the purpose of illustrating our hypotheses and may be modified as a result of the outcomes of factor analysis.
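The two retention rules just described (eigenvalues larger than 1, loadings larger than 0.4) can be illustrated on a small correlation matrix. The sketch below is only an approximation of the protocol: it uses principal-component extraction via power iteration as a stand-in, since the plan does not state the extraction method, and the correlation matrix is made up. All function names are hypothetical.

```python
import math
import random


def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector (symmetric PSD input)."""
    n = len(matrix)
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]   # random start vector
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm < 1e-12:                             # nothing left to extract
            return 0.0, vec
        vec = [x / norm for x in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * p for v, p in zip(vec, prod)), vec  # Rayleigh quotient


def retained_factors(corr, min_eigenvalue=1.0, min_loading=0.4):
    """Keep components with eigenvalue > 1 and measures with |loading| > 0.4."""
    n = len(corr)
    work = [row[:] for row in corr]
    factors = []
    while len(factors) < n:
        value, vec = top_eigenpair(work)
        if value <= min_eigenvalue:
            break
        loadings = [math.sqrt(value) * v for v in vec]
        kept = [i for i, load in enumerate(loadings) if abs(load) > min_loading]
        factors.append({"eigenvalue": value, "loadings": loadings, "measures": kept})
        for i in range(n):                           # deflate the extracted component
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return factors


# Made-up correlations: three related scales and one weakly related scale.
corr = [
    [1.0, 0.5, 0.5, 0.1],
    [0.5, 1.0, 0.5, 0.1],
    [0.5, 0.5, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
factors = retained_factors(corr)
for number, factor in enumerate(factors, start=1):
    print(number, round(factor["eigenvalue"], 3), factor["measures"])
```

In this made-up example only one component passes the eigenvalue rule, and the fourth scale is dropped from it because its loading falls below 0.4. This is the kind of regrouping of scales that the text anticipates.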
In addition to the main analysis based on the impacts on the factors we have identified, we will also report impacts on the individual tests, correcting our p-values for multiple hypotheses testing, using the Romano-Wolf step-down procedure (Romano and Wolf 2005), as further discussed in Section 5.
On the other hand, note that the following child development outcomes, listed under Hypotheses Group A, will be collected at follow-up only (as they were not suitable for administration to children at baseline given their age) by direct administration to children by a trained psychologist at a community venue in each town. An expert psychologist with extensive training and practice in the collection of the assessment instruments included in this study trained a group of 10 psychologists who administered the assessments in the field. The only exception to this is the ASQ:SE, which is collected by maternal report (a direct interview with the mother, included as part of the household questionnaire). Children were 0-12 months of age at baseline and will be 15-29 months of age at follow-up, depending on the exact time at which they are assessed. At baseline we collected different child outcomes, which will be used as baseline controls, as explained in Section 4.
H H.1. The treatment is likely to have a positive average impact on children's cognitive development, language development, and motor development.
These domains will be assessed using the indicators listed next.
-The scales will be administered and scored as indicated in the Bayley-III administration manual. Higher scores indicate higher cognitive abilities. However, given the process of development, scores are also likely to increase with age.
• Language Development: we will assess both receptive and expressive language.
-Receptive language will be assessed using the Spanish version of the Receptive Communication Scale of the Bayley Scales of Infant and Toddler Development, Third Edition (Bayley-III) (Bayley 2006). The scale evaluates pre-verbal behaviors and vocabulary development, i.e., the ability to identify objects and images that are being referenced.
-The scale will be administered and scored as indicated in the Bayley-III administration manual. Higher scores indicate higher receptive language development. However, given the process of development, scores are also likely to increase with age.
-Expressive language will be assessed using the Spanish version of the Expressive Communication Scale of the Bayley Scales of Infant and Toddler Development, Third Edition (Bayley-III) (Bayley 2006). The scale evaluates pre-verbal communication, such as babbling, gesticulation, joint referencing and early talking; and vocabulary development, such as naming objects, images and attributes (e.g., color and size).
-The scale will be administered and scored as indicated in the Bayley-III administration manual. Higher scores indicate higher expressive language development. However, given the process of development, scores are also likely to increase with age.
• Motor Development: we will assess both fine and gross motor development.
• Socio-emotional development: we will assess socio-emotional development using the Ages and Stages: Socio-Emotional Questionnaire (ASQ:SE), which screens several socio-emotional areas such as self-regulation, compliance, communication, adaptive behaviors, autonomy, affect, and interaction with people, for children 15-29 months at follow-up by parental report.
-The scale will be administered and scored as indicated in the ASQ:SE manual.
-While scores should not be age dependent, we will remove any lingering age effect by standardizing the scores internally, i.e., using the empirical mean and standard deviation of the sample distribution, estimated using non-parametric regression methods.
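The internal standardization mentioned above can be sketched as follows. This is only one possible reading: it assumes a Gaussian-kernel (Nadaraya-Watson) regression of the score on age for the mean, and of the squared residual on age for the variance. The kernel, the bandwidth of 2 months, and the function names are illustrative assumptions rather than choices stated in this plan.

```python
import math

def kernel_mean(ages, values, at, bandwidth):
    """Nadaraya-Watson estimate of E[value | age = at] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((a - at) / bandwidth) ** 2) for a in ages]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def age_standardize(ages, scores, bandwidth=2.0):
    """Internal z-scores: (score - mean at that age) / (SD at that age)."""
    means = [kernel_mean(ages, scores, a, bandwidth) for a in ages]
    sq_resid = [(y - m) ** 2 for y, m in zip(scores, means)]
    # Age-specific SD from a second kernel regression on squared residuals.
    # Assumes scores vary within each age neighbourhood (otherwise the SD is zero).
    sds = [math.sqrt(kernel_mean(ages, sq_resid, a, bandwidth)) for a in ages]
    return [(y - m) / s for y, m, s in zip(scores, means, sds)]
```

When all children have the same age, the procedure reduces to an ordinary z-score using the sample mean and (population) standard deviation.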

H H.3. The treatment is likely to have a positive average impact on children's nutritional status.
We expect the intervention to improve nutritional status through the addition of a nutritional supplement, the education provided about feeding and nutrition during sessions and home visits, and the provision of informative materials to promote healthy nutritional habits. Children's nutritional status is measured both at baseline and at follow-up by personnel who are trained by an expert nutritionist and assessed for reliability. In particular, we collected information on height, weight and body mass index (BMI) following World Health Organization (WHO) standards (World Health Organization 2006, 2007) for all children in our sample, both at baseline and follow-up. Based on these measures, we will construct a variety of nutritional indicators depending on the child's age and based on World Health Organization (2006, 2007) standards.
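As an illustration of how such indicators are typically built, the sketch below computes BMI and a z-score under the LMS form in which growth references are commonly tabulated (an assumption about the computation, not a description of the software used in this study). The parameters `l`, `m` and `s` must come from the age- and sex-specific reference tables; the numbers in the usage note are placeholders, not WHO values.

```python
import math

def bmi(weight_kg, height_m):
    """Body mass index in kg per square metre."""
    return weight_kg / height_m ** 2

def lms_zscore(measurement, l, m, s):
    """z-score from LMS parameters: Box-Cox power l, median m, coefficient of variation s."""
    if l == 0:
        return math.log(measurement / m) / s
    return ((measurement / m) ** l - 1) / (l * s)

def below_cutoff(z, cutoff=-1.0):
    """Indicator for a z-score below a cutoff (e.g. height-for-age below -1 SD)."""
    return z < cutoff
```

For example, with placeholder parameters `l=1`, `m=80.0` and `s=0.04`, a length of 76.8 cm gives a z-score of about -1, and `below_cutoff` flags any child measuring less than that.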

H H.4. The treatment is likely to have a positive average impact on children's food insecurity status.
We expect the intervention to improve the household's and the child's food insecurity status as a result of the nutritional supplement delivered with the program. We assess food insecurity status using the Latin-American and Caribbean Nutritional and Food Insecurity Scale (ELCSA) adapted and validated for this population.
H.3.2. Hypotheses Group B: Impact on the Mother's Parenting Skills and the Learning Environment at Home. The intervention is more likely to improve children's outcomes if the children's mothers effectively internalize the message and training delivered by the program's new curriculum and put these lessons into practice with their own children. In particular, the program is likely to be more effective if maternal self-esteem, mental health and motivation improve as a result of program participation; if the quality of the home environment improves; if discipline practices at home are more appropriate; and if parents devote more time to stimulating activities with their children, such as reading and playing. Also, these changes are more likely to occur if parents participate more regularly in the program's group sessions and home visits.
We will test the hypotheses that the intervention will have an impact on mother's parenting skills, parental knowledge and perceptions, parental self-efficacy, mental health, and the home environment. We also hypothesize that these changes in parental practices, knowledge, perceptions, self-esteem, mental health, motivation and changes in the learning environment at home will correlate with (and contribute to, i.e., mediate) the impacts on children's outcomes described above (Hypotheses Group A).
We next describe the outcomes previously listed and how we will construct them.
Questionnaire 11 to assess the mother's perception about the amount and type of functional social support that she receives. Given the characteristics of the intervention, we seek to understand whether parenting group meetings contribute to a perception of enhanced social support, which might contribute towards maternal self-efficacy and maternal mental health.
8. Mothers attend FAMI group sessions and receive home visits: We will use self-reported data on attendance and turnover, collected from the main caregiver, and from administrative data on attendance.

H.4. Use of the Qualitative and Process Data
In addition, as part of a qualitative evaluation, we are collecting process data on a subsample of FAMIs that will help assess the quality (and fidelity) of the program, and provide insights into the extent to which the mechanisms through which the program has (or does not have) impacts actually operated. The objective of this qualitative component is to characterize the interventions in the field with a detailed tracking of activities and responsible individuals related to the components of the interventions under study. Based on direct observations, in-depth interviews and session videotaping, we hope to gain a better understanding of the ways in which the interventions are understood, adopted and used directly by MCs. In addition, interviews with beneficiary mothers are taking place to capture mothers' perceptions about the program's strengths and weaknesses. With this input we hope to be able to (1) better interpret our quantitative estimates of program impacts, (2) identify possible transmission mechanisms of the effects of interest, and (3) propose specific policy recommendations for program improvement based on the joint analysis of the quantitative and the qualitative results.

H.5. Empirical Strategy
H.5.1. Baseline Specification. Given the experimental design described in Section 2, we can identify the impacts of the treatment on outcomes using estimating equation (H.1), where $Y_{isl,1}$ is the outcome of interest for child $i$ in FAMI center $s$ in municipality $l$ at follow-up ($t=1$); $T_{1sl}$ is a dummy equal to 1 if FAMI center $s$ in municipality $l$ receives the treatment; and $Y_{isl,0}$ is the baseline ($t=0$) level of the outcome of interest (or the level of the corresponding aggregate construct when the same measure was not administered at both baseline and follow-up) for child $i$ in FAMI center $s$ in municipality $l$. For child developmental outcomes we will not have the same outcome at baseline and follow-up, since the tests could not be administered given children's ages at baseline. For these outcomes, we will control for all existing aggregate scores (constructed using factor analysis, as described above), including all developmental scores and nutritional scores. The purpose of this approach is to maximize efficiency. $X'_{isl,0}$ is a set of basic child and household characteristics, which are also added to improve efficiency (minimize residual variance) and to control for the slight imbalance in some baseline characteristics observed between groups (detailed list pending); $D_{isl,0}$ is a set of department fixed effects; $F_{isl,0}$ is a set of dummies indicating whether the alternative parenting program (modalidad familiar) is present in town $l$; $S_{isl,0}$ is a set of municipality population size dummies indicating above or below 10,000 inhabitants (all included due to our stratified randomization procedure); and $Z_{isl,1}$ is a complete set of tester or interviewer dummies. $\varepsilon_{isl,1}$ is the random error term, clustered at the municipal level $l$ (the unit of randomization).
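For reference, one way of writing equation (H.1) that is consistent with the variable definitions above is sketched below. Only the treatment coefficient $\beta_1$ is named in the text; the remaining coefficient symbols ($\beta_0$, $\beta_2$, $\beta_3$, $\delta$, $\phi$, $\sigma$, $\zeta$) are notational assumptions introduced here for illustration.

```latex
\begin{equation}
Y_{isl,1} = \beta_0 + \beta_1 T_{1sl} + \beta_2 Y_{isl,0} + X'_{isl,0}\beta_3
          + D_{isl,0}\delta + F_{isl,0}\phi + S_{isl,0}\sigma + Z_{isl,1}\zeta
          + \varepsilon_{isl,1}
\tag{H.1}
\end{equation}
```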
We can estimate equation (H.1) by OLS. $\hat{\beta}_1$ is the estimated average impact of the treatment on outcome $Y_{isl,1}$ (the intent-to-treat estimate). If compliance is not complete, in the sense that children do not attend all the program sessions they are intended to (and assuming this non-compliance is not larger than 40%), we will additionally estimate the effects of duration of exposure to treatment by instrumenting actual duration of participation in the program with the result of the random assignment (the intention-to-treat or randomized treatment variable). We will also assess the extent to which actual duration of exposure to treatment is correlated with treatment status.
We can also use equation (H.1) to estimate the impact of the treatment on intermediate outcomes (Hypotheses Group B). When the impact on mother's parenting skills and the home environment refers to the child's mother (or other caregiver), we will replace the set of basic covariates in $X$ with the mother's (or other caregiver's) basic baseline or time-invariant characteristics (e.g., age and educational attainment).
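The relationship between the intent-to-treat estimate and the instrumented exposure effect can be seen in a stripped-down sketch. It deliberately omits the baseline controls, fixed effects and clustered standard errors of equation (H.1), so it illustrates the logic only; the `exposure` variable (for instance, months of participation) and the function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def itt_and_iv(outcome, assigned, exposure):
    """Return (ITT, first stage, IV) using random assignment as the instrument.

    ITT: slope of the outcome on assignment.
    First stage: slope of actual exposure on assignment.
    IV: ITT divided by the first stage (the Wald estimator).
    """
    var_z = cov(assigned, assigned)
    itt = cov(outcome, assigned) / var_z
    first_stage = cov(exposure, assigned) / var_z
    return itt, first_stage, itt / first_stage
```

With one binary instrument and no covariates, the instrumental-variable estimate is simply the intent-to-treat effect rescaled by the average difference in exposure between assigned and non-assigned children.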

H.5.3. Dealing with Testing for Multiple Outcomes through Standardized Treatment Effects and Adjustments for Multiple Inference. For some of the developmental domains analyzed in this study, we have more than one outcome measure with which to explore treatment effects. To deal with multiple hypothesis testing we will employ two approaches.
The first approach will be to group our outcome measures into domains or "constructs" using factor analysis (following the procedure described in Section 3) and estimate equation (H.1) using the resulting factor index as the relevant dependent variable. This procedure is based on the idea that items within a domain are measuring an underlying common "construct" (or factor).
The second approach will consist of estimating each outcome (individual test) independently but adjusting p-values for multiple hypothesis testing using the step-down procedure developed by Romano and Wolf (2005), applied to each set of outcomes.
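A minimal sketch of a resampling-based step-down adjustment in the spirit of Romano and Wolf (2005) is given below. It takes as given the observed absolute t-statistics and a matrix of bootstrap absolute t-statistics that have already been centered under the null; how those bootstrap draws are generated (for example, by resampling clusters) is not shown and would need to follow the study design. The function name and input layout are assumptions made for illustration.

```python
def stepdown_pvalues(t_obs, t_boot):
    """Step-down adjusted p-values.

    t_obs:  observed |t| statistics, one per hypothesis.
    t_boot: bootstrap draws; each draw lists null-centered |t| statistics
            in the same order as t_obs.
    """
    n_hyp, n_boot = len(t_obs), len(t_boot)
    # Order hypotheses from most to least significant.
    order = sorted(range(n_hyp), key=lambda h: -abs(t_obs[h]))
    adjusted = [0.0] * n_hyp
    running_max = 0.0
    for step, h in enumerate(order):
        remaining = order[step:]  # hypotheses not yet handled in earlier steps
        exceed = sum(
            1 for draw in t_boot
            if max(abs(draw[r]) for r in remaining) >= abs(t_obs[h])
        )
        p = (exceed + 1) / (n_boot + 1)
        running_max = max(running_max, p)  # adjusted p-values never decrease
        adjusted[h] = running_max
    return adjusted
```

Comparing each statistic with the bootstrap maximum over the remaining hypotheses is what controls the familywise error rate, while shrinking the comparison set at each step makes the adjustment less conservative than a single-step correction.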
H.5.4. Survey Attrition. We acknowledge that a certain level of attrition is unavoidable. We will check that the sample of non-attriters remains balanced on baseline observables (as is the entire sample). We will also check that attrition is independent of treatment status. In the event of a significant correlation between attrition and treatment status we will estimate the determinants of attrition using a probit regression on observables and if feasible use a Heckman selection correction procedure to adjust the estimates of the main equation (H.1).
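The first check above, whether attrition is independent of treatment status, can be sketched as a simple comparison of attrition rates. The version below uses an unclustered two-sample z statistic purely for illustration; the actual test would need to account for clustering at the unit of randomization, and the probit and selection-correction steps are not shown. The function name is hypothetical.

```python
import math

def attrition_difference(attrited, treated):
    """Difference in attrition rates (treatment minus control) and an unclustered z statistic.

    attrited: 1 if the child was lost to follow-up, else 0.
    treated:  1 if assigned to treatment, else 0.
    """
    t = [a for a, g in zip(attrited, treated) if g == 1]
    c = [a for a, g in zip(attrited, treated) if g == 0]
    p_t, p_c = sum(t) / len(t), sum(c) / len(c)
    se = math.sqrt(p_t * (1 - p_t) / len(t) + p_c * (1 - p_c) / len(c))
    z = (p_t - p_c) / se if se > 0 else float("nan")
    return p_t - p_c, z
```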
H.5.5. Procedures for Addressing Missing Data. We will not impute the values for any dependent variable (final or intermediate outcomes) at follow-up. Regarding missing data on covariates, $Y_{isl,0}$ and $X'_{isl,0}$, we will check whether item non-response is correlated with treatment status. If it is not correlated, we will impute the missing covariate value with the average of the non-missing observations and this imputation will be accounted for with a dummy variable (we will check the robustness of our results by also estimating the regression without that covariate). If non-response in the baseline covariate is correlated with treatment status, we will not use that covariate when estimating the regressions. In cases in which the percentage of observations with covariate missing data is less than 2%, we will simply work with the sample with non-missing data.
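The mean-imputation step with an accompanying missingness dummy can be sketched as follows. Missing values are represented by `None`, and the function name is hypothetical; the checks on correlation with treatment status and the 2% rule are assumed to have been applied beforehand.

```python
def impute_with_flag(values):
    """Replace missing covariate values with the observed mean.

    Returns (filled, missing_dummy): the completed covariate and a 0/1
    indicator that enters the regression alongside it.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("covariate has no observed values")
    fill = sum(observed) / len(observed)
    filled = [fill if v is None else v for v in values]
    missing_dummy = [1 if v is None else 0 for v in values]
    return filled, missing_dummy
```

Including the dummy alongside the filled covariate means that the imputed observations do not influence the estimated coefficient on the covariate itself.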
H.5.6. Questions with Limited Variation. We will not use as dependent or independent variables any indicator variable that has a prevalence rate below 10% or above 90%, in order to limit noise caused by variables with minimal variation.
In the event that omission decisions result in the exclusion of all constituent variables (or for as many as indicated in the test manual) for an indicator, the indicator will not be calculated.
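The prevalence screen described in this subsection amounts to a simple filter on 0/1 indicators, sketched below. The function name is hypothetical, and the treatment of indicators exactly at 10% or 90% (retained here) follows a literal reading of "below 10% or above 90%".

```python
def has_enough_variation(indicator, low=0.10, high=0.90):
    """True if the share of ones in a 0/1 indicator lies within [low, high]."""
    prevalence = sum(indicator) / len(indicator)
    return low <= prevalence <= high
```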
H.5.7. Treatment of Outliers. We will drop children with developmental outcomes or nutritional status with standardized values lower than 3 standard deviations below the mean (<-3SD) of the relevant standardized distribution, since we consider this to be an indication of potential disability (for developmental outcomes), severe malnutrition (for nutritional status) or significant measurement error.
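As a small illustration of this rule, the sketch below keeps only observations whose standardized value is not lower than -3. It assumes the z-scores have already been computed against the relevant distribution; the function name is hypothetical.

```python
def drop_low_outliers(z_scores, cutoff=-3.0):
    """Return (index, z) pairs for observations with z-score not below the cutoff."""
    return [(i, z) for i, z in enumerate(z_scores) if z >= cutoff]
```

The rule is one-sided: unusually high values are retained, and only values below the cutoff are dropped.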