Abstract

The goal of this study was to determine the overall effects of pronunciation instruction (PI) as well as the sources and extent of variance in observed effects. Toward this end, a comprehensive search for primary studies was conducted, yielding 86 unique reports testing the effects of PI. Each study was then coded on substantive and methodological features as well as study outcomes (Cohen’s d). Aggregated results showed a generally large effect for PI (d = 0.89 and 0.80 for N-weighted within- and between-group contrasts, respectively). In addition, moderator analyses revealed larger effects for (i) longer interventions, (ii) treatments providing feedback, and (iii) more controlled outcome measures. We interpret these and other results with respect to their practical and pedagogical relevance. We also discuss the findings in relation to instructed second language acquisition research generally and in comparison with other reviews of PI (e.g. Saito 2012). Our conclusion points out areas of PI research in need of further empirical attention and methodological refinement.

INTRODUCTION

The effectiveness of second language pronunciation instruction: a meta-analysis

Pronunciation instruction (PI) is one of several areas in the domain of instructed second language acquisition (SLA) that carries significant potential to inform both theory and practice. It is not surprising, therefore, that research on the effects of PI has been extensive, despite frequent commentary claiming the contrary (e.g. Derwing and Munro 2005). This line of research has examined PI across many learners and contexts (e.g. various target languages and proficiency levels), pedagogical approaches (with vs. without feedback), linguistic features (e.g. segmentals vs. suprasegmentals), and outcome types (i.e. constrained vs. guided vs. open-ended). Findings from studies in this area are summarized regularly in review articles and handbooks (e.g. Saito 2012). However, given the qualitative and non-comprehensive nature of such reviews, it is difficult to ascertain with certainty and precision the overall effects of PI. It is even more difficult, if not impossible, to determine with any precision the extent to which different factors may moderate the effects of PI, much less interpret their implications for pedagogy. As Darcy et al. (2012) have stated, ‘there is no agreed upon system of deciding what [pronunciation features] to teach, and when and how to do it’. And unlike all other linguistic foci targeted by second language (L2) instruction (grammar, vocabulary, pragmatics), quantitative results from this body of research have yet to be synthesized via meta-analysis, hence pointing to the need for this study as a means to inform our understanding of PI as well as for more general development of L2 theory and practice.

The literature review that follows is broken into two main parts: first, we provide a brief outline of research on the effectiveness of L2 instruction as demonstrated across other L2 features, highlighting the findings of previous meta-analyses of these areas as they relate to and inform the present study. We then move on to the focus of this study, PI, and a description of its theoretical and practical importance. We also provide an overview of empirical investigations in this area organized around different contexts, treatments, and outcomes that vary in this body of research and that are suggested to moderate the effects of PI.

LITERATURE REVIEW

Pronunciation and instructed SLA

The effectiveness of L2 instruction has been the object of extensive empirical investigation in the field of SLA. Researchers have examined the effects of instruction on a wide range of L2 features and skills including grammar/morphosyntax, vocabulary, pragmatics, and the focus of this synthesis, pronunciation (e.g. Derwing and Munro 2005; Saito 2012). Early research in these areas was often concerned with the rather broad question of whether or not (explicit) instruction led to L2 development, compared with input-only or meaning-based approaches such as those advocated by Krashen (1982), VanPatten (2002), and others. (For current reviews, see Shintani 2015 and Shintani et al. 2013.) Extensive research since the 1980s, and Norris and Ortega’s (2000) meta-analysis in particular, however, has largely put this debate to rest. Empirical efforts have since turned to the generalizability of instructional effects. That is, studies have looked at instructed L2 acquisition as a function of different learner backgrounds and contexts (e.g. second vs. foreign language setting), different types of linguistic features (e.g. simple vs. complex), and different types of instruction (e.g. explicit vs. implicit), among other variables.

With the exception of pronunciation, these same subdomains and the questions they address have been meta-analyzed, often multiple times. Figure 1 presents summary effects from these studies, organized according to the linguistic target of instruction: grammar, vocabulary, and pragmatics. Several results across meta-analyses of L2 instruction are worth noting. First, there is clearly substantial variability in observed effects both across and within different subdomains of instructed SLA; meta-analytic d values range from one-third of a standard deviation (implicit instruction on simple grammar forms; Spada and Tomita 2010) to a difference between control and experimental groups of nearly two standard deviations [as found in Shintani et al.’s (2013) results for receptively measured effects of comprehension-based grammar instruction]. This set of results also indicates the extent to which different subdomains of instructed SLA research have been summarized via meta-analysis. Whereas the effects of instruction on L2 grammar and vocabulary have been documented fairly well at the meta-analytic level, this is not the case for pragmatics instruction. No study to date has meta-analyzed the effects of PI, the focus of the present study. This gap in the literature limits our understanding of the merit of PI in L2 classes as well as of instruction in this area relative to other target features and subdomains such as those shown in Figure 1.

Figure 1: Overall/meta-analytic effects of instructed SLA across subdomains

Our interest in this study, however, is not solely in the overall extent to which this body of research has led to improvements in L2 pronunciation. As with other domains of L2 instruction, the effects of PI are likely to vary as a function of different substantive and methodological features. In the remainder of the literature review, we therefore describe different contexts, treatments, and outcomes with respect to their potential role in moderating the effects of PI, highlighting relevant studies when appropriate.

The role of contextual and learner factors in PI

As shown in numerous studies and meta-analyses in other SLA subdomains, instructional context and learner background can greatly influence the impact of a pedagogical intervention (e.g. Plonsky and Oswald 2014). Such variables might include participants’ age, proficiency level, type of educational institution, second vs. foreign language environment, and whether the study is carried out in a laboratory or classroom setting.

Given evidence in favor of a critical period for phonological development (e.g. Flege et al. 1999), the role of age may be particularly strong in the case of pronunciation. Specifically, although PI is more often tested with adult learners, we might predict larger effects for studies involving children (e.g. Trofimovich et al. 2009; Tsiartsioni 2010). We might also expect to find larger effects in laboratory-based (as opposed to classroom-based) studies owing to increased experimental control in the former. Li (2010), for example, found that the average effect of corrective feedback in laboratory-based studies (d = 1.08) was more than twice that of classroom-based studies (d = 0.50). Likewise, PI may lead to larger effects in second-language settings than foreign language settings owing to the value attributed to speaking and sounding native-like in the former (Derwing 2003; but cf. Tokumoto and Shibata 2011). Finally, the effectiveness of PI may also be related to the proficiency of the participants. Derwing and Munro (2005), for example, argue that instruction yields more rapid improvement in lower-level learners. More advanced learners who possess foundational knowledge of pronunciation as well as other skills, however, may be able to integrate and adapt their pronunciation more readily.

Treatment and target features

One of the most critical considerations with respect to designing interventions that seek to improve L2 pronunciation is the type of feature(s) to target. A great deal of discussion in this area surrounds the relative effectiveness of instruction on segmental vs. suprasegmental features. Levis (2005) and Saito (2014) have suggested that segmental phonology may be easier for teachers to teach and for learners to learn. Others (e.g. Hahn 2004), however, claim instruction on suprasegmental features to be more effective. The importance of instruction on suprasegmentals is also underscored by their impact on comprehensibility and accentedness (e.g. Kang 2010; Isaacs and Trofimovich 2012).

Very few empirical investigations have addressed the relative effectiveness of PI on these two feature types. In Derwing et al. (1998), one group received instruction on segmental features (e.g. individual sound contrasts) and another on suprasegmental features (rhythm, intonation, and stress). In comparison with a control group that received no pronunciation-specific instruction, both groups improved on perceived accentedness and comprehensibility as measured on a read-aloud task. However, only the suprasegmental group showed improvement on a less-controlled, picture description task (see discussion of outcome types in the following section). The results of Gordon and Darcy (2012) and Yates (2003) are even clearer, showing PI on suprasegmentals to have almost twice the effect of PI on segmentals. However, using a ‘vote-count’ approach to synthesizing research on PI, Saito (2012) found that studies providing instruction on segmental and suprasegmental features both generally lead to gains.

In addition to different linguistic foci, several other treatment features may also be related to the effects of PI. For instance, studies in this area have often included a technological component. Researchers often use programs/software such as Anvil or visual input such as spectrograms to provide stimuli, feedback, minimal pair practice, and so forth. In some cases, technology has been used to complement teacher- or researcher-delivered instruction (e.g. Lord 2008); in others, a computer program is the sole provider of instruction (Hardison 2005).

Often occurring in conjunction with technology in the form of adaptive instruction is feedback. Although many studies have included feedback as part of a treatment, the study by Saito and Lyster (2012) is perhaps the only one to have done so in a way that allows the effects of feedback to be measured directly. Their findings show an advantage for a treatment consisting of form-focused instruction plus feedback (recasts) over form-focused instruction alone on both controlled and free response outcome measures.

The length of a treatment may also be related to its effectiveness. This feature is not unique to studies of PI. In fact, several meta-analyses have investigated summary effects as a function of treatment duration (e.g. pragmatics: Jeon and Kaya 2006; strategy instruction: Plonsky 2011). As we might expect, these studies generally find longer treatments to produce stronger effects. Plonsky and Oswald (2014), however, warn that a strong correlation between treatment length and effect size may put into question the practicality of such interventions. In other words, instructional costs (time and energy) must be weighed against their potential benefits for L2 learners.

Outcome measures of PI

The literature review has thus far focused on independent variables of PI research. In this section, we discuss different types of outcome measures with respect to their potential to moderate the effects of PI. One feature that may impact the results of PI is the extent to which items are controlled (i.e. requiring a fixed response from all participants) vs. ‘free’ (i.e. productive measures that are open-ended, allowing for a variety of different responses). PI researchers may prefer more controlled or shorter items as a means to ensure that participants produce the target feature(s). Such items, however, may not accurately represent learners’ ability in carrying out more authentic, real-world tasks (see Saito and Lyster 2012). Furthermore, the artificiality and lack of communicative value in controlled and/or word-length tasks may allow learners to focus more on their pronunciation, thus leading to larger effects (see Elliot 1997; Saito 2012). (See Norris and Ortega 2000 and Spada and Tomita 2010 for discussion and evidence of this phenomenon in the context of grammar instruction.)

Another outcome feature that may be related to treatment effects is rater background, an issue subject to extensive investigation in the literature on L2 pronunciation assessment (e.g. Kang 2012). It has been suggested, and empirical evidence indicates, that rater characteristics such as native language, experience working with nonnative speakers, and knowledge of the target language may affect raters’ evaluations of L2 pronunciation (e.g. Isaacs and Thomson 2013). These factors tend to play a more prominent role in ratings of pronunciation compared with assessments of grammar and vocabulary.

Research questions

In order to better understand the overall effects of PI and to explain potential moderators of those effects, the present study addressed the following research questions:

  1. What is the overall effectiveness of instruction on L2 pronunciation?

  2. What is the relationship between PI and different contexts, treatment types, and outcome measures?

METHODS

Study identification

Before searching for studies that might help us answer our research questions, we defined a set of inclusion/exclusion criteria. In order to be included, a study had to (i) report the findings of an experiment or quasi-experiment in which L2 learners were provided with instruction on one or more aspects of pronunciation; (ii) present quantitative results of the study; and (iii) demonstrate the effects of PI using a pre–post (within groups) and/or control/comparison-experimental (between groups) design.

Having determined the parameters of our search, we set out to locate relevant primary studies. In doing so, we employed a wide and diverse set of techniques, accepting redundancy in exchange for comprehensiveness (see Plonsky and Brown 2015). First, using combinations of keywords (second language, foreign language, pronunciation, and instruction), we searched library-housed databases including Educational Resources Information Center, Linguistics and Language Behavior Abstracts, PsycINFO, PsycArticles, Web of Science, and ProQuest Dissertations and Theses as well as two nonlibrary databases: Google and Google Scholar. We conducted ancestry searches by examining the references of previous reviews (e.g. Saito 2012) and all candidate studies. We contacted authors and gratefully received manuscripts from two individual authors (Veronica Sardegna and Isabelle Darcy). We manually searched all four available Proceedings of the Pronunciation in Second Language Learning and Teaching Conference. Finally, we consulted previously generated bibliographies of PI, and we examined the professional web pages of researchers known for their work in this area (Tracey Derwing, Kazuya Saito, and John Levis).

Our search revealed a number of studies that appeared to meet these criteria but were excluded for one or more of the following reasons: (i) outcomes other than pronunciation (e.g. attitudes and overall proficiency) were assessed following PI (e.g. Miller 2013); (ii) duplicate data were presented in a different report such as a dissertation (e.g. Ingels 2010); (iii) L2 pronunciation development was assessed over time but without a treatment (Derwing et al. 2006); (iv) only qualitative outcomes were provided (Lee 2008); (v) the design included neither a pretest nor a control/comparison group (Miller 2012); (vi) the effects of PI were based on a single participant, thus preventing the calculation of an effect size (Bertram 2008). And finally, numerous studies (25 studies or 29 per cent of the total sample; see below) were excluded owing to missing data, usually standard deviations. Requests for missing/unreported data were sent via email to 16 authors: 5 provided the data, 4 responded that they could not find the data, and 7 never responded to the request. (For a discussion on data sharing and transparency, see Plonsky et al. in press.)

Our search led to 86 study reports that were included in the final analysis (see Supplementary Information for references of included studies). Although the majority of the studies were journal articles (k = 45, 52 per cent), a number of dissertations/theses and articles in conference proceedings were included as well (k = 19, 22 per cent and k = 15, 19 per cent, respectively). A small number of studies were also found in book chapters (3), conference presentations (PowerPoints; 2), and unpublished manuscripts (1). Unfortunately, the norm in L2 meta-analyses has not been to take such an inclusive approach, leaving synthetic samples much more susceptible to the inflating effects of publication bias (Plonsky and Brown 2014).

Within these 86 reports, effect sizes were extracted based on 110 within-group (pre–post) and 60 between-group (control–experimental) samples. This sample is substantially larger than that of any of the meta-analyses of L2 instruction mentioned above (Figure 1). The total N for all studies was 2,782, consisting of 777 control participants (median = 12) and 2,005 experimental participants (median = 14).

Although the studies in this sample date back over 32 years, most are fairly recent. The sample includes three studies from the years 1982 to 1989, seven from 1990 to 1997, sixteen from 1998 to 2005, and fifty-nine from 2006 to 2013 (including one paper that is ‘in press’). (The date of one additional study could not be identified.) Interest in the effectiveness of PI is clearly strong and increasing, despite frequent claims to the contrary (e.g. Derwing and Munro 2005).

Coding

Each study was coded for substantive and methodological features as well as effect sizes (Cohen’s d) in order to answer our two research questions. In particular, our coding scheme was designed to extract data related to study contexts, treatment types, and outcome variables (see Supplementary Table 1). In order to ensure interrater reliability, the second and third authors both coded the entire sample, as recommended by Plonsky and Oswald (2012). Disagreements were discussed and resolved, and operational definitions were adjusted when necessary.
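To make the effect size metric concrete, the following minimal sketch computes Cohen’s d from summary statistics of the kind a primary study would report. The article does not state which formulas were used, so the pooled standard deviation for between-group contrasts and the averaged-variance denominator for pre–post contrasts are assumptions, as are the function names and the invented input values.

```python
import math

def cohens_d_between(mean_exp, sd_exp, n_exp, mean_ctrl, sd_ctrl, n_ctrl):
    """Between-group d: mean difference divided by the pooled SD (assumed formula)."""
    pooled_var = ((n_exp - 1) * sd_exp**2 + (n_ctrl - 1) * sd_ctrl**2) / (n_exp + n_ctrl - 2)
    return (mean_exp - mean_ctrl) / math.sqrt(pooled_var)

def cohens_d_within(mean_pre, sd_pre, mean_post, sd_post):
    """Pre-post d: gain divided by the root of the averaged variances.
    This assumed variant ignores the pre-post correlation."""
    sd_avg = math.sqrt((sd_pre**2 + sd_post**2) / 2)
    return (mean_post - mean_pre) / sd_avg

# Hypothetical summary statistics, invented for illustration only.
d_between = cohens_d_between(75.0, 10.0, 14, 67.0, 10.0, 12)
d_within = cohens_d_within(60.0, 10.0, 69.0, 10.0)
print(round(d_between, 2), round(d_within, 2))
```

A study missing standard deviations cannot be passed through either function, which is why such studies had to be excluded.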

Analysis

Both research questions involved calculating descriptive statistics based on effect sizes. Multiple effects or outcomes derived from a single sample were combined (averaged). Each sample’s effects were kept separate to preserve differences between multiple treatment groups. Two outliers based on within-groups contrasts from a single report were identified (d = 8.80, 8.55) and excluded from further analysis. In order to inspect the data for additional irregularities and/or evidence of publication bias, a funnel plot (i.e. a scatterplot of effect sizes on the x-axis and sample sizes on the y-axis) was created and examined.
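The averaging step can be sketched as follows: effects are grouped by sample, so that multiple outcomes from one group collapse into a single value while separate treatment groups within a report stay distinct. The record layout, report labels, and d values are hypothetical and serve only to illustrate the procedure.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coded effects: (report, sample, outcome measure, d).
coded_effects = [
    ("Report A", "group 1", "read-aloud", 1.10),
    ("Report A", "group 1", "picture description", 0.70),
    ("Report A", "group 2", "read-aloud", 0.40),
    ("Report B", "group 1", "read-aloud", 0.95),
]

def average_within_samples(effects):
    """Average all d values that come from the same (report, sample) pair."""
    by_sample = defaultdict(list)
    for report, sample, _outcome, d in effects:
        by_sample[(report, sample)].append(d)
    return {key: mean(ds) for key, ds in by_sample.items()}

sample_effects = average_within_samples(coded_effects)
for key, d in sample_effects.items():
    print(key, round(d, 2))
```

Here the two outcomes for group 1 of the hypothetical Report A are merged, whereas group 2 of the same report contributes its own effect.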

Research Question 1 (overall effectiveness) was then addressed by calculating sample-size-weighted descriptive statistics for the d values from the entire sample. Effects resulting from within-groups designs often appear larger than those for between-groups because participants in the former serve as their own control, thus reducing error variance and inflating the observed effect (Plonsky and Oswald 2014). This difference can be corrected based on pre–post correlations, but such data were not reported in any of the studies in the sample. Therefore, effects from pre–post contrasts and between-groups contrasts were analyzed separately.
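A sample-size-weighted summary of this kind can be sketched as below. The weighted mean follows directly from the description above; the standard error and confidence interval formulas are one common convention and are an assumption here, since the article does not specify how they were computed. The input values are invented.

```python
import math

def weighted_summary(effects):
    """effects: list of (d, n) pairs, one per sample.
    Returns (weighted mean d, SE, CI lower, CI upper).
    SE is taken as sqrt(weighted variance / k), an assumed convention."""
    total_n = sum(n for _, n in effects)
    mean_d = sum(d * n for d, n in effects) / total_n
    var_d = sum(n * (d - mean_d) ** 2 for d, n in effects) / total_n
    se = math.sqrt(var_d / len(effects))
    return mean_d, se, mean_d - 1.96 * se, mean_d + 1.96 * se

# Hypothetical (d, N) pairs, invented for illustration only.
effects = [(0.5, 10), (1.0, 30), (1.5, 10)]
mean_d, se, ci_lower, ci_upper = weighted_summary(effects)
print(round(mean_d, 2), round(se, 2), round(ci_lower, 2), round(ci_upper, 2))
```

Running the function separately on the pre–post and the between-group samples mirrors the decision to keep the two designs apart.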

Research Question 2 addressed variability in observed effects as a function of variables suggested to moderate the effectiveness of PI. Study features (i.e. different contexts, treatments, and outcomes) were therefore treated as independent variables and used to group and calculate summary effects which were, again, weighted by sample size. Although between-group contrasts arguably provide a more theoretically and statistically accurate depiction, the available sample of pre–post effects was much larger and therefore more reliable/robust: K = 110 vs. 60. From a practical standpoint, pre–post effects also provide insight regarding what might be expected for PI as implemented in real classrooms. This phase of the analysis is therefore based on both within- and between-group contrasts, with an emphasis on the former.
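The subgroup logic described here can be sketched by grouping samples on a coded study feature and computing a sample-size-weighted mean d per level. The dictionary keys, the feature name, and all values are hypothetical; only the grouping-then-weighting procedure reflects the text.

```python
from collections import defaultdict

# Hypothetical coded samples, invented for illustration only.
samples = [
    {"d": 0.5, "n": 10, "length": "short"},
    {"d": 0.7, "n": 30, "length": "short"},
    {"d": 1.2, "n": 20, "length": "long"},
    {"d": 1.4, "n": 20, "length": "long"},
]

def weighted_means_by(samples, feature):
    """Return {level: {"k": number of samples, "mean_d": N-weighted mean d}}."""
    acc = defaultdict(lambda: [0.0, 0, 0])  # [sum of n*d, sum of n, k]
    for s in samples:
        entry = acc[s[feature]]
        entry[0] += s["d"] * s["n"]
        entry[1] += s["n"]
        entry[2] += 1
    return {level: {"k": k, "mean_d": nd / n} for level, (nd, n, k) in acc.items()}

by_length = weighted_means_by(samples, "length")
for level, stats in by_length.items():
    print(level, stats["k"], round(stats["mean_d"], 2))
```

The same function could be called once per moderator (setting, proficiency, feedback, and so on), each time on either the within-group or the between-group sample.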

RESULTS

As described in the Methods, we created funnel plots to inspect the data for the presence of publication bias or other irregularities. We see in Figures 2 and 3, first of all, substantial variability in effect sizes, with greater spread at the bottom of the figure where samples are smaller and sampling error is higher. We also see that these effects are not spread equally on both sides of the mean effect, particularly in the case of pre–post effects (Figure 2; unweighted d = 0.83). Rather, larger effects (to the right of the mean) display much wider variability than smaller ones (left of the mean). This difference in spread is indicative of a bias toward statistically significant results. In the absence of bias (i.e. when observing a sample representative of the population of effects), we would expect to see relative symmetry on both sides of the mean.
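As a rough numeric companion to visual inspection, the sketch below compares how far effects extend to the left and to the right of the unweighted mean. This range-based check is not the procedure used in the study, which relied on inspecting the plots; it is only an assumed, crude illustration of what asymmetry means, run on invented d values.

```python
from statistics import mean

def side_ranges(effect_sizes):
    """Return (left_range, right_range): the distance from the unweighted mean
    to the smallest effect and to the largest effect, respectively."""
    centre = mean(effect_sizes)
    return centre - min(effect_sizes), max(effect_sizes) - centre

# Hypothetical d values, invented for illustration only.
d_values = [0.2, 0.5, 0.6, 0.7, 0.8, 1.0, 1.6, 2.4]
left_range, right_range = side_ranges(d_values)
right_skewed = right_range > left_range
print(round(left_range, 3), round(right_range, 3), right_skewed)
```

A right-hand range much wider than the left-hand one corresponds to the pattern described for the pre–post effects.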

Figure 2: Funnel plot of effect sizes (d; x-axis) and sample sizes (N; y-axis) for within-group contrasts

Figure 3: Funnel plot of effect sizes (d; x-axis) and sample sizes (N; y-axis) for between-group contrasts

The results for Research Question 1, which addressed the overall effects of PI, are found in Table 1. Summary results from both within- and between-group designs (weighted by sample size) show that PI is indeed effective. Observed effects on average and across the 95 per cent confidence intervals demonstrate a medium-to-large and statistically significant effect (Plonsky and Oswald 2014).

Table 1: Overall results for the effectiveness of L2 pronunciation instruction

Contrast/design    K      M (d)   SE     95 per cent CIs
                                         Lower   Upper
Within-group       110    0.89    0.02   0.85    0.94
Between-group      60     0.80    0.02   0.77    0.81

Whereas Research Question 1 was concerned with the overall/summary effects of PI, the focus of Research Question 2 was on potential moderators of those effects. In other words, this phase of the analysis sought to examine variability across the sample as a function of different (i) contexts, (ii) treatments (including targeted linguistic features), and (iii) outcome types found in studies of PI.

Tables 2 and 3 present the results of moderator (or subgroup) analyses for contextual variables. Confidence intervals for numerous subgroups do not overlap, indicating that differences between their effects are statistically significant. Moreover, several patterns among within-group contrasts are worth noting, such as larger effects in second-language settings, in high schools, and in studies with both beginner and advanced learners (as opposed to intermediate ones). In contrast to what we might expect, laboratory-based studies produced smaller effects than those carried out in classrooms. However, the opposite pattern was found in between-groups contrasts (d = 0.79 in classrooms vs. 0.95 in laboratories; see Table 3).
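The overlap criterion can be illustrated with a short check on intervals taken from Table 2 (within-group contrasts). The helper name is hypothetical. Note that non-overlapping 95 per cent intervals indicate a significant difference, whereas overlapping intervals do not by themselves establish the absence of one.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (lower, upper) and b = (lower, upper) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# 95 per cent CIs copied from Table 2 (within-group contrasts).
second_language = (0.94, 1.08)
foreign_language = (0.78, 0.89)
beginner = (1.15, 1.40)
advanced = (1.07, 1.32)

setting_overlap = intervals_overlap(second_language, foreign_language)
proficiency_overlap = intervals_overlap(beginner, advanced)
print(setting_overlap, proficiency_overlap)
```

Under this check the second- and foreign-language intervals are disjoint, while the beginner and advanced intervals overlap.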

Table 2: Moderator analyses across contexts (within-group contrasts)

Grouping variables and values   k^a   M (d)   SE     95 per cent CIs
                                                     Lower   Upper
Setting
    Second language             39    1.01    0.04   0.94    1.08
    Foreign language            71    0.83    0.03   0.78    0.89
Institution
    High school                 10    1.42    0.07   1.30    1.56
    University                  86    0.83    0.02   0.78    0.87
    Language institute          10    0.66    0.05   0.56    0.76
Context
    Classroom                   71    0.95    0.03   0.89    1.01
    Laboratory                  32    0.84    0.03   0.78    0.89
Proficiency
    Beginner                    24    1.27    0.06   1.15    1.40
    Intermediate                42    0.55    0.02   0.51    0.58
    Advanced                    16    1.19    0.06   1.07    1.32

^a Number of samples in subgroups. In a small number of cases (e.g. elementary schools), subgroup results in this and the following tables were excluded due to very small cell sizes.

Table 3: Moderator analyses across contexts (between-group contrasts)

Grouping variables and values   k^a   M (d)   SE     95 per cent CIs
                                                     Lower   Upper
Setting
    Second language             16    0.35    0.02   0.31    0.38
    Foreign language            44    0.98    0.02   0.94    1.03
Institution
    High school                       1.19    0.09   1.01    1.37
    University                  46    0.77    0.02   0.73    0.81
    Language institute                1.09    0.03   1.03    1.15
Context
    Classroom                   38    0.79    0.02   0.75    0.84
    Laboratory                  17    0.95    0.03   0.89    1.00
Proficiency
    Beginner                    18    0.97    0.04   0.90    1.04
    Intermediate                17    0.80    0.02   0.76    0.85
    Advanced                          −0.01   0.02   −0.06   0.03

A number of treatment-related variables were also examined for their potential to moderate the effects of PI (see Tables 4 and 5). Longer interventions were found to produce substantially larger effects than shorter ones in both within- and between-group contrasts. Treatments that included feedback also outperformed those without, particularly in between-group designs. Computer-provided treatments and those otherwise involving technology such as spectrograms, however, both yielded smaller effects than those provided by a teacher or a teacher–researcher and those without the use of technology, respectively, in both within- and between-group designs. Effects across targeted linguistic features appear relatively homogeneous. In within- and between-group contrasts, word stress, sentence stress, rhythm, and PI on both segmentals and suprasegmentals (compared with segmentals or suprasegmentals on their own) present exceptions wherein somewhat larger effects were produced.

Table 4:

Moderator analyses across treatment types (within-group contrasts)

Grouping variables and values k M ( d )   SE  95 per cent CIs
 
Lower
 
Upper
 
Length      
    Short (≤4.25 h) 44 0.62 0.02 0.57 0.67 
    Long (>4.25 h) 40 1.32 0.05 1.23 1.41 
Treatment provider      
    Teacher 53 0.85 0.03 0.79 0.91 
    Researcher 0.43 0.05 0.33 0.53 
    Teacher–researcher 20 1.35 0.07 1.21 1.50 
    Computer 29 0.75 0.03 0.70 0.81 
Target features      
    Vowels 53 0.91 0.03 0.82 0.91 
    Consonants 62 1.04 0.03 0.98 1.11 
    Stress (word) 22 1.09 0.05 0.99 1.18 
    Stress (sentence) 19 0.96 0.05 0.87 1.05 
    Intonation 34 0.86 0.03 0.80 0.91 
    Segmentals 78 0.89 0.03 0.84 0.95 
    Suprasegmentals 43 1.03 0.04 0.95 1.11 
    Segmentals + suprasegmentals 22 1.00 0.06 0.88 1.11 
    Rhythm 13 0.98 0.06 0.86 1.10 
Use of technology      
    No 69 0.96 0.03 0.90 1.02 
    Yes 38 0.76 0.03 0.71 0.81 
Feedback      
    No 26 0.89 0.06 0.78 1.01 
    Yes 80 0.92 0.02 0.87 0.96 
Table 5: Moderator analyses across treatment types (between-group contrasts)

Grouping variables and values  k  M (d)  SE  95 per cent CI (lower)  95 per cent CI (upper)
Length      
    Short (≤4.25 h) 25 0.73 0.02 0.68 0.77 
    Long (>4.25 h) 22 0.95 0.03 0.89 1.01 
Treatment provider      
    Teacher 35 0.89 0.02 0.85 0.94 
    Researcher 0.86 0.02 0.83 0.90 
    Teacher-researcher 0.94 0.06 0.81 1.06 
    Computer 12 0.24 0.02 0.19 0.28 
Target features      
    Vowels 27 0.99 0.02 0.95 1.04 
    Consonants 37 0.79 0.02 0.75 0.84 
    Stress (word) 13 1.01 0.05 0.91 1.12 
    Stress (sentence) 10 1.39 0.07 1.26 1.53 
    Intonation 15 0.38 0.05 0.73 0.94 
    Segmentals 46 0.87 0.02 0.84 0.93 
    Suprasegmentals 24 1.05 0.03 0.99 1.11 
    Segmentals + suprasegmentals 13 1.28 0.04 1.20 1.36 
    Rhythm 1.65 0.06 1.53 1.77 
Use of technology      
    No 44 0.87 0.02 0.83 0.91 
    Yes 16 0.53 0.04 0.46 0.60 
Feedback      
    No 13 0.62 0.02 0.57 0.66 
    Yes 44 0.91 0.02 0.86 0.96 
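
One informal way to read the Feedback rows of Tables 4 and 5 is to ask whether the 95 per cent confidence intervals of two moderator levels overlap. The sketch below does this in Python with the values reported in those rows. The names `Estimate` and `intervals_overlap` are illustrative, and interval overlap is only a rough heuristic adopted here as an assumption; it is not a formal test and not necessarily the procedure used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A mean effect size (d) with the bounds of its 95 per cent CI."""
    label: str
    mean_d: float
    ci_lower: float
    ci_upper: float


def intervals_overlap(a: Estimate, b: Estimate) -> bool:
    """Return True when the two confidence intervals share any value."""
    return a.ci_lower <= b.ci_upper and b.ci_lower <= a.ci_upper


# Feedback rows copied from Table 4 (within-group contrasts).
within_no = Estimate("within, no feedback", 0.89, 0.78, 1.01)
within_yes = Estimate("within, feedback", 0.92, 0.87, 0.96)

# Feedback rows copied from Table 5 (between-group contrasts).
between_no = Estimate("between, no feedback", 0.62, 0.57, 0.66)
between_yes = Estimate("between, feedback", 0.91, 0.86, 0.96)

within_overlap = intervals_overlap(within_no, within_yes)
between_overlap = intervals_overlap(between_no, between_yes)

print("Within-group intervals overlap:", within_overlap)
print("Between-group intervals overlap:", between_overlap)
```

Under this heuristic the within-group intervals overlap whereas the between-group intervals do not, which is in line with the observation that the advantage for feedback is clearer in between-group designs.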

The third and final set of variables we examined as potential moderators of the effects of PI comprised different types of outcome measures (Tables 6 and 7). We first compared effects resulting from outcomes that involved free vs. controlled production, the latter of which yielded much larger effects in both within- and between-group contrasts. Controlled production was also by far the preferred outcome type for both within- and between-group designs (k = 75 of 110 samples and 44 of 60). The patterns for rater effects and different item lengths in outcome measures are also strong. In within-group contrasts, effects for posttreatment production rated by native speakers of the target language were approximately twice as large as those for production rated by nonnative speakers. With respect to item length, the results for within- and between-group contrasts were quite different. Whereas effects in the former increased along with item length, the opposite pattern was observed in the latter (i.e. words > sentences > discourse).

Table 6: Moderator analyses across outcome types (within-group contrasts)

Grouping variables and values  k  M (d)  SE  95 per cent CI (lower)  95 per cent CI (upper)
Outcome type      
    Controlled 75 0.96 0.03 0.90 1.02 
    Free 18 0.65 0.03 0.59 0.71 
    Both 16 0.86 0.05 0.77 0.95 
Rater      
    Nonnative speaker(s) 10 0.44 0.03 0.38 0.49 
    Native speaker(s) 73 0.93 0.03 0.87 0.98 
Outcome item length      
    Words 30 0.62 0.04 0.55 0.70 
    Sentences 22 0.92 0.04 0.84 1.01 
    Discourse 27 1.23 0.05 1.13 1.34 
    Multiple 27 0.79 0.03 0.68 0.78 
Table 7: Moderator analyses across outcome types (between-group contrasts)

Grouping variables and values  k  M (d)  SE  95 per cent CI (lower)  95 per cent CI (upper)
Outcome type      
    Controlled 44 0.96 0.03 0.89 1.00 
    Free 0.37 0.04 0.30 0.44 
    Multiple 10 0.61 0.02 0.57 0.66 
Rater      
    Nonnative speaker(s) 0.86 0.06 0.74 0.97 
    Native speaker(s) 39 0.70 0.02 0.66 0.74 
Outcome item length      
    Words 14 1.16 0.04 1.08 1.25 
    Sentences 18 0.87 0.04 0.80 0.95 
    Discourse 10 0.23 0.03 0.18 0.29 
    Multiple 17 0.68 0.02 0.65 0.71 

DISCUSSION

This study examined the overall effects of PI and potential moderators of those effects. The discussion that follows summarizes and interprets the results, contextualizing the findings with respect to the domain in question as well as more generally within instructed SLA. We also take advantage of the meta-analytic data set to critique and suggest methodological improvements in PI research.

In terms of overall effects of PI, the (weighted) within-group results showed that the learners who received instructional treatments improved by 0.89 standard deviation units in comparison with their pretreatment performance; the between-group analyses demonstrated that learners in experimental groups outperformed those in control groups by 0.80 standard deviation units. Because the confidence intervals of the effect sizes do not cross zero, both can be said to be statistically reliable. Nevertheless, individual effects across the sample vary (from −0.36 to 3.98 and −0.66 to 3.12 for within- and between-group contrasts, respectively). It is also interesting to note that the range in observed effects has increased over time. The ranges of effects for 1982 to 1989, 1990 to 1997, 1998 to 2005, and 2006 to 2013 were 0.93, 1.83, 1.90, and 4.34 for within-group contrasts and 0.66, 0.77, 1.67, and 3.78 for between-group contrasts. Greater variability in results may be indicative of interest and empirical efforts addressing an increasingly larger variety of pronunciation features and instructional approaches.

According to Plonsky and Oswald’s (2014) scale for interpreting d values in L2 research, the overall findings of this study represent medium to large effects. Compared with meta-analytic findings in other areas of instructed SLA, these results show that instruction on pronunciation can be just as effective as (or more effective than) instruction on vocabulary, grammar, and pragmatics (see Figure 1). However, in light of the evidence of a bias toward statistically significant results observed in the funnel plot, the overall findings may overestimate true population effects.

Drawing on theoretical and practical concerns, we also examined variability in the effects of PI as a function of three categories of potential moderating variables: contexts, treatments, and outcomes. Among other results across contexts of PI, learner age (level of education) was found to be related to treatment effects. At first glance, these findings might appear to support what we would expect for the role of age in L2 pronunciation development. A more nuanced interpretation of this result, however, would also consider the fact that age effects are generally much stronger in second language contexts than in foreign language contexts, where exposure is limited almost exclusively to classroom instruction (Trofimovich et al. 2009; Muñoz 2011). Furthermore, because of the lack of available evidence in primary studies, these results are not based on studies with learners within what is typically considered to be the critical period (0 to around 12). Our findings with respect to age and the effects of PI should therefore not be considered conclusive.

Unlike the findings for age, no clear pattern was found for the effects of PI across proficiency levels. Practically speaking, these findings suggest that learners at different proficiencies can all benefit from PI. The lack of a clear relationship between proficiency and the effects of PI might also be attributed, at least in part, to the challenges inherent in reliably and validly identifying proficiency at the primary and meta-analytic levels. As in previous meta-analyses, we were limited in our ability to determine and code for proficiency by what primary authors reported.

Replicating the findings from several previous meta-analyses of instructed SLA (e.g. Li 2010; Plonsky 2011), our results show that laboratory-based PI may produce stronger effects than PI carried out in intact classes. Both, however, are effective. Interestingly, the choice of setting for PI research appears to have changed in this domain. Studies of PI have migrated over time from laboratories to classrooms, a shift often seen in other social sciences where experimental effects are explored in low-stakes environments before they are tested in applied contexts such as classrooms (Oswald and Plonsky 2010).

Our results also reveal several trends across different types of PI treatments. First, as we might expect, longer treatments (i.e. longer than the median intervention of 4.25 h) generally produced larger effects. This finding confirms Saito’s (2012) synthesis which found that one of only two studies not showing an effect of PI included a treatment of only 15–30 min ( Macdonald et al. 1994 ; Note: This study was not included in our analysis owing to missing/unreported data). It also replicates results of previous meta-analyses of instructed SLA examining the effects of treatment length (e.g. Jeon and Kaya 2006 ). Because of the potential for this body of research to inform L2 pedagogy, the practical significance of this finding, along with many others, should be given critical consideration. (See Plonsky and Oswald 2014 , for a discussion on weighing practical significance against resources such as time and experimental manipulation required to induce effects.)

Another treatment feature associated with larger effects is the provision of feedback. Hundreds of primary studies and 18 meta-analyses of feedback research (see Plonsky and Brown 2014 ) have shown positive effects for feedback. This massive body of research, however, has almost exclusively considered feedback on lexical and morphosyntactic errors. As a point of theoretical interest, the results of this study suggest that previous findings with respect to feedback are robust to the domain of L2 pronunciation as well. From a practical standpoint, these results also show that including feedback in a program of PI can improve its effectiveness ( Saito and Lyster 2012 ). Given the robustness of feedback effects found for other target domains (i.e. grammar and vocabulary), this finding is perhaps not surprising. It is, however, worth noting that prior to this study aggregate findings for this effect were lacking in the synthetic literature.

The opposite was found for the use of technology and computer-delivered PI. Studies that provided PI using technology, whether entirely or in part, produced smaller effects than those that relied exclusively on human-delivered instruction. The lower adaptability and perceptual accuracy of computers compared with human teachers, and perhaps consequently their more limited ability to provide appropriate feedback, may partly explain this finding. Although the accessibility of computer-delivered PI has great potential, there is clearly a need for research seeking to improve technology-enhanced instructional materials.

Sorely missing from our results is a more fine-grained analysis of the effects of different types of pedagogical practices. The norm in this body of research was to simply refer to a general approach such as Celce–Murcia et al.’s (1996) five stages for PI. While reading the Methods sections, we were often left wondering about the details of instructional materials and activities. For example, did they consist of decontextualized drills? Was PI embedded in meaning-oriented tasks? And to what extent did the pedagogy match researchers’ efforts to assess learner development? Future studies would do well to include greater procedural detail in written reports.

This study also examined the relative effects of PI across a range of targeted linguistic features. Much like Saito (2012), our study found relatively homogeneous effects of PI on different features. Our results also add precision to Saito’s. That is, the vote-count procedure he employed indicated only a consistently positive direction of effects for instruction on segmentals and suprasegmentals. The present study goes further by providing both the direction (positive) and magnitude of such effects, which are relatively strong and stable. We view these findings positively, as they imply that PI can be effective for a wide variety of features. Furthermore, in light of the larger effects observed when PI targeted both segmental and suprasegmental features (as opposed to either one independently), we echo other scholars’ recommendation that L2 practitioners consider including a variety of features in their curricula (cf. Kang et al. 2010). Rather than focus exclusively on the often-debated segmental/suprasegmental distinction, the results of our study support an approach that treats sets of features that align with learners’ needs, backgrounds, and first languages (Saito 2012).

In addition to different contexts and treatments in PI research, we also examined the moderating effects of different outcome types. Again, as in previous meta-analyses of instructed SLA research (e.g. Norris and Ortega 2000; Spada and Tomita 2010), our findings show that the choice of outcome measure can affect study results. Specifically, studies employing more controlled outcome measures/items produced larger effects than those employing more open-ended ones. Although the latter are likely more representative of learners’ true ability, the former are perhaps more similar to practice activities carried out during experimental treatments. Further complicating this matter is the fact that most studies relied on outcome measures of a very controlled nature (e.g. reading lists of individual words or sentences). In order to gain a fuller understanding of treatment effects, future studies of PI should follow the recent trend in L2 vocabulary and grammar research by including different types of outcome measures (see Mackey and Goo 2007; Spada and Tomita 2010). Results from less controlled instruments may be more difficult to analyze, but practical challenges are a small price to pay for authenticity and ecological validity.

One final and critical consideration relevant to this discussion is instrument reliability in PI research. The main issue here is not, as far as we can tell, low reliability. Rather, it is the lack of reliability estimates in study reports (they were available for only 47 per cent of the sample), which limits our ability to accurately interpret study results. Although there is clearly room for improvement here, the presence of reliability estimates in PI research is greater than in several other previously meta-analyzed domains of SLA, where they have been found in anywhere from 6 per cent (L2 practice; Nekrasova and Becker 2009) to 64 per cent (L2 interaction; Plonsky and Gass 2011) of studies.

Critiques and suggestions for future research

A meta-analytic data set not only enables the researcher to critique the domain in question; it is his/her duty to do so. We therefore conclude our discussion with a list of limitations observed in our sample. In order to move the domain toward areas that merit attention as well as improved methodological practice, each critique is accompanied by suggestions for future research.

Critique 1: The validity of PI research both in individual studies and in the aggregate is threatened by the use of very small samples and correspondingly low statistical power. A post hoc power analysis based on the results of this study shows observed power within and between groups to be just 0.66 and 0.55, respectively. Underpowered studies not only limit our ability to detect true effects, they also lead to an uneven depiction of population effects (via publication bias) when summarized at the secondary level.
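
The relationship between sample size and statistical power behind this critique can be sketched in Python. The source does not state how observed power was computed, so the example assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation. The function names `approx_power` and `required_n_per_group`, the group size of 20, and the target power of 0.80 are hypothetical choices made for illustration only.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of effect size d.

    Uses the normal approximation, so values are slightly optimistic for
    very small groups compared with an exact t-based calculation.
    """
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_per_group / 2)
    return (STANDARD_NORMAL.cdf(noncentrality - z_crit)
            + STANDARD_NORMAL.cdf(-noncentrality - z_crit))


def required_n_per_group(d: float, target: float = 0.80) -> int:
    """Smallest equal group size whose approximate power reaches target."""
    n = 2
    while approx_power(d, n) < target:
        n += 1
    return n


# d = 0.80 is the overall between-group effect reported in this study;
# the group size of 20 is a hypothetical example.
power_small_groups = approx_power(0.80, 20)
n_needed = required_n_per_group(0.80)

print(f"approximate power with 20 per group: {power_small_groups:.2f}")
print(f"group size needed for power of 0.80: {n_needed}")
```

Under these assumptions, two groups of 20 learners would fall short of the conventional 0.80 power level even for an effect as large as d = 0.80, and smaller true effects would require considerably larger groups.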

Fortunately, there is a relatively (if also deceptively) simple two-part solution to this problem. First, PI research, like much of SLA, needs larger samples. In some cases, owing to practical constraints of recruiting participants, this will require fewer subgroup comparisons. However, a smaller number of more reliable results is certainly preferable to a greater number of less reliable ones. Second, PI researchers ought to move away from the dichotomous thinking embodied by null hypothesis significance testing, focusing instead on point estimates and their practical significance as expressed by effect sizes (Norris and Ortega 2006; Plonsky 2013). Considering that only 17 studies (20 per cent of the sample) reported effect sizes, and almost none provided a useful interpretation of those effects, change in this direction may be slow.

Critique 2: Sampling in PI research is not only underpowered; it also lacks diversity, particularly in terms of ages, first languages, and target languages. Whereas the previous critique poses a threat to internal validity, this problem calls into question the external validity or generalizability of PI research. Only 4 of 86 primary reports in our sample involved participants younger than 13 years old. Considering that many pronunciation errors are L1 and L2 specific (e.g. Derwing and Munro 2013), it is perhaps even more concerning that English was either the participants’ first language or the target language in 83 of the 86 studies.

The way forward here for PI researchers is to consider recruiting younger participants as well as learners other than those whose L1 or L2 is English. Doing so may require reaching out to and making connections with teachers and researchers outside of our home institutions.

Critique 3: Three features of PI designs are in need of improvement. First, only 14 per cent of the sample examined the longevity of effects by means of a delayed posttest. Interestingly, studies in this sample that included delayed posttests in their design produced larger effects, a pattern also observed in Plonsky (2011, 2013) and Plonsky and Gass (2011). In order to determine the practical significance of PI in ‘real-world’ settings, where learning matters beyond a brief intervention, delayed posttests must be incorporated into future studies. This practice would also contribute to theoretical discussions related to the durability of instructional treatments. Second, pre–post designs far outnumber controlled experiments. This trend is likely due in part to the use of intact samples where it may be inappropriate or even unethical to withhold treatment for the sake of experimental control. Although pre–post designs can help guide L2 practitioners’ expectations, absolute effects can only be measured through more rigorously controlled designs. Third, PI research relies too heavily, primarily even, on controlled outcome measures, thus again limiting the external validity of study results. There is clearly a need for greater variety of outcome measures in PI research.

Critique 4: Two substantive issues observed in this body of research are also worth noting. First is a lack of attention to a number of phonetic and phonological features such as articulation, elision, linking, and stress. And secondly, the interactions between different treatments and learner backgrounds (i.e. aptitude-treatment interaction research, or ATI) present a potential source of findings in PI with relevance for L2 theory and practice. To date, however, a very small number of studies have examined such interactions (e.g. Elliot 1997 ) unlike the rapidly growing body of ATI research in the realm of grammar instruction (e.g. Li 2013 ).

FUNDING

This work was supported in part by the Hankuk University of Foreign Studies Research Fund given to Junkyu Lee.

Conflict of interest statement . None declared.

NOTES

1 ‘Vote-counting’ is a type of synthesis wherein the researcher, as in meta-analysis, systematically collects and codes primary studies in a given domain. Unlike meta-analysis, which produces a quantitative indication of the magnitude of the relationship in question, however, the results of vote-counting speak only to the direction of effects.
2 Our sincere thanks to the authors of the five reports who provided us with the data we needed to include their studies in our analysis: Walcir Cardoso; Manuela Gonzalez–Bueno and Marcela Quintana Lara; Rebecca Hincks; Gillian Lord; and Eleni Tsiartsioni.

REFERENCES

Bertram
S
A Case Study of the Noticing-Reformulation Technique
 , 
2008
Unpublished MA thesis. Hamline University
Celce–Murcia
M
Brinton
D
Goodwin
J
Teaching Pronunciation: A Reference for Teachers of English to Speakers of Other Languages
 , 
1996
Cambridge University Press
Chiu
Y -H
Computer-assisted second language vocabulary instruction: A meta-analysis
British Journal of Educational Technology
 , 
2013
, vol. 
44
 (pg. 
E52
-
6
)
Darcy
I
Ewert
D
Lidster
R
Levis
J
Lavelle
K
Bringing pronunciation instruction back into the classroom. An ESL teachers’ pronunciation ‘toolbox
Proceedings of the 3rd Pronunciation in Second Language Learning and Teaching Conference
 , 
2012
Iowa State University
(pg. 
93
-
108
)
Derwing
T M
What do ESL students say about their accents?
The Canadian Modern Language Review
 , 
2003
, vol. 
59
 (pg. 
547
-
66
)
Derwing
T M
Munro
M J
Second language accent and pronunciation teaching: A research-based approach
TESOL Quarterly
 , 
2005
, vol. 
39
 (pg. 
379
-
97
)
Derwing
T M
Munro
M J
The development of L2 oral language skills in two L1 groups: A 7-year study
Language Learning
 , 
2013
, vol. 
63
 (pg. 
163
-
85
)
Derwing
T M
Munro
M
Wiebe
G
Evidence in favor of a broad framework for pronunciation instruction
Language Learning
 , 
1998
, vol. 
48
 (pg. 
393
-
410
)
Derwing
T M
Thompson
R I
Munro
M J
English pronunciation and fluency development in Mandarin and Slavic speakers
System
 , 
2006
, vol. 
34
 (pg. 
183
-
93
)
Elliot
A S
On the teaching and acquisition of pronunciation within a communicative approach
Hispania
 , 
1997
, vol. 
80
 (pg. 
95
-
109
)
Flege
J E
Yeni-Komshian
G H
Liu
S
Age constraints on second-language acquisition
Journal of Memory and Language
 , 
1999
, vol. 
41
 (pg. 
78
-
104
)
Goo, J., G., Granena, Y. Yilmaz, and M. Novella (in press). Implicit and explicit instruction in L2 learning: Norris & Ortega (2000) revisited and updated. In P. Rebuschat (Ed.), Implicit and Explicit Learning of Languages . John Benjamins
Gordon
J
Darcy
I
The development of comprehensible speech in L2 learners: Effects of explicit pronunciation instruction on segmentals and suprasegmentals
2012
 
Paper presented at AAAL. Boston, MA
Hahn
L D
Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals
TESOL Quarterly
 , 
2004
, vol. 
38
 (pg. 
201
-
23
)
Hardison
D M
Contextualized computer-based L2 prosody training: Evaluating the effects of discourse context and video input
CALICO Journal
 , 
2005
, vol. 
22
 (pg. 
175
-
90
)
Ingels
S
Levis
J
LeVelle
K
The effects of self-monitoring strategy use on the pronunciation of learners of English
Proceedings of the 1st Pronunciation in Second Language Learning and Teaching Conference
 , 
2010
Iowa State University
(pg. 
67
-
89
)
Isaacs
T
Thomson
R I
Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions
Language Assessment Quarterly
 , 
2013
, vol. 
10
 (pg. 
135
-
59
)
Isaacs
T
Trofimovich
P
Deconstructing comprehensibility: Identifying the Linguistic Influences on Listeners’ L2 Comprehensibility Ratings
Studies in Second Language Acquisition
 , 
2012
, vol. 
34
 (pg. 
475
-
505
)
Jeon
E H
Kaya
T
Norris
J M
Ortega
L
Effects of L2 instruction on interlanguage pragmatic development: A meta-analysis
Synthesizing Research on Language Learning and Teaching
 , 
2006
John Benjamins
(pg. 
165
-
211
)
Kang
O
Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness
System
 , 
2010
, vol. 
38
 (pg. 
301
-
15
)
Kang
O
Impact of rater characteristics on ratings of international teaching assistants’ oral performance
Language Assessment Quarterly
 , 
2012
, vol. 
9
 (pg. 
249
-
69
)
Kang, O., D. Rubin, and L. Pickering (2010). Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English. Modern Language Journal, 94, 554–66
Krashen, S. D. (1982). Principles and Practice in Second Language Acquisition. Pergamon
Lee, S. T. (2008). Teaching pronunciation using computer-assisted learning software: An action research study in an institute of technology in Taiwan. EdD dissertation, Australian Catholic University
Levis, J. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39, 369–77
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language Learning, 60, 309–65
Li, S. (2013). The interactions between the effects of implicit and explicit feedback and individual differences in language analytic ability and working memory. Modern Language Journal, 97, 634–54
Lord, G. (2008). Podcasting communities and second language pronunciation. Foreign Language Annals, 41, 374–89
MacDonald, D., G. Yule, and M. Powers (1994). Attempts to improve English L2 pronunciation: The variable effects of different types of instruction. Language Learning, 44, 75–100
Mackey, A., and J. Goo (2007). Interaction research in SLA: A meta-analysis and research synthesis. In A. Mackey (Ed.), Conversational Interaction in Second Language Acquisition: A Collection of Empirical Studies. Oxford University Press, pp. 407–51
Miller, J. S. (2012). Teaching French pronunciation with phonetics in a college-level beginner French course. The NECTFL Review, 69, 47–68
Miller, J. S. (2013). Improving oral proficiency by raising metacognitive awareness with recordings. In J. Levis and K. LeVelle (Eds.), Proceedings of the 4th Pronunciation in Second Language Learning and Teaching Conference. Iowa State University, pp. 101–11
Muñoz, C. (2011). Input and long-term effects of starting age in foreign language learning. International Review of Applied Linguistics in Language Teaching, 71, 197–220
Nekrasova, T., and T. Becker (2009). Effectiveness of practice: A research synthesis and quantitative meta-analysis. Unpublished manuscript
Norris, J. M., and L. Ortega (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50, 417–528
Norris, J. M., and L. Ortega (2006). The value and practice of research synthesis for language learning and teaching. In J. M. Norris and L. Ortega (Eds.), Synthesizing Research on Language Learning and Teaching. John Benjamins, pp. 3–50
Oswald, F. L., and L. Plonsky (2010). Meta-analysis in second language research: Choices and challenges. Annual Review of Applied Linguistics, 30, 85–110
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-analysis. Language Learning, 61, 993–1038
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–87
Plonsky, L., and D. Brown (2015). Domain definition and search techniques in meta-analyses of L2 research (Or why 18 meta-analyses of feedback have different results). Second Language Research, 31, 267–78
Plonsky, L., and S. Gass (2011). Quantitative research methods, study quality, and outcomes: The case of interaction research. Language Learning, 61, 325–66
Plonsky, L., and F. L. Oswald (2012). How to do a meta-analysis. In A. Mackey and S. M. Gass (Eds.), Research Methods in Second Language Acquisition: A Practical Guide. Wiley Blackwell, pp. 275–95
Plonsky, L., and F. L. Oswald (2014). How big is ‘big’? Interpreting effect sizes in L2 research. Language Learning, 64, 878–91
Plonsky, L., J. Egbert, and G. T. LaFlair (in press). Bootstrapping in applied linguistics: Assessing its potential using shared data. Applied Linguistics. doi:10.1093/applin/amu001
Saito, K. (2012). Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi-experimental intervention studies. TESOL Quarterly, 46, 842–54
Saito, K. (2014). Experienced teachers’ perspectives on priorities for improved intelligible pronunciation: The case of Japanese learners of English. International Journal of Applied Linguistics, 24, 250–77
Saito, K., and R. Lyster (2012). Effects of form-focused instruction and corrective feedback on L2 pronunciation development of /r/ by Japanese learners of English. Language Learning, 62, 595–633
Shintani, N. (2015). The effectiveness of processing instruction on L2 grammar acquisition: A meta-analysis. Applied Linguistics, 36(3), 306–25
Shintani, N., S. Li, and R. Ellis (2013). Comprehension-based versus production-based grammar instruction: A meta-analysis of comparative studies. Language Learning, 63(2), 296–329
Spada, N., and Y. Tomita (2010). Interactions between type of instruction and type of language feature: A meta-analysis. Language Learning, 60, 263–308
Tokumoto, M., and M. Shibata (2011). Asian varieties of English: Attitudes towards pronunciation. World Englishes, 30, 392–408
Trofimovich, P., P. M. Lightbown, and R. H. Halter (2009). Comprehension-based practice: The development of L2 pronunciation in a listening and reading program. Studies in Second Language Acquisition, 31, 609–39
Tsiartsioni, E. (2010). The effectiveness of pronunciation teaching to Greek state school students. In A. Psaltou-Joycey and M. Mattheoudaki (Eds.), Selected Papers from the Proceedings of the 14th International Conference of the Greek Applied Linguistics Association. GALA, pp. 429–46
VanPatten, B. (2002). Processing instruction: An update. Language Learning, 52, 755–803
Wa-Mbaleka, S. (2006). A Meta-analysis Investigating the Effects of Reading on Second Language Vocabulary Learning. Unpublished doctoral dissertation, Northern Arizona University
Won, M. (2008). The Effects of Vocabulary Instruction on English Language Learners: A Meta-analysis. Unpublished doctoral dissertation, Texas Tech University
Yates, K. (2003). Teaching linguistic mimicry to improve second language pronunciation. Master's thesis, University of North Texas

Supplementary data