Read Lee, Jang, & Plonsky (2015) and provide information about the:
1) Topic and domain of the meta-analysis;
2) Sources used for the lit review and identifying studies;
3) Inclusion and exclusion criteria;
4) Information that was coded;
5) Kind of effect sizes computed;
6) Main findings.
Applied Linguistics 2015: 36/3: 345–366
© Oxford University Press 2014
doi:10.1093/applin/amu040 Advance Access published on 25 July 2014
The Effectiveness of Second Language
Pronunciation Instruction: A Meta-Analysis
1Graduate School of Education, Hankuk University of Foreign Studies, Seoul,
South Korea, 2Department of TESOL, Hankuk University of Foreign Studies, Seoul,
South Korea and 3Department of English, Northern Arizona University, AZ, USA
*E-mail: luke.plonsky@nau.edu
The goal of this study was to determine the overall effects of pronunciation
instruction (PI) as well as the sources and extent of variance in observed effects.
Toward this end, a comprehensive search for primary studies was conducted,
yielding 86 unique reports testing the effects of PI. Each study was then coded
on substantive and methodological features as well as study outcomes (Cohen’s d).
Aggregated results showed a generally large effect for PI (d = 0.89 and 0.80 for
N-weighted within- and between-group contrasts, respectively). In addition,
moderator analyses revealed larger effects for (i) longer interventions, (ii) treatments providing feedback, and (iii) more controlled outcome measures. We interpret these and other results with respect to their practical and pedagogical
relevance. We also discuss the findings in relation to instructed second language
acquisition research generally and in comparison with other reviews of PI (e.g.
Saito 2012). Our conclusion points out areas of PI research in need of further
empirical attention and methodological refinement.
INTRODUCTION
Pronunciation instruction (PI) is one of several areas in the domain of instructed second language acquisition (SLA) that carries significant potential
to inform both theory and practice. It is not surprising, therefore, that research
on the effects of PI has been extensive, despite frequent commentary claiming
the contrary (e.g. Derwing and Munro 2005). This line of research has examined PI across many learners and contexts (e.g. various target languages and
proficiency levels), pedagogical approaches (with vs. without feedback), linguistic features (e.g. segmentals vs. suprasegmentals), and outcome types (i.e.
constrained vs. guided vs. open-ended). Findings from studies in this area are
summarized regularly in review articles and handbooks (e.g. Saito 2012).
However, given the qualitative and non-comprehensive nature of such reviews, it is difficult to ascertain with certainty and precision the overall effects
of PI. It is even more difficult, if not impossible, to determine with any precision the extent to which different factors may moderate the effects of PI, much
Downloaded from https://academic.oup.com/applij/article/36/3/345/2422438 by guest on 10 December 2020
JUNKYU LEE, 2JUHYUN JANG and 3,*LUKE PLONSKY
THE EFFECTIVENESS OF SECOND LANGUAGE PRONUNCIATION INSTRUCTION
LITERATURE REVIEW
Pronunciation and instructed SLA
The effectiveness of L2 instruction has been the object of extensive empirical
investigation in the field of SLA. Researchers have examined the effects of
instruction on a wide range of L2 features and skills including grammar/morphosyntax, vocabulary, pragmatics, and the focus of this synthesis, pronunciation (e.g. Derwing and Munro 2005; Saito 2012). Early research in these areas
was often concerned with the rather broad question of whether or not (explicit) instruction led to L2 development, compared with input-only or meaning-based approaches such as those advocated by Krashen (1982), VanPatten
(2002), and others (for current reviews, see Shintani 2015 and Shintani
et al. 2013). Extensive research since the 1980s, and Norris and Ortega’s
(2000) meta-analysis in particular, however, has largely put this debate to
rest. Empirical efforts have since turned to the generalizability of instructional
effects. That is, studies have looked at instructed L2 acquisition as a function of
different learner backgrounds and contexts (e.g. second vs. foreign language
setting), different types of linguistic features (e.g. simple vs. complex), and
different types of instruction (e.g. explicit vs. implicit), among other variables.
With the exception of pronunciation, these same subdomains and the questions they address have been meta-analyzed, often multiple times. Figure 1
presents summary effects from these studies, organized according to the linguistic target of instruction: grammar, vocabulary, and pragmatics. Several
results across meta-analyses of L2 instruction are worth noting. First, there
is clearly substantial variability in observed effects both across and within
different subdomains of instructed SLA; meta-analytic d values range from
less interpret their implications for pedagogy. As Darcy et al. (2012) have
stated, ‘there is no agreed upon system of deciding what [pronunciation features] to teach, and when and how to do it’. And unlike all other linguistic foci
targeted by second language (L2) instruction—grammar, vocabulary, pragmatics—quantitative results from this body of research have yet to be synthesized via meta-analysis, hence pointing to the need for this study as a means to
inform our understanding of PI as well as for more general development of L2
theory and practice.
The literature review that follows is broken into two main parts: first, we
provide a brief outline of research on the effectiveness of L2 instruction as
demonstrated across other L2 features, highlighting the findings of previous
meta-analyses of these areas as they relate to and inform the present study. We
then move on to the focus of this study, PI, and a description of its theoretical
and practical importance. We also provide an overview of empirical investigations in this area organized around different contexts, treatments, and outcomes that vary in this body of research and that are suggested to moderate the
effects of PI.
J. LEE, J. JANG, AND L. PLONSKY
Figure 1: Overall/meta-analytic effects of instructed SLA across subdomains [figure not reproduced; x-axis: meta-analytic d value, 0 to 2; panels: grammar, vocabulary, and pragmatics; meta-analyses shown: Spada and Tomita (2010; k = 30), Goo et al. (in press; k = 35), Norris and Ortega (2000; k = 45), Shintani et al. (2013; k = 30), Won (2008; k = 30), Chiu (2013; k = 16), Wa-Mbaleka (2006; k = 34), and Jeon and Kaya (2006; k = 13)]
one-third of a standard deviation (implicit instruction on simple grammar
forms; Spada and Tomita 2010) to a difference between control and experimental groups of nearly two standard deviations [as found in Shintani et al.’s
(2013) results for receptively measured effects of comprehension-based grammar instruction]. This set of results also indicates the extent to which different
subdomains of instructed SLA research have been summarized via meta-analysis. Whereas the effects of instruction on L2 grammar and vocabulary have
been documented fairly well at the meta-analytic level, this is not the case for
pragmatics instruction. No study to date has meta-analyzed the effects of PI,
the focus of the present study. This gap in the literature limits our understanding of the merit of PI in L2 classes as well as of instruction in this area relative
to other target features and subdomains such as those shown above.
Our interest in this study, however, is not solely in the overall extent to
which this body of research has led to improvements in L2 pronunciation. As
with other domains of L2 instruction, the effects of PI are likely to vary as a
function of different substantive and methodological features. In the remainder of the literature review, we therefore describe different contexts, treatments, and outcomes with respect to their potential role in moderating the
effects of PI, highlighting relevant studies when appropriate.
The role of contextual and learner factors in PI
Treatment and target features
One of the most critical considerations with respect to designing interventions
that seek to improve L2 pronunciation is the type of feature(s) to target. A
great deal of discussion in this area surrounds the relative effectiveness of
instruction on segmental vs. suprasegmental features. Levis (2005) and Saito
(2014) have suggested that segmental phonology may be easier for teachers to
teach and for learners to learn. Others (e.g. Hahn 2004), however, claim instruction on suprasegmental features to be more effective. The importance of
instruction on suprasegmentals is also underscored by their impact on comprehensibility and accentedness (e.g. Kang 2010; Isaacs and Trofimovich
2012).
Very few empirical investigations have addressed the relative effectiveness of
PI on these two feature types. In Derwing et al. (1998), one group received
instruction on segmental features (e.g. individual sound contrasts) and another
on suprasegmental features (rhythm, intonation, and stress). In comparison
with a control group that received no pronunciation-specific instruction, both
groups improved on perceived accentedness and comprehensibility as
As shown in numerous studies and meta-analyses in other SLA subdomains,
instructional context and learner background can greatly influence the impact
of a pedagogical intervention (e.g. Plonsky and Oswald 2014). Such variables
might include participants’ age, proficiency level, type of educational institution, second vs. foreign language environment, and whether the study is carried out in a laboratory or classroom setting.
Given evidence in favor of a critical period for phonological development
(e.g. Flege et al. 1999), the role of age may be particularly strong in the case of
pronunciation. Specifically, although PI is more often tested with adult learners, we might predict larger effects for studies involving children (e.g.
Trofimovich et al. 2009; Tsiartsioni 2010). We might also expect to find
larger effects in laboratory-based (as opposed to classroom-based) studies
owing to increased experimental control in the former. Li (2010), for example,
found that the average effect of corrective feedback in laboratory-based studies
(d = 1.08) was more than twice that of classroom-based studies (d = 0.50).
Likewise, PI may lead to larger effects in second-language settings than foreign
language settings owing to the value attributed to speaking and sounding native-like in the former (Derwing 2003; but cf. Tokumoto and Shibata 2011).
Finally, the effectiveness of PI may also be related to the proficiency of the
participants. Derwing and Munro (2005), for example, argue that instruction
yields more rapid improvement in lower-level learners. More advanced learners who possess foundational knowledge of pronunciation as well as other
skills, however, may be able to integrate and adapt their pronunciation more
readily.
Outcome measures of PI
The literature review has thus far focused on independent variables of PI research. In this section, we discuss different types of outcome measures with
respect to their potential to moderate the effects of PI. One feature that may
impact the results of PI is the extent to which items are controlled (i.e. requiring a fixed response from all participants) vs. ‘free’ (i.e. productive measures
that are open-ended, allowing for a variety of different responses). PI researchers may prefer more controlled or shorter items as a means to ensure
that participants produce the target feature(s). Such items, however, may
not accurately represent learners’ ability in carrying out more authentic,
real-world tasks (see Saito and Lyster 2012). Furthermore, the artificiality
and lack of communicative value in controlled and/or word-length tasks
measured on a read-aloud task. However, only the suprasegmental group
showed improvement on a less-controlled, picture description task (see discussion of outcome types in the following section). The results of Gordon and
Darcy (2012) and Yates (2003) are even clearer, showing an effect of PI on
suprasegmentals to have almost twice the effect of segmentals. However, using
a ‘vote-count’1 approach to synthesizing research on PI, Saito (2012) found
that studies providing instruction on segmental and suprasegmental features
both generally lead to gains.
In addition to different linguistic foci, several other treatment features may
also be related to the effects of PI. For instance, studies in this area have often
included a technological component. Researchers often use programs/software
such as Anvil or visual input such as spectrograms to provide stimuli, feedback,
minimal pair practice, and so forth. In some cases, technology has been used to
complement teacher- or researcher-delivered instruction (e.g. Lord 2008); in
others, a computer program is the sole provider of instruction (Hardison 2005).
Often occurring in conjunction with technology in the form of adaptive
instruction is feedback. Although many studies have included feedback as
part of a treatment, the study by Saito and Lyster (2012) is perhaps the only
one to have done so in a way that allows the effects of feedback to be measured
directly. Their findings show an advantage for a treatment consisting of form-focused instruction plus feedback (recasts) over form-focused instruction alone
on both controlled and free response outcome measures.
The length of a treatment may also be related to its effectiveness. This feature
is not unique to studies of PI. In fact, several meta-analyses have investigated
summary effects as a function of treatment duration (e.g. pragmatics: Jeon and
Kaya 2006; strategy instruction: Plonsky 2011). As we might expect, these
studies generally find longer treatments to produce stronger effects. Plonsky
and Oswald (2014), however, warn that a strong correlation between treatment length and effect size may put into question the practicality of such
interventions. In other words, instructional costs (time and energy) must be
weighed against their potential benefits for L2 learners.
Research questions
To better understand the overall effects of PI and to explain potential
moderators of those effects, the present study addressed the following research
questions:
1 What is the overall effectiveness of instruction on L2 pronunciation?
2 What is the relationship between PI and different contexts, treatment
types, and outcome measures?
METHODS
Study identification
Before searching for studies that might help us answer our research questions,
we defined a set of inclusion/exclusion criteria. In order to be included, a study
had to (i) report the findings of an experiment or quasi-experiment in which
L2 learners were provided with instruction on one or more aspects of pronunciation; (ii) present quantitative results of the study; and (iii) demonstrate the
effects of PI using a pre–post (within groups) and/or control/comparison–experimental (between groups) design.
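The three inclusion criteria above can be read as a simple screening rule. The following is a minimal sketch of that reading, not the authors' actual procedure; the record type and every field name are hypothetical and were introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateStudy:
    """Hypothetical screening record; all field names are illustrative assumptions."""
    is_experiment_or_quasi: bool    # criterion (i): experiment or quasi-experiment
    instructs_pronunciation: bool   # criterion (i): L2 learners received PI
    reports_quantitative: bool      # criterion (ii): quantitative results presented
    has_pre_post: bool              # criterion (iii): within-groups design
    has_control_or_comparison: bool  # criterion (iii): between-groups design


def meets_inclusion_criteria(study: CandidateStudy) -> bool:
    """Apply criteria (i)-(iii); either design type satisfies criterion (iii)."""
    design_ok = study.has_pre_post or study.has_control_or_comparison
    return (
        study.is_experiment_or_quasi
        and study.instructs_pronunciation
        and study.reports_quantitative
        and design_ok
    )
```

A study with neither a pretest nor a control/comparison group fails criterion (iii) under this sketch, which mirrors one of the exclusion reasons the authors list later in this section.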
Having determined the parameters of our search, we set out to locate relevant primary studies. In doing so, we employed a wide and diverse set of
techniques, accepting redundancy in exchange for comprehensiveness (see
Plonsky and Brown 2015). First, using combinations of keywords (second language, foreign language, pronunciation, and instruction), we searched library-housed databases including Educational Resources Information Center,
Linguistics and Language Behavior Abstracts, PsycINFO, PsycArticles, Web of
Science, and ProQuest Dissertations and Theses as well as two nonlibrary databases: Google and Google Scholar. We conducted ancestry searches by examining the references of previous reviews (e.g. Saito 2012) and all candidate
may allow learners to focus more on their pronunciation, thus leading to larger
effects (see Elliot 1997; Saito 2012). (See Norris and Ortega 2000; Spada and
Tomita 2010 for discussion and evidence of this phenomenon in the context of
grammar instruction.)
Another outcome feature that may be related to treatment effects is rater
background, an issue subject to extensive investigation in the literature on
L2 pronunciation assessment (e.g. Kang 2012). It has been suggested, and
there is empirical evidence to suggest, that rater characteristics such as
native language, experience working with nonnative speakers, and knowledge
of the target language may affect their evaluations of L2 pronunciation (e.g.
Isaacs and Thomson 2013). These factors tend to play a more prominent role
in ratings of pronunciation compared with assessments of grammar and
vocabulary.
studies. We contacted authors and gratefully received manuscripts from two
individual authors (Veronica Sardegna and Isabelle Darcy). We manually
searched all four available Proceedings of the Pronunciation in Second
Language Learning and Teaching Conference. Finally, we consulted previously
generated bibliographies of PI, and we examined the professional web pages of
researchers known for their work in this area (Tracey Derwing, Kazuya Saito,
and John Levis).
Our search revealed a number of studies that appeared to meet these criteria
but were excluded for one or more of the following reasons: (i) outcomes other
than pronunciation (e.g. attitudes and overall proficiency) were assessed following PI (e.g. Miller 2013); (ii) duplicate data were presented in a different
report such as a dissertation (e.g. Ingels 2010); (iii) L2 pronunciation development was assessed over time but without a treatment (Derwing et al. 2006);
(iv) only qualitative outcomes were provided (Lee 2008); (v) the design
included neither a pretest nor a control/comparison group (Miller 2012); (vi)
the effects of PI were based on a single participant, thus preventing the calculation of an effect size (Bertram 2008). And finally, numerous studies (25
studies or 29 per cent of the total sample—see below) were excluded owing
to missing data, usually standard deviations. Requests for missing/unreported
data were sent via email to 16 authors: 5 provided the data,2 4 responded that
they could not find the data, and 7 never responded to the request. (For a
discussion on data sharing and transparency, see Plonsky et al. in press.)
Our search led to 86 study reports that were included in the final analysis
(see Supplementary Information for references of included studies). Although
the majority of the studies were journal articles (k = 45, 52 per cent), a number
of dissertations/theses and articles in conference proceedings were included as
well (k = 19, 22 per cent and k = 15, 19 per cent, respectively). A small number
of studies were also found in book chapters (3), conference presentations
(PowerPoints; 2), and unpublished manuscripts (1). Unfortunately, the norm
in L2 meta-analyses has not been to take such an inclusive approach, leaving
synthetic samples much more susceptible to the inflating effects of publication
bias (Plonsky and Brown 2014).
Within these 86 reports, effect sizes were extracted based on 110 within-group (pre–post) and 60 between-group (control–experimental) samples. This
sample is substantially larger than any of the meta-analyses of L2 instruction
mentioned above (Figure 1). The total N for all studies was 2,782, consisting of
777 control participants (median = 12) and 2,005 experimental participants
(median = 14).
Although the studies in this sample date back over 32 years, most are fairly
recent. The sample includes three studies from the years 1982 to 1989, seven
from 1990 to 1997, sixteen from 1998 to 2005, and fifty-nine from 2006 to
2013 (including one paper that is ‘in press’). (The date of one additional study
could not be identified.) Interest in the effectiveness of PI is clearly strong and
increasing, despite frequent claims to the contrary (e.g. Derwing and Munro
2005).
Coding
Each study was coded for substantive and methodological features as well as
effect sizes (Cohen’s d) in order to answer our two research questions. In
particular, our coding scheme was designed to extract data related to study
contexts, treatment types, and outcome variables (see Supplementary Table 1).
In order to ensure interrater reliability, the second and third authors both
coded the entire sample, as recommended by Plonsky and Oswald (2012).
Disagreements were discussed and resolved, and operational definitions were
adjusted when necessary.
Analysis
Both research questions involved calculating descriptive statistics based on
effect sizes. Multiple effects or outcomes derived from a single sample were
combined (averaged). Each sample’s effects were kept separate to preserve
differences between multiple treatment groups. Two outliers based on
within-groups contrasts from a single report were identified (d = 8.80, 8.55)
and excluded from further analysis. In order to inspect the data for additional
irregularities and/or evidence of publication bias, a funnel plot (i.e. a
scatterplot of effect sizes on the x-axis and sample sizes on the y-axis) was
created and examined.
Research Question 1 (overall effectiveness) was then addressed by
calculating sample-size-weighted descriptive statistics for the d values from
the entire sample. Effects resulting from within-groups designs often appear
larger than those for between-groups because participants in the former serve
as their own control, thus reducing error variance and inflating the observed
effect (Plonsky and Oswald 2014). This difference can be corrected based on
pre–post correlations, but such data were not reported in any of the studies in
the sample. Therefore, effects from pre–post contrasts and between-groups
contrasts were analyzed separately.
Research Question 2 addressed variability in observed effects as a function
of variables suggested to moderate the effectiveness of PI. Study features (i.e.
different contexts, treatments, and outcomes) were therefore
treated as independent variables and used to group and calculate summary
effects which were, again, weighted by sample size. Although between-group
contrasts arguably provide a more theoretically and statistically
accurate depiction, the available sample of pre–post effects was much
larger and therefore more reliable/robust: K = 110 vs. 60. From a practical
standpoint, pre–post effects also provide insight regarding what might be
expected for PI as implemented in real classrooms. This phase of the
analysis is therefore based on both within- and between-group contrasts,
with an emphasis on the former.
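The coding and analysis steps described in the Methods (Cohen's d per outcome, averaging of multiple outcomes from one sample, and sample-size weighting across samples) can be illustrated as follows. This is a minimal sketch under stated assumptions: the pooled-standard-deviation form of d is one common definition and is assumed here because the article does not state which formula was used, all function names are hypothetical, and the numbers are invented for illustration rather than taken from the sample.

```python
import math


def cohens_d_between(mean_exp, sd_exp, n_exp, mean_ctl, sd_ctl, n_ctl):
    """Between-groups d: mean difference divided by the pooled SD (assumed formula)."""
    pooled_var = ((n_exp - 1) * sd_exp ** 2 + (n_ctl - 1) * sd_ctl ** 2) / (n_exp + n_ctl - 2)
    return (mean_exp - mean_ctl) / math.sqrt(pooled_var)


def combine_outcomes(d_values):
    """Average multiple effects derived from a single sample."""
    return sum(d_values) / len(d_values)


def n_weighted_mean(samples):
    """Sample-size-weighted mean; `samples` is a list of (d, n) pairs."""
    total_n = sum(n for _, n in samples)
    return sum(d * n for d, n in samples) / total_n


# Invented illustration values (not data from the meta-analysis).
d_a = cohens_d_between(12.0, 2.0, 10, 10.0, 2.0, 10)  # one outcome, sample A
d_b = combine_outcomes([0.4, 0.6])                     # two outcomes, sample B
print(n_weighted_mean([(d_a, 20), (d_b, 60)]))
```

Weighting by N lets larger samples, which carry less sampling error, contribute more to the summary effect than small ones.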
Figure 2: Funnel plot of effect sizes (d; x-axis) and sample sizes (N; y-axis) for
within-group contrasts [figure not reproduced; mean line labeled d = 0.83]
RESULTS
As described in the Methods, we created funnel plots to inspect the data for the
presence of publication bias or other irregularities. We see in Figures 2 and 3,
first of all, substantial variability in effect sizes, with greater spread at the
bottom of the figure where samples are smaller and sampling error is higher.
We also see that these effects are not spread equally on both sides of the mean
effect, particularly in the case of pre–post effects (Figure 2; unweighted
d = 0.83). Rather, larger effects (to the right of the mean) display much
wider variability than smaller ones (left of the mean). This difference in
spread is indicative of a bias toward statistically significant results. In the absence of bias (i.e. when observing a sample representative of the population of
effects), we would expect to see relative symmetry on both sides of the mean.
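The asymmetry described above can also be checked numerically rather than by eye. The sketch below compares how far the observed effects extend on each side of the unweighted mean; it is a rough illustration of the idea, not the procedure used in the study. The function name is hypothetical and the d values are invented.

```python
from statistics import mean


def funnel_reach(effect_sizes):
    """Return (left, right): how far effects extend below and above the unweighted mean."""
    center = mean(effect_sizes)
    left_reach = center - min(effect_sizes)
    right_reach = max(effect_sizes) - center
    return left_reach, right_reach


# Hypothetical d values, invented purely for illustration.
hypothetical_ds = [0.2, 0.5, 0.7, 0.8, 0.9, 1.6, 3.0]
left, right = funnel_reach(hypothetical_ds)
print(f"left reach = {left:.2f}, right reach = {right:.2f}")
```

Roughly equal reaches would be consistent with a symmetric funnel; a much longer right-hand reach matches the pattern the authors describe for the pre–post effects.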
The results for Research Question 1, which addressed the overall effects of
PI, are found in Table 1. Summary results from both within- and between-group designs (weighted by sample size) show that PI is indeed effective.
Observed effects on average and across the 95 per cent confidence intervals
demonstrate a medium-to-large and statistically significant effect (Plonsky and
Oswald 2014).
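The claim of statistical significance rests on the 95 per cent confidence intervals excluding zero. A minimal sketch of that check, assuming a normal-approximation interval (mean ± 1.96 × SE), is given below; the article does not state the exact procedure used, so the computed bounds may differ slightly from the published ones, in part because the reported SE is rounded. Function names are hypothetical.

```python
def confidence_interval(mean_d, se, z=1.96):
    """Normal-approximation interval: mean_d - z * se to mean_d + z * se (assumed method)."""
    return mean_d - z * se, mean_d + z * se


def excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


# Within-group summary values as reported in Table 1 (M = 0.89, SE = 0.02).
lower, upper = confidence_interval(0.89, 0.02)
print(round(lower, 2), round(upper, 2), excludes_zero(lower, upper))
```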
Whereas Research Question 1 was concerned with the overall/summary
effects of PI, the focus of Research Question 2 was on potential moderators
of those effects. In other words, this phase of the analysis sought to examine
variability across the sample as a function of different (i) contexts, (ii) treatments (including targeted linguistic features), and (iii) outcome types found in
studies of PI.
Figure 3: Funnel plot of effect sizes (d; x-axis) and sample sizes (N; y-axis) for
between-group contrasts [figure not reproduced; mean line labeled d = 0.69]
Table 1: Overall results for the effectiveness of L2 pronunciation instruction
Contrast/design   K     M (d)   SE     95 per cent CIs
                                       Lower   Upper
Within-group      110   0.89    0.02   0.85    0.94
Between-group     60    0.80    0.02   0.77    0.81
Tables 2 and 3 present the results of moderator (or subgroup) analyses for
contextual variables. Confidence intervals for numerous subgroups here often
do not overlap, indicating that differences between their effects are statistically
significant. Moreover, several patterns among within-group contrasts are
worth noting such as larger effects in second-language settings, high schools,
and in studies with both beginner and advanced learners (as opposed to intermediate ones). In contrast to what we might expect, laboratory-based studies
produced smaller effects than those carried out in classrooms. However, the
opposite pattern was found in between-groups contrasts (d = 0.79 in classrooms
vs. 0.95 in laboratories; see Table 3).
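The non-overlap criterion used for these subgroup comparisons can be sketched as below, using the between-group classroom and laboratory intervals reported in Table 3. The function name is hypothetical. Non-overlapping 95 per cent intervals are a conservative indicator of a reliable difference; overlapping intervals do not by themselves show that a difference is absent.

```python
def intervals_overlap(a, b):
    """Return True if two (lower, upper) intervals share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


# Between-group 95 per cent CIs for study context, as reported in Table 3.
classroom = (0.75, 0.84)   # d = 0.79
laboratory = (0.89, 1.00)  # d = 0.95
print(intervals_overlap(classroom, laboratory))  # False
```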
Table 2: Moderator analyses across contexts (within-group contrasts)
Grouping variables and values   k(a)   M (d)   SE     95 per cent CIs
                                                      Lower   Upper
Setting
  Second language               39     1.01    0.04   0.94    1.08
  Foreign language              71     0.83    0.03   0.78    0.89
Institution
  High school                   10     1.42    0.07   1.30    1.56
  University                    86     0.83    0.02   0.78    0.87
  Language institute            10     0.66    0.05   0.56    0.76
Context
  Classroom                     71     0.95    0.03   0.89    1.01
  Laboratory                    32     0.84    0.03   0.78    0.89
Proficiency
  Beginner                      24     1.27    0.06   1.15    1.40
  Intermediate                  42     0.55    0.02   0.51    0.58
  Advanced                      16     1.19    0.06   1.07    1.32
(a) Number of samples in subgroups. In a small number of cases (e.g. elementary schools), subgroup results in this and the following tables were excluded due to very small cell sizes.
A number of treatment-related variables were also examined for their potential
to moderate the effects of PI (see Tables 4 and 5). Longer interventions
were found to produce substantially larger effects than shorter ones in both
within- and between-group contrasts. Treatments that included feedback
also outperformed those without, particularly in
between-group designs. Treatments provided by a computer and
those otherwise involving technology such as spectrograms, however,
yielded smaller effects than those provided by a teacher or a
teacher–researcher and those without technology, respectively, in both
within- and between-group designs. Effects across targeted linguistic features
appear relatively homogeneous. In within- and between-group contrasts, word
stress, sentence stress, rhythm, and PI on both segmentals and suprasegmentals
(compared with segmentals or suprasegmentals on their own) present
exceptions wherein somewhat larger effects were produced.
The third and final set of variables we examined as potential moderators of
the effects of PI were different types of outcome measures (Tables 6 and 7). We
first compared effects resulting from outcomes that involved free
vs. controlled production, the latter of which yielded much larger effects in
both within- and between-group contrasts. Controlled production was also by far
the preferred outcome type for both within- and between-groups designs (k = 75 of
110 samples and 44 of 60). The patterns for rater effects and different item
lengths in outcome measures are also strong. Effects for posttreatment production
assessed by native speakers of the target language were approximately twice as
large as those rated by nonnative speakers in within-group contrasts. With
respect to different item types, effects for within- and between-group contrasts
were quite different. Whereas effects in the former increased with item
length, the opposite pattern was observed in the latter (i.e.
words > sentences > discourse).
Table 3: Moderator analyses across contexts (between-group contrasts)
Grouping variables and values   k(a)   M (d)   SE     95 per cent CIs
                                                      Lower   Upper
Setting
  Second language               16     0.35    0.02   0.31    0.38
  Foreign language              44     0.98    0.02   0.94    1.03
Institution
  High school                   7      1.19    0.09   1.01    1.37
  University                    46     0.77    0.02   0.73    0.81
  Language institute            5      1.09    0.03   1.03    1.15
Context
  Classroom                     38     0.79    0.02   0.75    0.84
  Laboratory                    17     0.95    0.03   0.89    1.00
Proficiency
  Beginner                      18     0.97    0.04   0.90    1.04
  Intermediate                  17     0.80    0.02   0.76    0.85
  Advanced                      7      −0.01   0.02   −0.06   0.03
DISCUSSION
This study examined the overall effects of PI and potential moderators of those
effects. The discussion that follows summarizes and interprets the results,
contextualizing the findings with respect to the domain in question as well as
more generally within instructed SLA. We also take advantage of the meta-analytic
data set to critique and suggest methodological improvements in PI
research.
In terms of overall effects of PI, the (weighted) within-group results showed
that the learners who received instructional treatments improved by 0.89
standard deviation units in comparison with their pretreatment performance;
the between-group analyses demonstrated that learners in experimental
groups outperformed those in control groups by 0.80 standard deviation
units. Because the confidence intervals of the effect sizes do not cross zero,
both can be said to be statistically reliable. Nevertheless, individual effects
across the sample vary (from −0.36 to 3.98 and −0.66 to 3.12 for within-
and between-group contrasts, respectively). It is also interesting to note that
the range in observed effects has increased over time. The range of effects from
1982 to 1989, 1990 to 1997, 1998 to 2005, and 2006 to 2013 was 0.93, 1.83, 1.90,
and 4.34 for within-group contrasts and 0.66, 0.77, 1.67, and 3.78 for
between-group contrasts. Greater variability in results may be indicative of
interest and empirical efforts addressing an increasingly larger variety of
pronunciation features and instructional approaches.
According to Plonsky and Oswald’s (2014) scale for interpreting d
values in L2 research, the overall findings of this study represent medium to
large effects. Compared with meta-analytic findings in other areas of instructed
SLA, these results show that instruction on pronunciation can be just as effective
as (or more effective than) instruction on vocabulary, grammar, and pragmatics
(see Figure 1).
Table 4: Moderator analyses across treatment types (within-group contrasts)
Grouping variables and values   k      M (d)   SE     95 per cent CIs
                                                      Lower   Upper
Length
  Short (≤4.25 h)               44     0.62    0.02   0.57    0.67
  Long (>4.25 h)                40     1.32    0.05   1.23    1.41
Treatment provider
  Teacher                       53     0.85    0.03   0.79    0.91
  Researcher                    8      0.43    0.05   0.33    0.53
  Teacher–researcher            20     1.35    0.07   1.21    1.50
  Computer                      29     0.75    0.03   0.70    0.81
Target features
  Vowels                        53     0.91    0.03   0.82    0.91
  Consonants                    62     1.04    0.03   0.98    1.11
  Stress (word)                 22     1.09    0.05   0.99    1.18
  Stress (sentence)             19     0.96    0.05   0.87    1.05
  Intonation                    34     0.86    0.03   0.80    0.91
  Segmentals                    78     0.89    0.03   0.84    0.95
  Suprasegmentals               43     1.03    0.04   0.95    1.11
  Segmentals + suprasegmentals  22     1.00    0.06   0.88    1.11
  Rhythm                        13     0.98    0.06   0.86    1.10
Use of technology
  No                            69     0.96    0.03   0.90    1.02
  Yes                           38     0.76    0.03   0.71    0.81
Feedback
  No                            26     0.89    0.06   0.78    1.01
  Yes                           80     0.92    0.02   0.87    0.96
358
THE EFFECTIVENESS OF SECOND LANGUAGE PRONUNCIATION INSTRUCTION
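As a rough check on how the interval columns of the moderator tables relate to M (d) and SE, the sketch below assumes the intervals are normal-approximation 95 per cent CIs (M ± 1.96 × SE). The article does not state the exact procedure, so small rounding differences from the printed bounds are to be expected. The function names `ci95` and `excludes_zero` are illustrative and do not come from the article.

```python
from statistics import NormalDist


def ci95(mean_d, se, level=0.95):
    """Normal-approximation confidence interval for a mean effect size."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95 per cent
    return mean_d - z * se, mean_d + z * se


def excludes_zero(lower, upper):
    """True when the interval lies entirely above or entirely below zero."""
    return lower > 0 or upper < 0


# Values from Table 4, Length, Long (>4.25 h): M (d) = 1.32, SE = 0.05
lower, upper = ci95(1.32, 0.05)
print(round(lower, 2), round(upper, 2), excludes_zero(lower, upper))
```

An interval that excludes zero corresponds to what the Discussion calls a statistically reliable effect.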
Table 5: Moderator analyses across treatment types (between-group contrasts)

Grouping variables and values       k    M (d)    SE      95 per cent CIs
                                                          Lower    Upper
Length
  Short (≤4.25 h)                  25    0.73     0.02    0.68     0.77
  Long (>4.25 h)                   22    0.95     0.03    0.89     1.01
Treatment provider
  Teacher                          35    0.89     0.02    0.85     0.94
  Researcher                        5    0.86     0.02    0.83     0.90
  Teacher-researcher                8    0.94     0.06    0.81     1.06
  Computer                         12    0.24     0.02    0.19     0.28
Target features
  Vowels                           27    0.99     0.02    0.95     1.04
  Consonants                       37    0.79     0.02    0.75     0.84
  Stress (word)                    13    1.01     0.05    0.91     1.12
  Stress (sentence)                10    1.39     0.07    1.26     1.53
  Intonation                       15    0.38     0.05    0.73     0.94
  Segmentals                       46    0.87     0.02    0.84     0.93
  Suprasegmentals                  24    1.05     0.03    0.99     1.11
  Segmentals + suprasegmentals     13    1.28     0.04    1.20     1.36
  Rhythm                            8    1.65     0.06    1.53     1.77
Use of technology
  No                               44    0.87     0.02    0.83     0.91
  Yes                              16    0.53     0.04    0.46     0.60
Feedback
  No                               13    0.62     0.02    0.57     0.66
  Yes                              44    0.91     0.02    0.86     0.96

However, in light of evidence in this study in favor of a bias toward statistically
significant results observed in the funnel plot, we might consider the overall
findings to overestimate true population effects.
Drawing on theoretical and practical concerns, we also examined variability in the effects of PI as a function of three categories of potential moderating variables: contexts, treatments, and outcomes. Among other results across
contexts of PI, learner age (level of education) was found to be related to
treatment effects. At first glance, these findings might appear to support
what we would expect for the role of age and L2 pronunciation development. A more nuanced interpretation of this result, however, would also
consider the fact that age effects are generally much stronger in second vs.
foreign language contexts, where exposure is limited almost exclusively to
classroom instruction (Trofimovich et al. 2009; Muñoz 2011). Furthermore,
because of the lack of available evidence in primary studies, these results are
not based on studies with learners within what is typically considered to be
the critical period (0 to around 12). Our findings with respect to age and the
effects of PI should therefore not be considered conclusive.
Unlike the findings for age, no clear pattern was found for the effects of PI
across proficiency levels. Practically speaking, these findings suggest that learners at different proficiencies can all benefit from PI. The lack of a clear relationship between proficiency and the effects of PI might also be attributed, at
least in part, to the challenges inherent in reliably and validly identifying
proficiency at the primary and meta-analytic levels. As in previous meta-analyses, we were limited in our ability to determine and code for proficiency by
what primary authors reported.
Replicating the findings from several previous meta-analyses of instructed
SLA (e.g. Li 2010; Plonsky 2011), our results show that laboratory-based PI
may produce stronger effects than when carried out in intact classes. Both,
however, are effective. Interestingly, the choice of setting for PI research appears to have changed in this domain. Studies of PI have migrated over time
from laboratories to classrooms, a shift often seen in other social sciences
where experimental effects are explored in low-stakes environments before
testing them in applied contexts such as classrooms (Oswald and Plonsky
2010).

Table 6: Moderator analyses across outcome types (within-group contrasts)

Grouping variables and values       k    M (d)    SE      95 per cent CIs
                                                          Lower    Upper
Outcome type
  Controlled                       75    0.96     0.03    0.90     1.02
  Free                             18    0.65     0.03    0.59     0.71
  Both                             16    0.86     0.05    0.77     0.95
Rater
  Nonnative speaker(s)             10    0.44     0.03    0.38     0.49
  Native speaker(s)                73    0.93     0.03    0.87     0.98
Outcome item length
  Words                            30    0.62     0.04    0.55     0.70
  Sentences                        22    0.92     0.04    0.84     1.01
  Discourse                        27    1.23     0.05    1.13     1.34
  Multiple                         27    0.79     0.03    0.68     0.78
Table 7: Moderator analyses across outcome types (between-group contrasts)

Grouping variables and values       k    M (d)    SE      95 per cent CIs
                                                          Lower    Upper
Outcome type
  Controlled                       44    0.96     0.03    0.89     1.00
  Free                              6    0.37     0.04    0.30     0.44
  Multiple                         10    0.61     0.02    0.57     0.66
Rater
  Nonnative speaker(s)              9    0.86     0.06    0.74     0.97
  Native speaker(s)                39    0.70     0.02    0.66     0.74
Outcome item length
  Words                            14    1.16     0.04    1.08     1.25
  Sentences                        18    0.87     0.04    0.80     0.95
  Discourse                        10    0.23     0.03    0.18     0.29
  Multiple                         17    0.68     0.02    0.65     0.71

Our results also reveal several trends across different types of PI treatments.
First, as we might expect, longer treatments (i.e. longer than the median intervention of 4.25 h) generally produced larger effects. This finding confirms
Saito’s (2012) synthesis, which found that one of only two studies not showing
an effect of PI included a treatment of only 15–30 min (Macdonald et al. 1994;
Note: This study was not included in our analysis owing to missing/unreported
data). It also replicates results of previous meta-analyses of instructed SLA
examining the effects of treatment length (e.g. Jeon and Kaya 2006).
Because of the potential for this body of research to inform L2 pedagogy, the
practical significance of this finding, along with many others, should be given
critical consideration. (See Plonsky and Oswald 2014, for a discussion on
weighing practical significance against resources such as time and experimental manipulation required to induce effects.)
Another treatment feature associated with larger effects is the provision of
feedback. Hundreds of primary studies and 18 meta-analyses of feedback research (see Plonsky and Brown 2014) have shown positive effects for
feedback. This massive body of research, however, has almost exclusively considered feedback on lexical and morphosyntactic errors. As a point of theoretical interest, the results of this study suggest that previous findings with respect
to feedback are robust to the domain of L2 pronunciation as well. From a
practical standpoint, these results also show that including feedback in a program of PI can improve its effectiveness (Saito and Lyster 2012). Given the
robustness of feedback effects found for other target domains (i.e. grammar
and vocabulary), this finding is perhaps not surprising. It is, however, worth
noting that prior to this study aggregate findings for this effect were lacking in
the synthetic literature.
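One informal way to read a moderator contrast such as feedback vs. no feedback is to check whether the two subgroup confidence intervals overlap. The sketch below applies that heuristic to the between-group Feedback rows of Table 5. Treating non-overlap as evidence of a difference is an assumption made here for illustration, not a test the article reports, and the function name `intervals_overlap` is hypothetical.

```python
def intervals_overlap(a, b):
    """Return True if two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95 per cent CIs from Table 5 (between-group contrasts), Feedback rows
no_feedback = (0.57, 0.66)    # Feedback: No,  M (d) = 0.62
with_feedback = (0.86, 0.96)  # Feedback: Yes, M (d) = 0.91

print(intervals_overlap(no_feedback, with_feedback))  # False
```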
The opposite was found for the use of technology and computer-delivered
PI. Studies that provided PI using technology, whether entirely or in part,
produced smaller effects than those that relied exclusively on human-delivered
instruction. The lack of adaptability and perceptual accuracy in computers
compared to human teachers, and perhaps consequently their ability to provide appropriate feedback as well, may partly explain this finding. Although
the accessibility of computer-delivered PI has great potential, there is clearly a
need for research seeking to improve technology-enhanced instructional
materials.
Sorely missing from our results is a more fine-grained analysis of the effects
of different types of pedagogical practices. The norm in this body of research
was to simply refer to a general approach such as Celce–Murcia et al.’s (1996)
five stages for PI. While reading the Methods sections, we were often left
wondering about the details of instructional materials and activities. For example, did they consist of decontextualized drills? Was PI embedded in meaning-oriented tasks? And to what extent did the pedagogy match researchers’
efforts to assess learner development? Future studies would do well to include
greater procedural detail in written reports.
This study also examined the relative effects of PI across a range of targeted
linguistic features. Much like Saito (2012), our study found relatively homogenous effects of PI on different features. Our results also add precision to
Saito’s. That is, the vote-count procedure he employed indicated only a consistently positive direction of effects for instruction on segmentals and suprasegmentals. The present study goes further by providing both the direction
(positive) and magnitude of such effects, which are relatively strong and
stable. We view these findings positively, implying that PI can be effective
for a wide variety of features. Furthermore, in light of the larger effects
observed when PI targeted both segmental and suprasegmental features (as
opposed to either one independently), we echo other scholars’ recommendation that L2 practitioners consider including a variety of features in their
curricula (cf. Kang et al. 2010). Rather than focus exclusively on the often-debated segmental/suprasegmental distinction, the results of our study support
an approach that treats sets of features that align with learners’ needs, backgrounds, and first languages (Saito 2012).
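The contrast drawn above between vote-counting (direction only) and meta-analysis (direction and magnitude) can be sketched with a handful of made-up effect sizes. The d values below are hypothetical and are not taken from Saito (2012) or from this study, and a simple unweighted mean stands in for the weighted averages a full meta-analysis would use.

```python
from statistics import fmean

# Hypothetical Cohen's d values for five primary studies (illustration only)
effects = [0.40, 1.10, 0.75, -0.10, 0.95]


def vote_count(ds):
    """Count studies with positive vs. non-positive effects (direction only)."""
    positive = sum(1 for d in ds if d > 0)
    return positive, len(ds) - positive


# Vote-counting says only that most studies point in a positive direction
print(vote_count(effects))  # (4, 1)

# Averaging the effect sizes also says how large the effect is
mean_effect = fmean(effects)
print(round(mean_effect, 2))  # 0.62
```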
In addition to different contexts and treatments in PI research, we also
examined the moderating effects of different outcome types. Again, as in previous meta-analyses of instructed SLA research (e.g. Norris and Ortega 2000;
Spada and Tomita 2010), our findings show that the choice of outcome measure can affect study results. Specifically, studies employing more controlled
outcome measures/items produced larger effects than more open-ended ones.
Although the latter are likely more representative of learners’ true ability, the
former are perhaps more similar to practice activities carried out during
experimental treatments. Further complicating this matter is the fact that most
studies relied on outcome measures of a very controlled nature (e.g. reading
lists of individual words or sentences). In order to get a fuller understanding of treatment effects, future studies of PI should follow the recent
trend in L2 vocabulary and grammar research by including different types of
outcome measures (see Mackey and Goo 2007; Spada and Tomita 2010).
Results from less controlled instruments may be more difficult to analyze,
but practical challenges are a small price to pay for authenticity and ecological
validity.
One final and critical consideration relevant to this discussion is instrument
reliability in PI research. The main issue here is not, as far as we can tell, low
reliability. Rather, it is the scarcity of reliability estimates in study reports
(available in only 47 per cent of the sample), which limits our ability to accurately interpret
study results. Although there is clearly room for improvement here, the presence of reliability estimates in PI research is greater than in several other previously meta-analyzed domains of SLA, where they have been found
in anywhere from 6 per cent (L2 practice; Nekrasova and Becker 2009) to 64
per cent (L2 interaction; Plonsky and Gass 2011) of studies.
Critiques and suggestions for future research
A meta-analytic data set not only enables the researcher to critique the domain
in question; it is his/her duty to do so. We therefore conclude our discussion
with a list of limitations observed in our sample. In order to move the domain
toward areas that merit attention as well as improved methodological practice,
each critique is accompanied by suggestions for future research.
Critique 1: The validity of PI research both in individual studies and in
the aggregate is threatened by the use of very small samples and correspondingly low statistical power. A post hoc power analysis based on the
results of this study shows observed power within and between groups to
be just 0.66 and 0.55, respectively. Underpowered studies not only limit our
ability to detect true effects, they also lead to an uneven depiction of population effects (via publication bias) when summarized at the secondary level.
Fortunately, there is a relatively (if also deceptively) simple two-part solution to this problem. First, PI research, like much of SLA, needs larger samples.
In some cases, owing to practical constraints of recruiting participants, this will
require fewer subgroup comparisons. However, more reliable results are certainly preferable to a greater number of less reliable ones. And secondly, PI
researchers ought to move away from the dichotomous thinking embodied by
null hypothesis significance testing, focusing instead on point estimates and
their practical significance as expressed by effect sizes (Norris and Ortega 2006;
Plonsky 2013). Considering only 17 studies (20 per cent of the sample) reported effect sizes, and almost none provided a useful interpretation of those
effects, change in this direction may be slow.
Critique 2: Sampling in PI research is not only underpowered, it lacks diversity, particularly in terms of different ages, first languages, and target
languages. Whereas the previous critique poses a threat to internal validity,
this problem puts into question the external validity or generalizability of PI
research. Only 4 of 86 primary reports in our sample involved participants younger
than 13 years old. Considering many pronunciation errors are L1 and L2 specific (e.g. Derwing and Munro 2013), it is perhaps even more concerning that
English was either the participants’ first language or the target language in 83
of the 86 studies.
The way forward here for PI researchers is to consider recruiting younger
participants as well as learners other than those whose L1 or L2 is English.
Doing so may require reaching out to and making connections with teachers
and researchers outside of our home institutions.
Critique 3: Three features of PI designs are in need of improvement. First,
only 14 per cent of the sample examined the longevity of effects by means of a
delayed posttest. Interestingly, studies in this sample that included delayed
posttests in their design produced larger effects, a pattern also observed in
Plonsky (2011, 2013) and Plonsky and Gass (2011). In order to determine
the practical significance of PI in ‘real-world’ settings, where learning matters
beyond a brief intervention, delayed posttests must be incorporated into future
studies. This practice would also contribute to theoretical discussions related to
the durability of instructional treatments. Secondly, pre–post designs far outnumber controlled experiments. This trend is likely due in part to the use of
intact samples where it may be inappropriate or even unethical to withhold
treatment for the sake of experimental control. Although pre–post designs can
help guide L2 practitioners’ expectations, absolute effects can only be measured through more rigorously controlled designs. Third, PI research relies too
heavily (primarily, even) on controlled outcome measures, thus again limiting the external validity of study results. There is clearly a need for greater
variety of outcome measures in PI research.
Critique 4: Two substantive issues observed in this body of research are also
worth noting. First is a lack of attention to a number of phonetic and phonological features such as articulation, elision, linking, and stress. And secondly,
the interactions between different treatments and learner backgrounds (i.e.
aptitude-treatment interaction research, or ATI) present a potential source of
findings in PI with relevance for L2 theory and practice. To date, however, a
very small number of studies have examined such interactions (e.g. Elliot
1997), unlike the rapidly growing body of ATI research in the realm of grammar instruction (e.g. Li 2013).
SUPPLEMENTARY DATA
Supplementary material is available at Applied Linguistics online.
FUNDING
This work was supported in part by the Hankuk University of Foreign Studies Research Fund given to Junkyu Lee.
Conflict of interest statement. None declared.
NOTES
1 ‘Vote-counting’ is a type of synthesis wherein the researcher, as in meta-analysis, systematically collects and codes primary studies in a given domain. Unlike meta-analysis, which produces a quantitative indication of the magnitude of the relationship in question, however, the results of vote-counting speak only to the direction of effects.
2 Our sincere thanks to the authors of the five reports who provided us with the data we needed to include their studies in our analysis: Walcir Cardoso; Manuela Gonzalez–Bueno and Marcela Quintana Lara; Rebecca Hincks; Gillian Lord; and Eleni Tsiartsioni.
REFERENCES
Bertram, S. 2008. A Case Study of the Noticing-Reformulation Technique. Unpublished MA thesis. Hamline University.
Celce–Murcia, M., D. Brinton, and J. Goodwin. 1996. Teaching Pronunciation: A Reference for Teachers of English to Speakers of Other Languages. Cambridge University Press.
Chiu, Y.-H. 2013. ‘Computer-assisted second language vocabulary instruction: A meta-analysis,’ British Journal of Educational Technology 44: E52–6.
Darcy, I., D. Ewert, and R. Lidster. 2012. ‘Bringing pronunciation instruction back into the classroom. An ESL teachers’ pronunciation “toolbox”’ in J. Levis and K. LeVelle (eds): Proceedings of the 3rd Pronunciation in Second Language Learning and Teaching Conference. Iowa State University, pp. 93–108.
Derwing, T. M. 2003. ‘What do ESL students say about their accents?,’ The Canadian Modern Language Review 59: 547–66.
Derwing, T. M. and M. J. Munro. 2005. ‘Second language accent and pronunciation teaching: A research-based approach,’ TESOL Quarterly 39: 379–97.
Derwing, T. M. and M. J. Munro. 2013. ‘The development of L2 oral language skills in two L1 groups: A 7-year study,’ Language Learning 63: 163–85.
Derwing, T. M., M. Munro, and G. Wiebe. 1998. ‘Evidence in favor of a broad framework for pronunciation instruction,’ Language Learning 48: 393–410.
Derwing, T. M., R. I. Thompson, and M. J. Munro. 2006. ‘English pronunciation and fluency development in Mandarin and Slavic speakers,’ System 34: 183–93.
Elliot, A. S. 1997. ‘On the teaching and acquisition of pronunciation within a communicative approach,’ Hispania 80: 95–109.
Flege, J. E., G. H. Yeni-Komshian, and S. Liu. 1999. ‘Age constraints on second-language acquisition,’ Journal of Memory and Language 41: 78–104.
Goo, J., G. Granena, Y. Yilmaz, and M. Novella. in press. ‘Implicit and explicit instruction in L2 learning: Norris & Ortega (2000) revisited and updated’ in P. Rebuschat (ed.): Implicit and Explicit Learning of Languages. John Benjamins.
Gordon, J. and I. Darcy. 2012. ‘The development of comprehensible speech in L2 learners: Effects of explicit pronunciation instruction on segmentals and suprasegmentals,’ Paper presented at AAAL. Boston, MA.
Hahn, L. D. 2004. ‘Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals,’ TESOL Quarterly 38: 201–23.
Hardison, D. M. 2005. ‘Contextualized computer-based L2 prosody training: Evaluating the effects of discourse context and video input,’ CALICO Journal 22: 175–90.
Ingels, S. 2010. ‘The effects of self-monitoring strategy use on the pronunciation of learners of English’ in J. Levis and K. LeVelle (eds): Proceedings of the 1st Pronunciation in Second Language Learning and Teaching Conference. Iowa State University, pp. 67–89.
Isaacs, T. and R. I. Thomson. 2013. ‘Rater experience, rating scale length, and judgments of L2 pronunciation: Revisiting research conventions,’ Language Assessment Quarterly 10: 135–59.
Isaacs, T. and P. Trofimovich. 2012. ‘Deconstructing comprehensibility: Identifying the Linguistic Influences on Listeners’ L2 Comprehensibility Ratings,’ Studies in Second Language Acquisition 34: 475–505.
Jeon, E. H. and T. Kaya. 2006. ‘Effects of L2 instruction on interlanguage pragmatic development: A meta-analysis’ in J. M. Norris and L. Ortega (eds): Synthesizing Research on Language Learning and Teaching. John Benjamins, pp. 165–211.
Kang, O. 2010. ‘Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness,’ System 38: 301–15.
Kang, O. 2012. ‘Impact of rater characteristics on ratings of international teaching assistants’ oral performance,’ Language Assessment Quarterly 9: 249–69.
Kang, O., D. Rubin, and L. Pickering. 2010. ‘Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English,’ Modern Language Journal 94: 554–66.
Krashen, D. 1982. Principles and Practices in Second Language Acquisition. Pergamon.
Lee, S. T. 2008. ‘Teaching pronunciation using computer-assisted learning software: An action research studies in an institute of technology in Taiwan,’ EdD dissertation, Australian Catholic University.
Levis, J. 2005. ‘Changing contexts and shifting paradigms in pronunciation teaching,’ TESOL Quarterly 39: 369–77.
Li, S. 2010. ‘The effectiveness of corrective feedback in SLA: A meta-analysis,’ Language Learning 60: 309–65.
Li, S. 2013. ‘The interactions between the effects of implicit and explicit feedback and individual differences in language analytic ability and working memory,’ Modern Language Journal 97: 634–54.
Lord, G. 2008. ‘Podcasting communities and second language pronunciation,’ Foreign Language Annals 41: 374–89.
MacDonald, D., G. Yule, and M. Powers. 1994. ‘Attempts to improve English L2 pronunciation: The variable effects of different types of instruction,’ Language Learning 44: 75–100.
Mackey, A. and J. Goo. 2007. ‘Interaction research in SLA: A meta-analysis and research synthesis’ in A. Mackey (ed.): Conversational Interaction in Second Language Acquisition: A Collection of Empirical Studies. Oxford University Press, pp. 407–51.
Miller, J. S. 2012. ‘Teaching French pronunciation with phonetics in college-level beginner French course,’ The NECTFL Review 69: 47–68.
Miller, J. S. 2013. ‘Improving oral proficiency by raising metacognitive awareness with recordings’ in J. Levis and K. LeVelle (eds): Proceedings of the 4th Pronunciation in Second Language Learning and Teaching Conference. Iowa State University, pp. 101–11.
Muñoz, C. 2011. ‘Input and long-term effects of starting age in foreign language learning,’ International Review of Applied Linguistics in Language Teaching 71: 197–220.
Nekrasova, T. and T. Becker. 2009. ‘Effectiveness of practice: A research synthesis and quantitative meta-analysis,’ Unpublished manuscript.
Norris, J. M. and L. Ortega. 2000. ‘Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis,’ Language Learning 50: 417–528.
Norris, J. M. and L. Ortega. 2006. ‘The value and practice of research synthesis for language learning and teaching’ in J. M. Norris and L. Ortega (eds): Synthesizing Research on Language Learning and Teaching. John Benjamins, pp. 3–50.
Oswald, F. L. and L. Plonsky. 2010. ‘Meta-analysis in second language research: Choices and challenges,’ Annual Review of Applied Linguistics 30: 85–110.
Plonsky, L. 2011. ‘The effectiveness of second language strategy instruction: A meta-analysis,’ Language Learning 61: 993–1038.
Plonsky, L. 2013. ‘Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research,’ Studies in Second Language Acquisition 35: 655–87.
Plonsky, L. and D. Brown. 2015. ‘Domain definition and search techniques in meta-analyses of L2 research (Or why 18 meta-analyses of feedback have different results),’ Second Language Research 31: 267–78.
Plonsky, L. and S. Gass. 2011. ‘Quantitative research methods, study quality, and outcomes: The case of interaction research,’ Language Learning 61: 325–66.
Plonsky, L. and F. L. Oswald. 2012. ‘How to do a meta-analysis’ in A. Mackey and S. M. Gass (eds): Research Methods in Second Language Acquisition: A Practical Guide. Wiley Blackwell, pp. 275–95.
Plonsky, L. and F. L. Oswald. 2014. ‘How big is ‘big’? Interpreting effects sizes in L2 research,’ Language Learning 64: 878–91.
Plonsky, L., J. Egbert, and G. T. LaFlair. in press. ‘Bootstrapping in applied linguistics: Assessing its potential using shared data,’ Applied Linguistics, doi:10.1093/applin/amu001.
Saito, K. 2012. ‘Effects of instruction on L2 pronunciation development: A synthesis of 15 quasi-experimental intervention studies,’ TESOL Quarterly 46: 842–54.
Saito, K. 2014. ‘Experienced teachers’ perspectives on priorities for improved intelligible pronunciation: The case of Japanese learners of English,’ International Journal of Applied Linguistics 24: 250–27.
Saito, K. and R. Lyster. 2012. ‘Effects of form-focused instruction and corrective feedback on L2 pronunciation development of /r/ by Japanese learners of English,’ Language Learning 62: 595–633.
Shintani, N. 2015. ‘The effectiveness of processing instruction on L2 grammar acquisition: A meta-analysis,’ Applied Linguistics 36/3: 306–25.
Shintani, N., S. Li, and R. Ellis. 2013. ‘Comprehension-based versus production-based grammar instruction: A meta-analysis of comparative,’ Language Learning 63/2: 296–329.
Spada, N. and Y. Tomita. 2010. ‘Interactions between type of instruction and type of language feature: A meta-analysis,’ Language Learning 60: 263–308.
Tokumoto, M. and M. Shibata. 2011. ‘Asian varieties of English: Attitudes towards pronunciation,’ World Englishes 30: 392–408.
Trofimovich, P., P. M. Lightbown, and R. H. Halter. 2009. ‘Comprehension-based practice: The development of L2 pronunciation in a listening and reading program,’ Studies in Second Language Acquisition 31: 609–39.
Tsiartsioni, E. 2010. ‘The effectiveness of pronunciation teaching to Greek state school students’ in A. Psaltou-Joycey and M. Mattheoudaki (eds): Selected Papers from the Proceedings of the 14th International Conference of the Greek Applied Linguistics Association. GALA, pp. 429–46.
VanPatten, B. 2002. ‘Processing instruction: An update,’ Language Learning 52: 755–803.
Wa-Mbaleka, S. 2006. A Meta-analysis Investigating the Effects of Reading on Second Language Vocabulary Learning. Unpublished doctoral dissertation, Northern Arizona University.
Won, M. 2008. The Effects of Vocabulary Instruction on English Language Learners: A Meta-analysis. Unpublished doctoral dissertation, Texas Tech University.
Yates, K. 2003. ‘Teaching linguistic mimicry to improve second language pronunciation,’ Masters thesis, University of North Texas.
Session 14: Research Synthesis (Meta-analysis)
Research Literacy (A&HL 5575)
Introduction
• Meta-analysis is a formalized statistical method for averaging effects found across a set of studies or scientific observations.
• A meta-analysis calculates the mean and variance of a set of numbers such as study correlations (r) or standardized mean differences (d).
• In the field of SLA, meta-analysis was not formally introduced until relatively recently (e.g., Ross, 1998, and then Norris & Ortega, 2000).
Vafaee, A&HL5575 (Class 14)
2
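The "mean and variance of a set of numbers" can be sketched as follows, assuming sample-size (N) weighting of each study's d. The slide does not specify a weighting scheme, so this choice, the function name, and the three made-up studies are all assumptions for illustration.

```python
def weighted_mean_and_variance(effects, sample_sizes):
    """Sample-size-weighted mean and variance of a set of effect sizes."""
    if len(effects) != len(sample_sizes) or not effects:
        raise ValueError("effects and sample_sizes must be non-empty and equal in length")
    total_n = sum(sample_sizes)
    mean = sum(n * d for d, n in zip(effects, sample_sizes)) / total_n
    variance = sum(n * (d - mean) ** 2 for d, n in zip(effects, sample_sizes)) / total_n
    return mean, variance


# Three hypothetical studies: Cohen's d and total sample size for each
effects = [0.2, 0.8, 1.4]
sample_sizes = [10, 20, 10]

mean, variance = weighted_mean_and_variance(effects, sample_sizes)
print(round(mean, 2), round(variance, 2))  # 0.8 0.18
```

Weighting by sample size lets larger studies, which estimate their effects more precisely, count for more in the average.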
Why Meta-Analysis
• Narrative reviews often do not account for sampling error variance.
• Narrative reviews generally over-rely on the ritual of null hypothesis significance testing (NHST).
• Although experts have a vast storehouse of discipline-specific knowledge, as humans, they are fallible.
• Meta-analysis can answer focused substantive research questions such as “What is the overall effect of a particular treatment or intervention (e.g., reading strategy instruction on second language (L2) reading ability; Taylor, Stevens, & Asher, 2006)?” and “How strong is the relationship between two or more constructs (e.g., motivation and L2 achievement; Masgoret & Gardner, 2003)?”
How to Do a Meta-Analysis
1) Defining the Research Domain
– Consider several corrective feedback meta-analyses:
  • Li (2010): the effects of oral or computer-mediated feedback on any type of L2 feature
  • Russell and Spada (2006): restricted their meta-analysis to grammatical forms
  • Lyster & Saito (2010): classroom contexts
  • Truscott (2007) and Poltavtchenko & Johnson (2009): L2 writing
  • Norris & Ortega (2000) and Mackey & Goo (2007): meta-analyzed the effects of different types of error correction as subsets of larger syntheses on L2 instruction and L2 interaction
– The scope of research covered is sometimes guided by the statistical question regarding the minimum number of primary studies required for an appropriate meta-analysis.
How to Do a Meta-Analysis
2) Conducting the Literature Search
• The most popular databases among L2 meta-analysts:
  – Education Resources Information Center (ERIC; http://www.eric.ed.gov)
  – Linguistics and Language Behavior Abstracts (LLBA; http://www.csa.com/factsheets/llba-set-c.php)
  – PsycINFO (http://www.apa.org/pubs/databases/psycinfo/index.aspx)
  – Academic Search Premier
  – ProQuest Dissertations and Theses
  – Web of Science
  – Google Scholar
• Eligible studies might also be found by “manually” searching book chapters, journal archives, conference programs, technical reports, websites of government and non-government agencies (e.g., Center for Applied Linguistics, Title VI Language Research Centers), as well as more personal and/or interactive venues.
How to Do a Meta-Analysis
3) Filtering the Literature:
! Inclusion and Exclusion Criteria
” Example: Truscott (2007) used 548 words to describe very specific inclusion
criteria for his meta-analysis of corrective feedback on L2 writing (d = −0.16). A
similar meta-analysis by Poltavtchenko and Johnson (2009) used 42 words to
describe their broader inclusion criteria and obtained a result that differed in both
size and direction (d = 0.33).
! Generally, we suggest that it is much better to over-search the literature
than to under-search it.
How to Do a Meta-Analysis
4) Designing a Coding Sheet
! Lipsey and Wilson (2001) categorize the items in a meta-analysis coding
sheet into two general categories: study descriptors and study outcomes.
! Four types of study descriptors are usually coded: (a) study identifiers, (b)
study sample and context, (c) research design, and (d) measures. Study
quality is a fifth category that can be coded and used during the
analysis phase either to weight studies, so that those of higher quality
contribute more to the meta-analytic average, or to assess the relationship
between measured research quality and study outcomes.
! Study outcomes are effect sizes (d values and correlations) or the
descriptive statistics that allow for their computation (e.g., group means,
standard deviations, regression weights).
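As an illustration of how coded descriptive statistics can be turned into a d value, here is a minimal Python sketch. It assumes two independent groups and the pooled-standard-deviation form of Cohen's d; the function name and the numbers are hypothetical and are not taken from any study cited here.

```python
import math

def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference (treatment minus control), pooled SD."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

# Hypothetical coded values: treatment M = 78, control M = 70, both SD = 10, n = 30.
d = cohens_d(mean_t=78.0, sd_t=10.0, n_t=30, mean_c=70.0, sd_c=10.0, n_c=30)
print(round(d, 2))
```

With equal standard deviations of 10, the 8-point difference corresponds to d = 0.80.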
How to Do a Meta-Analysis
5) The Coding Process
! Information from a variety of formats—graphs, tables, text, and so forth—
is translated into a standardized format on the coding sheet.
! During the coding process, expert knowledge is the key.
! Coding can be complex. Consider how L2 proficiency.
! At least one additional rater should be asked to code. Lipsey and Wilson (2001)
recommend double-coding at least 20 but ideally 50 or more studies.
! Meta-analysts should make their coding procedure and all coding sheets
directly accessible to their readership.
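One way to summarize how well two raters agree on a double-coded categorical variable is percent agreement together with Cohen's kappa. The slides do not prescribe a particular agreement index, so the sketch below is only one possible choice; the variable (study context) and the codes are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of studies on which both raters assigned the same code."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes for "study context" assigned to eight double-coded studies.
rater_a = ["lab", "class", "class", "lab", "class", "lab", "class", "class"]
rater_b = ["lab", "class", "lab", "lab", "class", "lab", "class", "class"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, kappa)
```

Here the raters disagree on one of eight studies, giving 87.5% agreement and a kappa of 0.75.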
How to Do a Meta-Analysis
6) Analysis
! Meta-analysis essentially involves calculating a mean effect size and its
corresponding variance
! There can be some challenges in the aggregation process. A single study,
for example, may report multiple effect sizes on the same relationship,
based on multiple settings, multiple groups, multiple measures, and/or
multiple time points. It may be justifiable merely to average them prior to
the meta-analysis. But often, the data dependencies in studies like these are
more complex.
! Another common issue in the analysis phase is how to deal with missing
data.
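The simplest response to multiple effect sizes from one study, averaging them before aggregation, could look like the following Python sketch. The study labels and d values are made up, and, as noted above, plain averaging ignores more complex dependencies among the effects.

```python
from collections import defaultdict
from statistics import mean

def average_within_study(rows):
    """Collapse (study, d) pairs to one mean d per study."""
    by_study = defaultdict(list)
    for study, d in rows:
        by_study[study].append(d)
    return {study: mean(ds) for study, ds in by_study.items()}

# Hypothetical effects: Study A reports two measures, Study C three time points.
effects = [
    ("Study A", 0.40), ("Study A", 0.60),
    ("Study B", 0.90),
    ("Study C", 0.20), ("Study C", 0.30), ("Study C", 0.70),
]

per_study = average_within_study(effects)
print(per_study)
```

Each study then contributes a single value to the meta-analysis, so a study with many outcome measures does not dominate the average.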
How to Do a Meta-Analysis
7) Weighting Effect Sizes:
! Once all effect sizes have been compiled, calculated, or converted into the
same metric (e.g., correlations or d values), it is time to compute the
meta-analytic mean and variance, both of which require weighting the effect
sizes.
! One could merely average the effect sizes, but this would be inappropriate
because some effect sizes are more accurate than others. At the very least,
effect sizes should be weighted based on the study sample size.
! Effect sizes can also be weighted to account for the attenuating effects of
measurement unreliability and range restriction.
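A minimal Python sketch of sample-size weighting is given below. It covers only the "at the very least" case (weights equal to study N) and does not correct for unreliability or range restriction; the d values and sample sizes are hypothetical.

```python
def n_weighted_mean_and_variance(ds, ns):
    """Sample-size-weighted mean of effect sizes and the weighted variance around it."""
    total_n = sum(ns)
    mean_d = sum(n * d for d, n in zip(ds, ns)) / total_n
    var_d = sum(n * (d - mean_d) ** 2 for d, n in zip(ds, ns)) / total_n
    return mean_d, var_d

# Hypothetical studies: the largest effect comes from the largest sample.
ds = [0.20, 0.50, 0.80]
ns = [20, 50, 130]

weighted_mean, weighted_var = n_weighted_mean_and_variance(ds, ns)
unweighted_mean = sum(ds) / len(ds)
print(round(unweighted_mean, 3), round(weighted_mean, 3), round(weighted_var, 4))
```

In this example the unweighted mean is 0.50, whereas the N-weighted mean is 0.665, because the study with 130 participants pulls the estimate toward its own effect.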
How to Do a Meta-Analysis
8) Choice of Meta-Analysis Model: The choice of meta-analysis model determines the approach to
estimating the meta-analytic mean and variance.
! A fixed effects (FE) model assumes that there is only one population effect size, and all effect sizes are
sample realizations of that population effect. Therefore, under the FE model, any observed variation in
effects across studies is assumed to be predictable, from either sampling error variance, statistical
artifacts (e.g., differences in measurement reliability), or moderator variables.
! A random effects (RE) meta-analysis model, on the other hand, directly estimates the meta-analytic
variance in effect sizes rather than assume it is zero.
! The RE model is preferred over the FE model because it is more flexible. Specifically, if the RE model
arrives at a variance estimate that is zero, the RE model becomes the FE model.
! A forest plot presents the size of the effect on the x axis with the names of the studies being ordered
(alphabetically or by the magnitude of the effect) on the y axis. The plotted points usually bisect a
symmetric horizontal bar that shows the 95% confidence interval (CI), and in the bottom row is the
meta-analytic mean and its 95% CI.
! A funnel plot provides similar information to a forest plot: It is a scatterplot of the effect size on the x
axis, with some function of measurement precision associated with the effect on the y axis (e.g., the
sample size, the inverse of the sampling error variance).
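The relationship between the FE and RE models can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of any study discussed here: it uses inverse-variance weights and the DerSimonian-Laird estimator of between-study variance (one common choice among several), and all input values are hypothetical.

```python
import math

def fixed_effect(ds, variances):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    return mean, math.sqrt(1.0 / sum(weights))

def dl_tau_squared(ds, variances):
    """DerSimonian-Laird estimate of between-study variance (needs 2+ studies)."""
    weights = [1.0 / v for v in variances]
    fe_mean, _ = fixed_effect(ds, variances)
    q = sum(w * (d - fe_mean) ** 2 for w, d in zip(weights, ds))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    return max(0.0, (q - (len(ds) - 1)) / c)

def random_effects(ds, variances):
    """RE mean and SE: the FE computation with tau^2 added to each variance."""
    tau2 = dl_tau_squared(ds, variances)
    mean, se = fixed_effect(ds, [v + tau2 for v in variances])
    return mean, se, tau2

# Hypothetical effect sizes and sampling variances for three studies.
ds = [0.10, 0.50, 1.20]
variances = [0.02, 0.02, 0.02]

fe_mean, fe_se = fixed_effect(ds, variances)
re_mean, re_se, tau2 = random_effects(ds, variances)
re_ci = (re_mean - 1.96 * re_se, re_mean + 1.96 * re_se)  # 95% CI, as in a forest plot's bottom row
print(fe_mean, fe_se, re_mean, re_se, tau2, re_ci)
```

When the estimated between-study variance is zero, `random_effects` returns exactly the FE mean and standard error, which is the sense in which the RE model reduces to the FE model. When it is positive, the RE standard error is larger and the confidence interval wider.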
How to Do a Meta-Analysis
9) Interpreting the Results
! Meaningfulness is distinct from the size of the effect
! The d value is the effect size metric used most often in meta-analyses of L2
research
! Cohen’s (1988) benchmarks for standardized mean differences: .20 for
small, .50 for medium, and .80 for large
! Oswald and Plonsky (2010) summarized the quantitative results of 27 meta-
analyses of L2 research and suggested a preliminary set of benchmarks for
interpreting d values in SLA, with .40 representing a generally small effect, .70
medium, and 1.00 large.
! One final consideration with respect to interpreting meta-analytic effect sizes is
the degree to which independent variables in primary research are manipulated.
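To see how the two sets of benchmarks above can lead to different labels for the same d value, consider the short Python sketch below. Treating the benchmarks as hard cut-offs is a simplification made only for illustration, and the labels say nothing about meaningfulness, which the first point above treats as a separate question.

```python
COHEN_1988 = {"small": 0.20, "medium": 0.50, "large": 0.80}
OSWALD_PLONSKY_2010 = {"small": 0.40, "medium": 0.70, "large": 1.00}

def label_effect(d, benchmarks):
    """Return the highest benchmark label whose threshold |d| reaches."""
    size = abs(d)
    if size >= benchmarks["large"]:
        return "large"
    if size >= benchmarks["medium"]:
        return "medium"
    if size >= benchmarks["small"]:
        return "small"
    return "below small"

# d values reported in the examples that follow (.49 and .61) plus one hypothetical value.
for d in (0.49, 0.61, 1.10):
    print(d, label_effect(d, COHEN_1988), label_effect(d, OSWALD_PLONSKY_2010))
```

For instance, d = .61 counts as medium under Cohen's benchmarks but only small under the field-specific ones.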
Example 1
Plonsky, L. (2011). The effectiveness of second language strategy instruction:
A meta-analysis. Language Learning, 61(4), 993-1038.
Background:
Research on L2 strategy instruction has been extensive, but methods and
results in this area have been inconsistent. The goals of this study were to
summarize current findings and examine theoretical moderators of the
effects of strategy instruction (SI).
Research questions:
● How effective is L2 strategy instruction?
● How is SI affected by different learning contexts, treatments,
outcome variables, and research methods?
Example 1
Method:
Conventional database searches, Web of Science, and Google Scholar were
used to locate a total of 95 unique samples from 61 studies (N = 6,791) that
met all the inclusion criteria. Each study was then coded on 37 variables.
Five of fifteen authors who were contacted provided missing data for studies
reporting insufficient information to calculate an effect size.
Statistical tools:
Effect sizes (Cohen’s d) were weighted by sample size and combined to
calculate the meta-analytic average, standard error, and confidence intervals.
Publication bias was examined using a funnel plot. Summary effects were
also calculated for subgroups based on study characteristics.
Example 1
Results:
The meta-analytic d value for the effects of L2 strategy instruction was .49, smaller
than most effects in the L2 domain but comparable to the results of similar syntheses
in L1 educational contexts. Results indicated clear relationships between the effects
of SI and research contexts, type and number of strategies taught, length of
intervention, skill areas, and several indicators of methodological quality.
Example 2
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis.
Language Learning, 60(2), 309–365.
Background:
The theoretical and practical centrality of corrective feedback has led to extensive
research testing its effects, yet disagreement remains over how empirical findings
can inform SLA theory and practice. It is also unclear how different types of
feedback, learning contexts, and L2 features might relate to its effectiveness.
Research questions:
● What is the overall effect of corrective feedback on L2 learning?
● Do different feedback types impact L2 learning differently?
● Does the effectiveness of corrective feedback persist over time?
● What are the moderator variables for the effectiveness of corrective
feedback?
Example 2
Method:
Primarily, Li searched two academic databases, manually searched the archives of
over a dozen journals of L2 research and scanned the references of review articles.
This study also included 11 dissertations for a total of 33 unique study reports.
Statistical tools:
The Comprehensive Meta-Analysis software program enabled a relatively
sophisticated meta-analysis, statistically speaking. All results were calculated and
presented using both RE and FE models, and availability and publication bias were
addressed using a funnel plot and a trim-and-fill analysis. (Trim-and-fill is a
nonparametric statistical technique that adjusts the meta-analytic mean. It does so by
estimating effects that appear to be missing if a fixed-effects model and no
systematic bias are assumed.) Additionally, Li tested for several subgroup differences
between studies.
Example 2
Results:
The overall d value for CF according to the FE model was .61 (RE = .64). Moderator
results were also found for feedback types, delayed effects, and different contexts
(e.g., classroom vs. lab; see also Lyster & Saito, 2010). There was some evidence of
publication bias, yet the effect sizes from the 11 unpublished dissertations in this
study were larger on average than in published studies.