Abstract

In the present article, we explore the extent to which previous research on register variation can be used to predict spoken/written task-type variation as well as differences across score levels in the context of a major standardized language exam (TOEFL iBT). Specifically, we carry out two sets of linguistic analyses based on a large corpus of TOEFL iBT responses: one investigating the use of 23 grammatical complexity features, and the second based on co-occurrence patterns among linguistic features, using Multi-Dimensional (MD) analysis. The first set of analyses confirms the predictions from research on register variation: there are systematic linguistic differences among spoken versus written and independent versus integrated task types. However, hypothesized developmental progressions in the use of these grammatical complexity features were generally not confirmed by score-level differences. In contrast, the MD analysis yielded more robust predictors of both task types and score-level differences, indicating that linguistic descriptions are more reliable and informative when they are based on dimensions of co-occurring lexico-grammatical features. In conclusion, we discuss the application of such dimensions as holistic complexity measures in language development/testing research.

1. INTRODUCTION

Task variation has been approached from multiple perspectives within applied linguistics. Second Language Acquisition (SLA) researchers have usually focused on the ways in which the cognitive complexity of a task influences the language produced by students ( Skehan 1998 ; Robinson 2001 ). Practitioners in language teaching and testing, on the other hand, often approach tasks from a broader perspective, including consideration of spoken and written tasks with a range of different communicative purposes ( Yuan and Ellis 2003 ; Lu 2011 ). Similarly, major standardized English language exams (e.g. TOEFL iBT, IELTS, PTE-Academic) have been designed to assess performance in both spoken and written tasks, including multiple task types in each mode (e.g. ‘independent’ versus ‘integrated’ task types in the case of TOEFL iBT).

Despite the recognition of their importance, there has been relatively little research carried out to date to explore the linguistic differences among the task types included on standardized exams ( Brown et al. 2005 ; Cumming et al. 2005 ; Banerjee et al. 2007 ). In addition to being few in number, previous linguistic descriptions have been limited in three major respects. First of all, these studies have been restricted mostly to vocabulary measures and holistic grammatical measures that are difficult to interpret linguistically (e.g. T-unit length—see discussion in Section 1.3). Secondly, no previous study has directly compared the linguistic characteristics of spoken versus written texts that are produced for standardized language exams. And finally, researchers have generally set out to simply establish the existence of linguistic differences, rather than exploring specific predictions concerning the expected patterns of linguistic variation among task types and modes of production.

The present study offers an approach that attempts to address these limitations. We begin with a brief survey of previous research on task types in applied linguistics, focusing especially on the linguistic analysis of task types included on standardized language exams (Section 1.1). Then, in Section 1.2, we survey previous linguistic research on register variation, discussing how the cumulative findings from this body of research enable specific predictions concerning the expected patterns of grammatical variation among spoken and written task types, and across proficiency levels, on standardized language exams.

Section 1.3 takes up a methodological issue underlying previous research on grammatical complexity in both applied linguistics and functional linguistics: what set of linguistic variables is best able to capture interpretable differences among texts produced for different task types and by speakers/writers at different proficiency levels? In this section, we compare the strengths and weaknesses of three different approaches to the measurement of grammatical complexity: traditional holistic measures based on T-units, corpus-based measures of specific grammatical characteristics (e.g. relative clauses and prepositional phrases) and ‘dimensions’ of co-occurring grammatical features identified through Multi-Dimensional (MD) analysis. We argue that linguistic variation has a functional basis, and that traditional holistic measures confound linguistic characteristics that are functionally distinct. Analysis of specific grammatical complexity features is more motivated from a linguistic perspective, because it acknowledges the functions and register distributions of each feature, enabling specific predictions concerning the expected patterns of variation. However, this approach can prove unwieldy because there are so many different grammatical characteristics to consider. Thus, MD analysis is offered as an alternative approach, resulting in holistic measures that have a strong empirical basis.

Building on this background, the main body of the article explores the extent to which the linguistic generalizations emerging from studies of register variation can be used to predict both spoken/written task-type variation as well as differences across proficiency levels in the context of a major standardized language exam (TOEFL iBT). Specifically, we carry out two sets of linguistic analyses based on analysis of a large corpus of TOEFL iBT responses: one investigating the use of 23 grammatical complexity features, and the second based on the co-occurrence patterns among linguistic features, using MD analysis. The results of these investigations show that it is possible to predict linguistic differences among task types based on the situational characteristics associated with spoken versus written, and independent versus integrated task types. In contrast, predicted differences among score levels are in general not supported by the analysis of the 23 individual complexity features. However, the MD analysis shows that linguistic predictions of both task-type and score-level differences are more reliable and informative when they are based on sets of co-occurring lexico-grammatical features (see also Jarvis et al. 2003 ; Friginal et al. 2014 ). Taken together, these analyses provide strong evidence for the claim that linguistic differences among task types and proficiency levels have a functional basis and are therefore systematic and predictable.

1.1 Task types in applied linguistics

There is general agreement in applied linguistics that consideration of tasks and task-type variation is important for language teaching, learning/acquisition, and assessment. Many university Intensive English Programs have adopted task-based curriculums, which structure language teaching and learning in terms of real-world tasks that students accomplish.

Much recent work in SLA has been motivated by the recognition that tasks differ in their cognitive complexity. A great deal of this research has focused on identifying the linguistic correlates of cognitively complex tasks ( Robinson 2007 ; Tavakoli and Foster 2008 ). A related line of research considers variation across more general task types associated with different communicative purposes (e.g. description, narration, exposition) and language produced with different degrees of pre-planning ( Beers and Nagy 2009 ; Ellis 2009 ; Way et al. 2000 ; Yuan and Ellis 2003 ). It is noteworthy that most SLA research on task variation has focused on either spoken tasks or written tasks; few studies have directly compared production in speech versus production in writing as a factor that might influence linguistic complexity (but see Ellis and Yuan 2005 ; Kormos and Trebits 2012 ; Kuiken and Vedder 2012 ). This restriction is surprising given the obvious differences in the planning and production circumstances of speech and writing: spoken texts are produced in real time with no possibility of editing or revising (apart from restating an utterance), even if opportunities for pre-planning exist. In contrast, written production (even if produced under time constraints) offers the opportunity to pause, plan, and edit during the production of a text. Thus, there is every reason to expect that language production in the spoken versus written mode is a crucially important component of task variation.

Recognizing this potential difference, the major standardized English language exams have been designed to assess performance in both speech and writing, including multiple task types in each mode. Assessment experts believe that we will achieve a more accurate understanding of language proficiency in relation to real-world target domains by measuring performance in multiple tasks (see Chapelle et al. 2008 : 2), as different types of tasks are expected to elicit different types of discourse and ‘thereby [broaden] representation of the domain of academic language on the test’ ( Enright and Tyson 2008 : 5).

For example, the TOEFL iBT exam consists of spoken and written tasks, including ‘independent’ and ‘integrated’ task types in each mode ( Enright and Tyson 2008 ). Independent tasks ask test takers to draw on their own experiences to create an argument. In contrast, integrated tasks ask test takers to listen to a lecture and/or read a passage, and then respond to a prompt that requires use of that information (see http://www.ets.org/toefl/ibt/about/content/ ). As Brown et al. (2005 : 1) note, integrated tasks can be considered ‘more complex and more demanding than more traditional stand-alone or independent tasks, in which test-takers draw on their own knowledge or ideas to respond to a question or prompt’. Instead, ‘test-takers are required to process and transform a cognitively complex stimulus (e.g. a written text or a lecture) and integrate information from this source into the … performance’ ( Brown et al. 2005 : 1).

The IELTS exam and the PTE-Academic also require test takers to produce language for multiple spoken and written task types. For example, the spoken task types in IELTS vary with respect to interactivity and communicative purpose, while the written task types similarly vary in their general communicative purposes (describing the information in a visual display versus arguing in support of a point of view; see http://www.ielts.org/test_takers_information/what_is_ielts/test_format.aspx ). The PTE-Academic similarly includes different task types in speech (e.g. describing an image, summarizing a lecture, and answering short questions) and writing (e.g. summarizing the content of a spoken lecture and writing a persuasive essay on a general academic topic; see Zheng and De Jong 2011 ).

Surprisingly, there has been relatively little research carried out to date to document the linguistic differences among the task types included on standardized exams. Brown et al. (2005) investigated spoken tasks on the TOEFL iBT, finding greater use of modal verbs in independent task types, but mixed results for grammatical complexity measures: some complexity measures (e.g. dependent clause ratio) were higher in independent tasks; other complexity measures (e.g. number of verb phrases per T-unit) were higher in integrated tasks; but dependent clause complexity measures were not significantly different across score levels. Cumming et al. (2006) and Banerjee et al. (2007) both focus exclusively on written task types (in the TOEFL iBT and IELTS, respectively). Cumming et al. found longer words on average in integrated task types, but mixed or weak results for type-token ratio and clauses per T-unit: clausal complexity was greater in independent tasks than in integrated tasks, and more proficient learners produced more words per T-unit, but there were no significant differences across score levels for clauses per T-unit. Banerjee et al. found a higher reliance on content words in the explanation/description task, but unclear patterns for a grammatical complexity measure (dependent clauses per clause).

In addition to being few in number, previous linguistic descriptions of task-type variation on standardized language exams have been limited in scope. No previous study has directly compared the linguistic characteristics of spoken versus written task types on language exams. In addition, these studies have been restricted mostly to vocabulary measures and a few holistic grammatical measures based on T-units. And perhaps most surprisingly, researchers have generally not posited specific predictions concerning the expected patterns of linguistic variation among task types on standardized exams. It seems clear that the researchers who designed the structure of these exams expected that language produced for different tasks in speech versus writing would elicit use of different lexico-grammatical resources. However, no specific predictions concerning the expected patterns of linguistic variation among task types have been posited to date.

One major reason for this omission might be the failure to recognize the functional basis of linguistic variation. In contrast, research carried out under the umbrella of ‘register variation’ has a thoroughly functional basis, and thus provides the foundation for specific predictions concerning the expected patterns of linguistic variation across different task types on standardized language exams. We turn to an overview of this body of research in the following section.

1.2 Spoken and written register variation as the basis for predictions about task-type differences on language exams

Linguistic research on ‘registers’ provides a useful foundation for extending previous research on task-type variation. ‘Registers’ are textual varieties ‘associated with a particular situation of use (including particular communicative purposes)’ ( Biber and Conrad 2009 : 6). Registers can be compared with respect to several situational parameters, including production in the spoken versus written mode, opportunity for planning, revising, and editing, interactivity, and different communicative purposes ( Biber and Conrad 2009 : 36–47). The underlying assumption of such research is the claim that all linguistic variation is meaningful because it is functionally associated with the situational context ( Biber and Conrad 2009 : 6–15).

The situational basis of registers is important because it allows us to predict patterns of linguistic variation. That is, the register perspective assumes that ‘linguistic features tend to occur in a register because they are particularly well-suited to the purposes and situational context of the register’ ( Biber and Conrad 2009 : 6). Thus, by building on previous register findings, researchers are able to formulate specific hypotheses concerning the expected linguistic differences among task types by considering the situational contexts of those task types.

To give a simple example, the general communicative purpose of fictional narrative is to report past events typically involving third-person participants. As a result, texts from this register employ frequent past tense verbs, third-person pronouns, and time and place adverbials ( Biber 1988 : 135–138). In contrast, the general communicative purpose of academic books and articles is to present and explain current research findings, resulting in a frequent use of common nouns, nominalizations, relative clauses, and passive voice verbs ( Biber 1988 : 151–153). The Longman Grammar of Spoken and Written English ( Biber et al. 1999 ) documents the use of numerous other grammatical features associated differentially with written registers like fiction, newspaper reportage, or academic prose. While these registers do not provide a perfect match to the task-type distinctions employed in language testing, there are sufficient similarities to enable meaningful hypotheses concerning the predicted linguistic differences among many tasks with respect to a wide array of linguistic features.

One major focus of register research has been the linguistic differences between spoken and written discourse, associated with their different production circumstances, different possibilities for interaction, and different typical communicative purposes. In general, spoken/written differences are much larger than other register differences ( Biber 2006 : 186–191), indicating that the differences between spoken and written task types on language tests like IELTS and TOEFL are likely to be similarly significant. Based on previous register research ( Biber 1988 ; Biber et al. 1999 ), we are able to predict that spoken task types should generally rely on grammatical features like pronouns, modal verbs, lexical verbs, adverbs, and finite clauses; written task types are predicted to rely on grammatical features like nouns, nominalizations, attributive adjectives, and prepositional phrases.

The different types of dependent clauses in English have been a major concern of corpus-based register research, which has documented the dramatically different discourse functions served by these devices. Dependent clause types are distributed in strikingly different ways across registers, and surprisingly, many types of dependent clause are considerably more common in conversation than in academic writing. For example, finite complement clauses (e.g. I think today is Monday , I don't know why I did that ), and finite adverbial clauses (e.g. if -clauses and because -clauses) are 2–10 times more common in conversation than academic writing ( Biber et al. 2011 : 23–26). Finite relative clauses (e.g. the book that I told you about ) are used with much lower frequencies, in both conversation and academic writing. Non-finite relative clauses (e.g. the process developed by the team at XYU ) are one of the few clause types found more commonly in academic writing than in conversation, but the frequencies are low in absolute terms.

Previous corpus-based research has further shown that phrasal complexity features (e.g. prepositional phrases functioning as noun modifiers) are much more characteristic of advanced academic writing than dependent clause features ( Biber 1988 ; Biber and Gray 2010 ; Biber et al. 2011 , 2013 ). Halliday ( 1989 , 2004 ) has similarly argued from a more theoretical perspective that the linguistic complexity of speech is clausal (relying on dependent clauses) while the complexity of academic writing relies on nouns and nominalizations. Based on this previous research, we would predict a greater use of finite dependent clauses (especially complement clauses and adverbial clauses) in spoken task types, contrasted with a greater use of phrasal noun modifiers in written task types. By extension, it is possible to predict a greater use of clausal complexity features in independent/personal task types (with communicative purposes more similar to the typical purposes of conversation), contrasted with the expectation that integrated task types would make greater use of phrasal noun modifiers (because they have communicative purposes more similar to academic writing).

Building on this body of research, Biber et al. ( 2011 : 26–27) hypothesized a developmental sequence of grammatical complexity features for both L1-English and L2-English writers. This developmental progression is based on the assumption that students typically enter the university with conversational skills (and personal writing skills), while the complexity features of advanced academic writing are acquired later. The proposed sequence includes five developmental stages, but these can be summarized as a developmental progression along two grammatical parameters:

  • Grammatical form: finite dependent clause ⟶ non-finite dependent clause ⟶ dependent phrase

  • Syntactic function: clause constituents (e.g. direct object or adverbial) ⟶ noun phrase modifiers

Recent studies of both L1 and L2 writing development provide support for these hypothesized progressions. For example, we have undertaken an unpublished investigation of complexity features in the course writing of British university students (based on the British Academic Written English Corpus). Finite complement clauses, finite adverbial clauses, and even finite relative clauses declined in use over the 4 years of university study. In contrast, prepositional phrases and nouns as nominal modifiers increased in use, especially for students in the social sciences and natural sciences. Taguchi et al. (2013) use many of these same features to investigate L2-English student writing, finding that lower-rated essays used more finite/non-finite dependent clauses while higher-rated essays used more attributive adjectives and post-noun-modifying prepositional phrases. Parkinson and Musgrave (2014) utilize a subset of the developmental stages from Biber et al. (2011) to compare L2 writing by English for Academic Purposes (EAP) students and matriculated MA students. EAP students used a lower proportion of nouns modifying other nouns and post-noun-modifying prepositional phrases when compared with MA-level writing.

For the present study, we apply this framework to predict that students who achieve higher scores on standardized language exams will utilize phrasal complexity features to a greater extent than lower scoring students (see also Biber et al. 2011 : 30–31). One complication here is that exam scores reflect many factors in addition to grammatical proficiency. But it is at least reasonable to initially hypothesize a correlation between the two.

In addition, there is one other major research innovation to emerge from corpus-based studies of register variation that can contribute to investigations of language development: the importance of linguistic co-occurrence. We turn to that consideration in the following section.

1.2.1 MD analyses of register variation

From a linguistic perspective, a text is made up of dozens of co-occurring lexico-grammatical features. Although registers can be compared for the extent to which they use individual linguistic features, more robust descriptions are possible by considering the ways in which features co-occur. MD analysis is a research approach developed for that purpose: it identifies the important linguistic co-occurrence patterns—the ‘dimensions’—and it then compares registers along each of those dimensions.

There are several introductions to the MD approach, and numerous studies of register variation have employed this framework ( Biber 1988 , 1995 , 2006 , 2014 ; see also Friginal 2013 ). We thus provide only a brief overview here.

MD studies are based on large corpora of naturally occurring texts , representing the range of registers in a discourse domain. The first step in the analysis is to analyse the distribution of individual linguistic features in the corpus. Then, factor analysis is used to identify the systematic co-occurrence patterns among those linguistic features: the ‘dimensions’. Each dimension comprises a group of linguistic features that tend to co-occur in texts (e.g. nouns, attributive adjectives, prepositional phrases). The dimensions are interpreted to assess their underlying functional associations, and then texts and registers can be compared along each dimension to describe the overall patterns of register variation.
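To make this procedure concrete, the sketch below illustrates the factor-analytic core of an MD analysis in Python. The data frame of normalized feature rates, the number of factors, and the .35 loading cut-off are illustrative assumptions rather than the settings of any particular published MD study; dimension scores are computed in the standard MD way, by summing the standardized rates of features that load positively on a factor and subtracting those that load negatively.

```python
# A minimal sketch of the factor-analytic step of an MD analysis.
# Assumes `counts` is a pandas DataFrame with one row per text and one column
# per linguistic feature, already normalized to rates per 1,000 words.
# The number of factors and the .35 cut-off are illustrative choices only.
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def md_dimension_scores(counts: pd.DataFrame, n_factors: int = 4,
                        cutoff: float = 0.35) -> pd.DataFrame:
    # Step 1: standardize each feature so that frequent and rare features
    # contribute equally to the co-occurrence patterns.
    z = (counts - counts.mean()) / counts.std(ddof=0)

    # Step 2: factor analysis identifies sets of features that co-occur
    # across texts (the candidate 'dimensions').
    fa = FactorAnalysis(n_components=n_factors, random_state=0).fit(z)
    loadings = pd.DataFrame(fa.components_.T, index=counts.columns,
                            columns=[f"Dim{i + 1}" for i in range(n_factors)])

    # Step 3: score each text on each dimension by summing the z-scores of
    # positively loading features and subtracting those of negatively
    # loading features (the conventional MD scoring procedure).
    scores = {}
    for dim in loadings.columns:
        pos = loadings.index[loadings[dim] > cutoff]
        neg = loadings.index[loadings[dim] < -cutoff]
        scores[dim] = z[pos].sum(axis=1) - z[neg].sum(axis=1)
    return pd.DataFrame(scores, index=counts.index)
```

Mean dimension scores can then be compared across groups of texts (for example, across registers, task types, or score levels) as part of the functional interpretation of each dimension.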

One of the most important findings from MD analyses is the existence of a basic oral-literate dimension, identified in nearly all previous MD studies ( Biber 1988 , 2006 ). Linguistically, this opposition is realized as two fundamentally different ways of constructing discourse: clausal versus phrasal. That is, across studies the oral pole of this dimension consists of verb classes (e.g. mental verbs, communication verbs), grammatical characteristics of verb phrases (e.g. present tense, progressive aspect), and modifiers of verbs and clauses (e.g. adverbs and stance adverbials). Interestingly, in most studies these ‘oral’ features also include clausal complexity features: dependent clauses that function as clause constituents, including adverbial clauses and finite complement clauses. In contrast, the ‘literate’ pole of this dimension usually consists of phrasal devices that mostly function as elements of noun phrases, especially nouns, nominalizations, attributive adjectives, and prepositional phrases. In nearly every case, this parameter is the first dimension identified by the statistical factor analysis (i.e. it is the most important factor, accounting for the greatest amount of shared variance). Functionally, this dimension is interpreted as distinguishing between a personal/involved focus (personal stance, interactivity, and/or real-time production features) versus informational focus. It consistently distinguishes between spoken versus written registers, and distinguishes among registers with personal versus informational communicative purposes within each mode (e.g. personal letters versus academic writing within the written mode; see Biber 1988 ).

Similar patterns of variation have been documented across MD studies (of different discourse domains in English, and of spoken/written register variation in many different languages—see Biber 2014 ). This cumulative body of research provides a strong theoretical and empirical foundation for predictions concerning the linguistic differences among task types on language exams. In particular, we would predict a strong difference between spoken and written task types, with spoken tasks employing clausal linguistic features, and written tasks making greater use of noun phrases and phrasal modifiers. Further, within each mode, we would predict that tasks that require personal narration/opinion/description (i.e. independent tasks) would make greater use of clausal features, while tasks that require informational explanation (i.e. integrated tasks) would make greater use of nouns and phrasal modifiers.

There are two primary motivations for carrying out an MD analysis: linguistic and applied. The linguistic motivation is based on the theoretical claim that registers are best described with respect to patterns of linguistic co-occurrence (see also Ervin-Tripp 1972 ). MD analysis provides an empirical approach to identify the sets of co-occurring linguistic features that distinguish among texts and registers. There is every reason to believe that this approach will prove equally useful for analyses of proficiency differences. Jarvis et al. (2003) come to this same conclusion, arguing that: ‘the quality of a written text may depend less on the use of individual linguistic features than on how these features are used in tandem’ ( Jarvis et al. 2003 : 399). 1

The applied motivation for MD analysis relates to the quest for holistic complexity measures that can be employed for analyses of learner language. In practical terms, it is cumbersome to include analyses of numerous lexico-grammatical features, and difficult to decipher the general patterns of variation based on such analyses. MD analysis provides a holistic analysis of the entire system, by identifying the ways in which individual linguistic features function together in texts (based on their statistical co-occurrence patterns). As such, the dimensions resulting from an MD analysis provide a potentially useful set of holistic measures for testing applications. We turn to this possibility in the following section.

1.3 Grammatical complexity measures in applied linguistics

Many studies of task-type variation and writing development, similar to studies of register variation, have focused on grammatical complexity. However, complexity is measured in fundamentally distinct ways in these different research traditions. Two major factors must be considered when evaluating the effectiveness of these approaches:

  • (1) Are the measures well-motivated and interpretable from a linguistic perspective?

  • (2) Do the measures parsimoniously represent the entire system of complexity?

Research carried out in the register tradition gives primary weight to the first factor. As discussed in Section 1.2 above, register research is based on the claim that all linguistic variation is functionally associated with the situational context ( Biber and Conrad 2009 : 6–15). This assumption defines all research in functional linguistics; for example: ‘Functionalists maintain that the communicative situation motivates, constrains, explains, or otherwise determines grammatical structure …’ ( Nichols 1984 : 97).

The practical consequence of this assumption is that all grammatical variants are analysed on their own terms, expecting that there will be pragmatic/discourse factors that explain why a speaker or writer would choose one or the other variant. This is the approach adopted in the Longman Grammar of Spoken and Written English ( Biber et al. 1999 ) and in all previous research on register variation.

From this perspective, there are numerous grammatical devices associated with complexity, and so texts can be complex in very different ways in addition to being complex to differing extents (see discussion in Section 1.2). In recent years, this approach has also begun to be influential in applied linguistics; thus, Housen et al. ( 2012 : 8) note the recent attention given to ‘measures targeting more specific subdomains of language and more distinct linguistic features, as a complement to the use of more global measures’. Similarly, Biber et al. ( 2011 , 2013 ) advocate the utility of this approach for research in applied linguistics, providing the basis for the specific predictions about task-type variation investigated in the present study.

In contrast, most research on grammatical complexity in language development and testing has given primary weight to the second factor, seeking a few measures that holistically represent the entire system. Specifically, grammatical complexity has usually been measured through a few variables based on T-units (a main clause plus all associated dependent clauses), such as the mean length of T-units (words per T-unit) or dependent clauses per T-unit. The underlying assumption is that longer structures are more complex, regardless of the specific grammatical devices used to create the longer structure.
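To make these measures concrete, the short Python sketch below computes the two most common T-unit-based measures for two hypothetical texts; the counts are invented purely for illustration. It also shows how a length-based measure can assign identical values to texts whose grammatical profiles are quite different.

```python
# A minimal sketch of two traditional T-unit-based complexity measures.
# The counts for the two hypothetical texts are invented for illustration;
# in practice they would come from a parsed and T-unit-segmented corpus.

def mean_length_of_t_unit(n_words: int, n_t_units: int) -> float:
    return n_words / n_t_units

def dependent_clauses_per_t_unit(n_dep_clauses: int, n_t_units: int) -> float:
    return n_dep_clauses / n_t_units

# Text A reaches 18 words per T-unit mainly through finite dependent clauses;
# Text B reaches the same length through phrasal modifiers (e.g. prepositional
# phrases and premodifying nouns). Words per T-unit cannot tell them apart.
text_a = {"n_words": 180, "n_t_units": 10, "n_dep_clauses": 12}
text_b = {"n_words": 180, "n_t_units": 10, "n_dep_clauses": 2}

print(mean_length_of_t_unit(text_a["n_words"], text_a["n_t_units"]),
      mean_length_of_t_unit(text_b["n_words"], text_b["n_t_units"]))      # 18.0 18.0
print(dependent_clauses_per_t_unit(text_a["n_dep_clauses"], text_a["n_t_units"]),
      dependent_clauses_per_t_unit(text_b["n_dep_clauses"], text_b["n_t_units"]))  # 1.2 0.2
```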

These two approaches have opposite strengths and weaknesses. The register/functional approach is well motivated linguistically, recognizing the functional characteristics of each lexico-grammatical feature, but it is not parsimonious: there are dozens of clausal and phrasal grammatical features in English associated with structural elaboration and complexity, and it is often not practical to consider all of those features in textual descriptions for applied purposes. In contrast, the T-unit approach provides a few holistic measures designed to capture the entire system of grammatical complexity. However, the T-unit approach is not well motivated linguistically, because it confounds the analysis of multiple grammatical features that have distinct functions and distributions. There is no empirical evidence to support the combination of this set of grammatical distinctions into a few holistic measures; in fact, corpus-based register/functional research has repeatedly demonstrated that these grammatical differences matter. 2

Researchers like Ortega, Norris, Byrnes, and Lu have argued for an intermediate position, retaining a parsimonious approach with a few holistic measures, but including measures that capture some aspects of phrasal (versus clausal) elaboration. Ortega ( 2003 : 514) suggests that we might expect a curvilinear relationship between clausal subordination and phrasal modification, with increased clausal subordination at lower levels of proficiency, but increased phrasal complexity and decreased clausal complexity at more advanced levels. For this reason, Ortega (2003) , Norris and Ortega (2009) , and Byrnes et al. (2010) advocate consideration of an additional holistic measure: mean length of (finite) clause. Finite clauses become longer with the addition of phrases and non-finite clauses, so this measure is interpreted as a reflection of phrasal elaboration, complementing the clausal elaboration captured by most T-unit measures. Lu (2011) similarly includes additional measures, such as complex phrases per clause/T-unit and complex nominals per clause/T-unit. While these studies take a step forward in addressing the functional differences among different types of complexity features, they still collapse multiple structural distinctions into a few measures, without empirical evidence to support those combinations.

In contrast, the ‘dimensions’ resulting from MD analyses are derived from empirical linguistic analyses of large text corpora. Thus, MD analysis attempts to achieve a parsimonious approach with a few holistic measures, while at the same time recognizing the full spectrum of structural/functional resources related to grammatical complexity in English. The major advantage over T-unit measures is that dimensions have a solid empirical basis: the grammatical features combined on a dimension have been empirically shown to co-occur in texts (through corpus analysis and a statistical factor analysis). Thus, these combinations of grammatical features are distributed in similar ways, and by extension, function in similar ways. In this way, dimensions provide a parsimonious holistic description of grammatical complexity, while also offering measures that are linguistically well motivated and interpretable. The following study illustrates the application of such measures to the description of task-type and score-level variation within the context of a major standardized language exam.

1.4 Overview of the study

Building on the body of research surveyed above, the present article investigates linguistic variation in the context of a major standardized language exam (TOEFL iBT). The main goal is to explore the extent to which the linguistic patterns found in register studies can be used to predict spoken/written task-type variation and score-level differences among test takers. For this purpose, we first analyse the distribution of 23 grammatical complexity features that have been described in previous register studies. The results of that analysis (Section 3 below) show that grammatical complexity features vary across spoken and written task types in the ways predicted by previous research, but these features are generally not associated with statistically significant differences across score levels.

This finding leads to the second research goal of the article: to explore the utility of linguistic ‘dimensions’ (sets of co-occurring linguistic features identified through MD analysis) as holistic predictors of variation among spoken and written task types as well as score levels. Section 4 presents the results of an MD analysis, showing that linguistic predictions of both task-type and score-level differences are more reliable and informative when they are based on sets of co-occurring lexico-grammatical features.

2. CORPUS DESIGN AND ANALYTICAL METHODS

The study is based on the linguistic analysis of a corpus of spoken and written texts produced by language learners taking the TOEFL iBT. The TOEFL iBT provides an ideal context for investigating these issues, because it incorporates independent versus integrated tasks in both the spoken and written modes, produced by learners at different proficiency levels.

The four major task types required for the exam can be described from a register perspective; Table 1 summarizes the major situational characteristics of each task. Independent tasks, in both speech and writing, ask test takers to draw on their own personal experiences to argue for a point of view. In contrast, integrated tasks are more informational in focus, asking test takers to listen to a lecture and/or read a passage, and then respond to a prompt that requires use of the information learned through listening and reading. Based on previous register studies, we predicted that the texts produced in the written mode would have increased grammatical complexity compared with texts produced in the spoken mode, due to the increased opportunities for planning, revising, and editing. We further predicted that integrated texts would have increased grammatical complexity compared with independent texts, associated with the communicative purposes of explaining information rather than reporting personal experiences.

Table 1:

Summary of some major situational characteristics of the TOEFL iBT text categories

Text category | Mode of production | Planning/editing time | Support from external text | Communicative purposes
Spoken independent | Speech | Minimal: 15-second planning time; 45-second response | None | Give personal opinions based on individual personal experiences
Spoken integrated | Speech | Little: 20-second planning time; 60-second response | Yes, both written and spoken texts | Describe/summarize the information in the external texts; sometimes also take a position
Written independent | Writing | Considerable: 30 minutes to plan and write | None | Give personal opinions about life choices or general issues
Written integrated | Writing | Considerable: 20 minutes to plan and write | Yes, both written and spoken texts | Describe/summarize the information in the external texts

The corpus for our analysis is part of the TOEFL iBT Public Use Dataset , comprising 480 spoken exams (including responses to 6 items per exam, for a total of 2879 spoken responses) and 480 written exams (2 items per exam, for a total of 960 written responses), taken from two forms of the TOEFL iBT. 3 Each individual response had been previously assigned a holistic score by TOEFL iBT raters. Spoken and written responses were originally scored using different scales, but we transformed all scores to a 4-point scale to allow direct comparisons among the task types (see details in Biber and Gray 2013 : 12).

For the quantitative linguistic analyses, we excluded all texts shorter than 100 words. There were two motivations for that decision: (i) the unreliability of quantitative rates of occurrence measured in very short text samples, and (ii) the confounding influence of text length and TOEFL iBT score. The resulting corpus composition is shown in Table 2 . 4

Table 2:

Final corpus for analysis

Task | Score level | Number of texts | Sub-corpus size | Mean text length | Minimum text length | Maximum text length
Spoken independent tasks (two responses per test taker) | 2 | 39 | 4,376 | 112.2 | 101 | 140
| 3 | 142 | 16,245 | 114.4 | 101 | 172
| 4 | 67 | 7,953 | 118.7 | 101 | 164
| Subtotal | 248 | 28,574 | | |
Spoken integrated tasks (four responses per test taker) | 2 | 313 | 37,153 | 118.7 | 101 | 195
| 3 | 654 | 84,104 | 128.6 | 101 | 213
| 4 | 216 | 30,521 | 141.3 | 101 | 212
| Subtotal | 1,183 | 151,783 | | |
Written independent tasks (one response per test taker) | 1 | 42 | 9,597 | 228.5 | 123 | 351
| 2 | 177 | 51,118 | 288.8 | 160 | 507
| 3 | 155 | 52,452 | 338.4 | 206 | 549
| 4 | 102 | 39,300 | 385.3 | 261 | 586
| Subtotal | 476 | 152,467 | | |
Written integrated tasks (one response per test taker) | 1 | 119 | 20,587 | 173.0 | 101 | 293
| 2 | 118 | 23,683 | 200.7 | 102 | 303
| 3 | 122 | 25,962 | 212.8 | 108 | 367
| 4 | 112 | 26,264 | 234.5 | 145 | 388
| Subtotal | 471 | 96,496 | | |
Total | | 2,378 | 429,320 | | |

The entire corpus was automatically annotated for lexico-grammatical features using the Biber Tagger ( Biber 1988 ; Biber et al. 1999 ). We then undertook a detailed process of tag-checking, followed by development of specific computer programs for automatic and/or manual tag correction for problematic grammatical features (described in Biber and Gray 2013 : 15–17). That process was applied cyclically until we achieved a high level of accuracy for all targeted linguistic features (‘precision’ and ‘recall’ rates over 90%; Biber and Gray 2013 : Appendices III and IV). Next, we developed additional computer programs to analyse the quantitative distribution of linguistic features, ‘normalized’ to a rate per 1,000 words of text, so that quantitative measures are comparable across texts regardless of text length.
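For illustration, the normalization step can be expressed as a short Python function. The input format (a dictionary of raw tag counts plus a word total for each text) is an assumption made for this sketch, not the actual output format of the Biber Tagger or of the programs used in the study.

```python
# A minimal sketch of normalizing raw feature counts to rates per 1,000 words,
# so that texts of different lengths can be compared directly. The input format
# is assumed for illustration only.
from typing import Dict

def normalize_per_1000(raw_counts: Dict[str, int], n_words: int) -> Dict[str, float]:
    if n_words < 100:
        # Texts shorter than 100 words were excluded from the analysis, because
        # rates of occurrence are unreliable in very short samples.
        raise ValueError("text too short for reliable rates of occurrence")
    return {feature: count / n_words * 1000 for feature, count in raw_counts.items()}

# Example: 6 finite adverbial clauses in a 150-word response is a rate of
# 40 per 1,000 words.
print(normalize_per_1000({"finite_adverbial_clause": 6}, 150))
```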

In Biber and Gray (2013) , we report on the distribution and use of numerous lexical, phraseological, and grammatical features in these task types. However, in the present article, we focus on 23 linguistic features that have been associated with grammatical complexity. Rather than using holistic measures such as T-unit length or number of dependent clauses per T-unit, we analysed a wide range of grammatical features that have been motivated by previous empirical research on register variation (see discussion in Sections 1.1–1.4 above). These include grammatical classes (e.g. nouns, adverbs, passive voice verbs, linking adverbials), particular types of dependent clauses (e.g. that complement clauses controlled by verbs and nouns, finite adverbial clauses, non-finite complement clauses, finite relative clauses, and non-finite relative clauses), and, most importantly, a wide range of phrasal structures used for modification and elaboration (e.g. attributive adjectives, nouns as nominal pre-modifiers, prepositional phrases). The following section reports on the use of these 23 linguistic features, considering their distributions across modes, tasks, and proficiency levels. Then, in Section 4, we present the results of an MD analysis, showing how these linguistic features work together as co-occurring sets that more accurately predict mode/task/proficiency differences than linguistic features considered individually.

3. TASK, MODE, AND SCORE-LEVEL DIFFERENCES FOR INDIVIDUAL GRAMMATICAL COMPLEXITY FEATURES

To investigate linguistic variation in the TOEFL iBT domain, we undertook factorial analyses with mode, task, and score level as predictors of the use of 23 grammatical complexity features. We used General Linear Models (GLM) in SAS for the statistical analyses, setting an experiment-wise probability level of .05, which requires p < .002 for each individual test (i.e. .05/23 ≈ .002). 5 Four categorical variables were used as independent variables: mode (spoken or written), task (independent or integrated), score level (1, 2, 3, 4), and test taker. The last variable is required because most of the test takers included in our sample produced multiple texts included in the corpus. Statistically, this is a type of repeated measures design, and thus it is necessary to control for the possible influence of the individual test taker. However, in this case, this variation is also of theoretical interest, because such variation might reflect patterns of individual language use or development.

For those models that were significant, we considered the effects of each independent variable and all interactions using Type III sums of squares. (This approach includes variation that is unique to an effect after adjusting for all other effects that are included in the model and is especially important in the present study because the subcategory samples are not balanced.) Because these post hoc findings are more exploratory, we report all differences that are significant at the level of p < .05.
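For readers who want a concrete picture of this modelling step, the sketch below fits one such full factorial model in Python with statsmodels. The analyses reported here were run as GLMs in SAS, so this is only an approximation for illustration; the data frame, its column names, and the feature name are assumptions, and sum-to-zero contrasts are used so that Type III sums of squares are interpretable with the unbalanced cell sizes.

```python
# A minimal sketch of one of the 23 full factorial models summarized in
# Table 3, here for a single (hypothetical) outcome column holding the
# normalized rate of finite adverbial clauses. This approximates, in
# statsmodels, the SAS GLM analysis described in the text; it is not that
# analysis itself.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def fit_factorial_model(df: pd.DataFrame, feature: str, n_features: int = 23):
    # Sum-to-zero contrasts (C(x, Sum)) make Type III sums of squares
    # interpretable when the design is unbalanced.
    formula = (f"{feature} ~ C(mode, Sum) * C(task, Sum) * C(score, Sum)"
               " + C(test_taker, Sum)")
    fitted = smf.ols(formula, data=df).fit()
    # Experiment-wise correction across the 23 features: .05 / 23 is roughly .002.
    alpha = 0.05 / n_features
    return anova_lm(fitted, typ=3), alpha

# Usage, assuming one row per response with columns 'finite_adverbial_clauses',
# 'mode', 'task', 'score', and 'test_taker':
# anova_table, per_test_alpha = fit_factorial_model(df, "finite_adverbial_clauses")
# print(anova_table); print("per-test alpha:", per_test_alpha)
```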

Table 3 summarizes the results of the factorial comparisons. Most of these features are associated with significant and important differences in the TOEFL iBT Corpus, with overall model r2 values ranging from c. 40 to c. 75 per cent. These significant models are mostly associated with strong differences between the spoken and written modes and with differences between independent versus integrated tasks. In addition, 16 of these features have significant interaction effects between mode and task type. These findings highlight the importance of task-type differences in the TOEFL iBT. The interaction effects also indicate that task communicative purpose (i.e. integrated versus independent tasks) impacted the use of grammatical complexity features differently depending on the mode.

Table 3:

Summary of the full factorial models for 23 grammatical features associated with complexity

Linguistic feature | Model p | R2 | Mode (SP/WR) | Task | Score level | Mode × Task | Mode × Score | Task × Score | Mode × Task × Score | Test taker
Word length | <.0001 | .652 | <.0001 | <.0001 | <.01 | <.0001 | ns | <.05 | ns | <.0001
Passive voice verbs | <.0001 | .539 | <.0001 | <.0001 | ns | <.0001 | ns | <.001 | ns | <.001
Clausal and | <.0001 | .558 | ns | <.05 | ns | <.05 | ns | ns | ns | <.0001
Adverbs | <.0001 | .483 | <.0001 | <.0001 | ns | <.0001 | ns | ns | ns | <.01
Linking adverbials | <.0001 | .442 | ns | <.05 | ns | <.05 | ns | ns | ns | <.0001
Nouns | <.0001 | .702 | <.0001 | <.0001 | ns | <.0001 | ns | <.001 | ns | <.0001
Nominalizations | <.0001 | .774 | <.0001 | <.001 | ns | <.001 | <.01 | <.01 | <.01 | <.0001
Prepositional phrases | <.0001 | .557 | <.0001 | <.05 | ns | ns | ns | ns | ns | <.0001
Of genitives | <.0001 | .509 | <.0001 | <.0001 | ns | <.0001 | ns | ns | ns | <.0001
Attributive adjectives | <.0001 | .474 | <.0001 | <.0001 | <.05 | ns | <.05 | ns | ns | <.001
Premodifying nouns | <.0001 | .564 | <.0001 | <.0001 | ns | <.0001 | ns | <.01 | ns | <.0001
Finite adverbial clauses | <.0001 | .421 | <.0001 | <.0001 | ns | <.05 | ns | ns | ns | <.001
WH complement clauses | ns
Verb + that-clause | <.0001 | .533 | ns | <.0001 | <.05 | <.0001 | ns | ns | ns | <.0001
Adjective + that-clause | ns
Noun + that-clause | <.0001 | .472 | <.05 | <.0001 | ns | <.0001 | ns | ns | ns | <.0001
Verb + to-clause | ns
Desire verb + to-clause | <.0001 | .413 | ns | <.0001 | ns | <.0001 | ns | <.05 | ns | ns
Adjective + to-clause | <.0001 | .429 | ns | <.0001 | ns | ns | ns | ns | ns | <.01
Noun + to-clause | <.0001 | .484 | <.0001 | <.0001 | ns | <.0001 | ns | <.05 | ns | <.0001
Verb + ing-clause | ns
Finite relative clauses | <.0001 | .408 | ns | <.01 | ns | <.01 | ns | ns | ns | <.0001
Passive non-finite relative clause | <.0001 | .508 | <.0001 | <.0001 | ns | <.0001 | ns | <.0001 | ns | <.0001

a SP = Speaking; WR = Writing.


For those features that are associated with significant differences, Table 4 summarizes the patterns of use found in the post hoc analyses, identifying the particular mode/task/score level that used the feature most. The features listed in Table 4 are grouped to highlight those that behave in similar ways. Two major categories of grammatical complexity features emerge from the analysis:

  1. those that are more frequent in speech and in independent tasks;

  2. those that are more frequent in writing and in integrated tasks; some of these features are also more common in high-scoring responses.

The complexity features associated with speech and independent tasks include total adverbs, finite adverbial clauses, and desire verbs (especially want) controlling a to-clause. Clauses connected by and are also more common in speech, but tend to be used in (lower-scoring) integrated responses.

Table 4:

Summary of the major patterns for linguistic features across mode (speaking versus writing), task (independent versus integrated), and score level

Linguistic features that are generally more common in speech, independent tasks, and lower score levels | SP | WR | IND | INT | Score level | Interactions
Adverbs | ++ | | ++ | | | Most in spoken independent texts
Finite adverbial clauses | ++ | | ++ | | | Most in spoken independent texts; rare in written integrated texts
Desire verb + to-clause | | | ++ | | | Most in spoken independent low-scoring texts
Clausal and | | | | | | More in spoken integrated low-scoring texts
Linguistic features that are generally more common in writing, integrated tasks, and higher score levels  Mode  Task  Score level  Interactions 
 SP WR IND INT  
Word length  ++  ++  Spoken independent has the shortest words; written integrated high-scoring has longer words 
Passive voice verbs  ++  ++   More in written integrated high-scoring texts 
Nouns  ++  ++   Most in written integrated texts 
Nominalizations  ++  ++   Most in written integrated texts 
Prepositions  ++    Most in written integrated texts 
Noun + of -phrase   ++  ++   Most in written integrated texts 
Attributive adjectives  ++  ++  Most in written integrated high-scoring texts 
Premodifying nouns  ++  ++   Most in written integrated texts 
Verb + that -clause     ++  Most in high-scoring written integrated texts 
Noun + that -clause    ++   Most in written integrated texts 
Passive –ed relative clauses   ++  ++   Most in high-scoring written integrated texts 
Noun + to -clause   ++ ++    Most in written independent texts 
Finite relative clauses      More in written integrated texts 

Based on significance for the main effects and interaction effects, considered together with the descriptive statistics for each group. + marks main effects at <.05; ++ marks main effects at <.001.

a SP = Speaking; WR = Writing. b IND = Independent; INT = Integrated.


The features associated with writing and integrated tasks are mostly noun phrase features: nouns, nominalizations, and noun phrase modifiers (prepositional phrases, noun + of -phrase, attributive adjectives, premodifying nouns, noun + that -clause, noun + to -clause, and passive –ed relative clauses). Longer words, which can be morphologically derived forms, also have the same distribution. In addition, a few verbal/clausal features are associated with writing and/or integrated tasks: passive voice verbs and verb + that -clause constructions.

These patterns show that there are strong differences in the language produced in response to different task types. If we interpret these findings by reference to recent research on grammatical complexity in spoken and written registers ( Biber 2009 ; Biber and Gray 2010 ; Biber et al. 2011 ), the patterns conform to our expectations for task-type variation. Specifically, when we consider phrasal complexity features (especially modifiers of complex noun phrases), we see that writing is more linguistically complex than speech, and that integrated tasks are more grammatically complex than independent tasks. The findings regarding dependent clause features confirm previous research showing that different clause types function in different ways. For example, finite adverbial clauses and verb + to -clauses are more common in (spoken) independent tasks than in (written) integrated tasks. In contrast, non-finite dependent clauses modifying nouns (i.e. passive –ed relative clauses and to noun complement clauses) are more common in written integrated tasks.

Somewhat surprisingly, score level is not a significant predictor of variation for most grammatical complexity features, either as an independent factor or in interaction with mode/task. ( Biber and Gray 2013 : 117–125 provides descriptive statistics for these linguistic differences.) The test-taker control variable, however, is a significant and strong predictor of variation for nearly all of these linguistic complexity features. That is, test takers vary considerably in their use of these linguistic features, but this variation has little or no relation to TOEFL iBT score level.

While this finding is disconcerting at first, it is perhaps not surprising when considered within the context of the validity argument for the TOEFL iBT, because score levels are not intended to directly measure linguistic development in the use of particular lexico-grammatical features. That is, grading rubrics for the TOEFL iBT specify numerous criteria (e.g. the extent to which the response is well organized and developed, uses clear explanations and examples, and addresses the topic/task), and the use of appropriate lexico-grammatical features is given comparatively little weight. As a result, most grammatical features (including complexity features) show little relationship to the holistic ratings of quality represented by TOEFL iBT scores (see discussion in Biber and Gray 2013 : 64–68). In addition, the role of accuracy in relation to TOEFL iBT scores is not addressed in this investigation.

However, a further explanation for the lack of relationship between individual grammatical features and score level is the possibility that instructors/raters and students/examinees are much more attuned to constellations of linguistic features used effectively than to the use of any individual linguistic feature ( Jarvis et al. 2003 ; Friginal et al. 2014 ). As noted in Sections 1.3 and 1.4 above, analyses of register variation are more robust when they are based on sets of co-occurring linguistic features, and previous research suggests that linguistic analyses of learner proficiency will similarly be more informative when approached from that perspective. To explore these possibilities, we undertook an MD analysis of the patterns of linguistic variation in the TOEFL iBT corpus.

4. AN MD ANALYSIS OF TASK, MODE, AND SCORE-LEVEL DIFFERENCES ON THE TOEFL iBT

For the present MD analysis, we began with the grammatical complexity features analysed in Section 3 above, plus a few additional features that have functional importance (e.g. pronouns, past tense verbs). Some features were dropped from the analysis because they shared little variance with the overall factorial structure (as shown by communality estimates less than .15, or factor loadings less than ±.30 on all factors). Twenty-eight linguistic features were retained for the final factor analysis. A four-factor solution was chosen for the subsequent analysis, based on consideration of the scree plot, eigenvalues, and interpretability of the factors; taken together, these four factors account for 44% of the shared variance. 6 Following initial extraction, the factor solution was rotated using a Promax rotation. Only small correlations (less than ±.3) exist among the factors. Full details of the factor analysis, including the interpretation of all four factors, are provided in Biber and Gray ( 2013 : 50–62, Appendices K and L). Table 5 summarizes the important linguistic features loading on each factor (i.e. features with factor loadings over ±.3).
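To make this procedure concrete, the following sketch illustrates the same general type of exploratory factor analysis in Python. It is not the analysis actually reported here (which is documented in full in Biber and Gray 2013); it assumes the third-party factor_analyzer package and a hypothetical file of normalized feature rates (one numeric column per linguistic feature, one row per response).

```python
# Sketch only: the original analysis was not necessarily run in Python.
# 'toefl_feature_rates.csv' is a hypothetical file of normalized rates of
# occurrence (one row per response, one numeric column per feature).
import pandas as pd
from factor_analyzer import FactorAnalyzer

feature_rates = pd.read_csv('toefl_feature_rates.csv')

# Fit an initial solution, then drop features that share little variance with
# the overall factorial structure (communalities < .15); in the study, features
# with no loading of at least |.30| on any factor were also dropped.
fa = FactorAnalyzer(n_factors=4, rotation='promax')
fa.fit(feature_rates)
communalities = pd.Series(fa.get_communalities(), index=feature_rates.columns)
retained = communalities[communalities >= 0.15].index

# Re-fit the four-factor Promax solution on the retained features.
fa_final = FactorAnalyzer(n_factors=4, rotation='promax')
fa_final.fit(feature_rates[retained])
loadings = pd.DataFrame(fa_final.loadings_, index=retained,
                        columns=[f'Dim{i}' for i in range(1, 5)])

# Report only salient loadings (|loading| >= .30), as in Table 5.
print(loadings.where(loadings.abs() >= 0.30).round(2))
```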

Table 5:

Summary of the important linguistic features loading on dimensions 1–4


 
Dimension 1: ‘Literate’ versus ‘Oral’ responses 

 
     Features with positive loadings: 
    nouns: common nouns (.64), concrete nouns (.64), premodifying nouns (.39) 
    prepositional phrases (.52), noun + of -phrase (.47)  
    adjectives: attributive (.61), topical (.40) 
    word length (.40) 
    passives: finite (.41), postnominal (.32) 
     Features with negative loadings: 
    verbs: present tense (−.33), mental verbs (−.62), modal verbs (−.36) 
    pronouns: 3 rd person (−.55)  
     that -clauses: controlled by likelihood verbs (−.45), that -omission (−.48)  
    finite adverbial clauses (−.31) 

 
Dimension 2: ‘Information source: Text versus personal experience’ 

 
     Features with positive loadings: 
    nouns (.37), place nouns (.45), premodifying nouns (.39) 
    3 rd person pronouns (.41)  
     that -clauses controlled by communication verbs (.68)  
    communication verbs (.80) 
     Features with negative loadings: 
    pronouns: 1 st person (−.33), 2 nd person (−.39)  
    abstract nouns (−.37) 

 
Dimension 3: ‘Abstract opinion versus Concrete description/summary’ 

 
     Features with positive loadings: 
    word length (.49) 
    nouns: nominalizations (.62), mental nouns (.51), abstract nouns (.38) 
    noun + to -complement clause (.33)  
    mental verbs (.31) 
     Features with negative loadings: 
    concrete nouns (−.38), activity verbs (−.47) 

 
Dimension 4: ‘Personal narration’ 

 
     Features with positive loadings: 
    1 st person pronouns (.35), past tense verbs (.74)  
     Features with negative loadings: 
    present tense verbs (−.70) 

 

The underlying assumption of MD analysis is that linguistic co-occurrence patterns have a functional basis: linguistic features occur together in texts because they serve related communicative functions. The dimensions are therefore interpreted in functional terms, based on (i) analysis of the communicative function(s) most widely shared by the set of co-occurring features, and (ii) analysis of the similarities and differences among registers with respect to the dimension. In the present case, the following functional labels are proposed:

  • Dimension 1: ‘Literate’ versus ‘Oral’ responses

  • Dimension 2: ‘Information source: Text versus personal experience’

  • Dimension 3: ‘Abstract opinion versus Concrete description/summary’

  • Dimension 4: ‘Personal narration’

Factor scores for each text are computed by summing the rates of occurrence of the features having salient loadings on that factor. (The rates of occurrence are standardized before computing factor scores, so that all linguistic features have the same scale, with an overall corpus mean of 0.0 and units of ±1 representing one standard deviation.)
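As an illustration of how such factor scores can be computed, the following sketch standardizes each feature across the corpus and then adds the salient positive-loading features and subtracts the salient negative-loading features, following standard MD practice. The feature names and file are hypothetical, and only a subset of the Dimension 1 features is listed.

```python
# Sketch of Dimension 1 score computation (standard MD practice: z-score each
# feature, add salient positive-loading features, subtract salient
# negative-loading features). Feature and file names are illustrative only.
import pandas as pd

def dimension_score(rates: pd.DataFrame, positive: list, negative: list) -> pd.Series:
    z = (rates - rates.mean()) / rates.std()   # corpus mean = 0.0, SD = 1
    return z[positive].sum(axis=1) - z[negative].sum(axis=1)

dim1_positive = ['common_nouns', 'premodifying_nouns', 'prepositional_phrases',
                 'attributive_adjectives', 'word_length', 'finite_passives']
dim1_negative = ['present_tense_verbs', 'mental_verbs', 'modal_verbs',
                 'third_person_pronouns', 'finite_adverbial_clauses']

# One row per TOEFL iBT response, one column per feature (rates per 1,000 words).
rates = pd.read_csv('toefl_feature_rates.csv')   # hypothetical file
rates['dim1'] = dimension_score(rates[dim1_positive + dim1_negative],
                                dim1_positive, dim1_negative)
```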

As Table 6 shows, all four dimensions are significant and strong predictors of differences among the TOEFL iBT task types; the GLM models for three of the four dimensions have R2 values of c. 65 per cent, while the fourth dimension has an R2 value of almost 50 per cent. Mode (speech versus writing) and task (independent versus integrated) are significant factors for all four dimensions. Score level has a weaker relationship to these linguistic dimensions, but it is a significant predictor for Dimension 1, and significant in interaction with mode or task for Dimensions 1, 2, and 4. The discussion below focuses on Dimension 1 because it is defined by several co-occurring linguistic complexity features and is the strongest predictor of both score-level and mode/task differences.
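A minimal sketch of this kind of full factorial model, assuming Python's statsmodels and hypothetical column names, is given below; the original models were not necessarily fit this way, and the choice of sums-of-squares type would need to match the original analysis.

```python
# Sketch of a full factorial model like those summarized in Table 6
# (hypothetical file and column names).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.read_csv('toefl_dimension_scores.csv')   # hypothetical file

# Dimension 1 score predicted by mode, task, score level, their interactions,
# and test taker as a control variable.
model = smf.ols('dim1 ~ C(mode) * C(task) * C(score_level) + C(test_taker)',
                data=data).fit()
print(anova_lm(model, typ=2))   # significance of main effects and interactions
print(round(model.rsquared, 3))  # overall model R2
```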

Table 6:

Summary of the full factorial models for dimensions 1–4

Dimension Model R2  Mode a (SP/WR)  Task Score level Mode × Task Mode × Score Task × Score Mode × Task × Score Test taker 
Dimension 1: ‘Literate versus Oral responses’ <.0001 .685 <.0001 <.0001 <.01 <.0001 ns <.05 ns <.0001 
Dimension 2: ‘Information source’ <.0001 .678 <.0001 <.0001 ns <.0001 ns <.01 ns ns 
Dimension 3: ‘Abstract versus Concrete’ <.0001 .654 <.0001 <.0001 ns <.0001 ns ns ns <.01 
Dimension 4: ‘Personal narration’ <.0001 .485 <.05 <.0001 ns <.0001 ns ns <.01 ns 

a SP = Speaking; WR = Writing.


Figure 1 plots the mean scores for each TOEFL iBT text category with respect to Dimension 1. This dimension is easy to interpret functionally, because it is so similar to Dimension 1 in previous MD studies of other discourse domains ( Biber 1988 , 2006 , 2014 ). Dimension 1 is composed of both ‘positive’ and ‘negative’ features. The two groupings constitute a single dimension because they occur in complementary distribution: when the positive features occur with a high frequency in a text, that same text will have a low frequency of negative features, and vice versa. Consideration of the co-occurring linguistic features that define this dimension, together with the distribution of text categories shown in Figure 1 , leads to the interpretive label ‘Literate versus oral tasks’ for Dimension 1.

Figure 1:

Dimension 1 scores across task types and levels


The positive features on Dimension 1 are mostly nouns and other phrasal complexity features used to modify noun phrases (i.e. nouns premodifying a head noun, attributive adjectives, of -phrases, and other prepositional phrases). These features co-occur with long words and passive constructions. A similar grouping of features has been found in nearly all previous MD studies, associated with written registers having informational purposes. In contrast, the negative features on Dimension 1 are verbs, pronouns, and finite dependent clause structures. In previous MD studies, such features have been associated with speech and with registers having involved/personal communicative purposes.

As noted above, Dimension 1 is a statistically significant predictor of both task-type and score-level differences (see Table 6 ). Figure 1 presents the results of post hoc comparisons, plotting the mean scores for each task type and score level. This plot shows that there is a highly systematic relationship between Dimension 1 scores and all situational parameters of variation on the TOEFL iBT:

  • written responses are consistently more ‘literate’ than spoken responses (shown by larger positive Dimension 1 scores);

  • within each mode, integrated responses are consistently more ‘literate’ than independent responses;

  • and within each mode/task category, higher scoring responses are consistently more ‘literate’ than lower scoring responses. 7

The written mode offers the most opportunity for careful production (including revision and editing), permitting the use of a nominal/phrasal discourse style. Integrated tasks are more informational in purpose, making extensive reference to background information (i.e. the reading and listening passages that students comprehend before text production). As a result, integrated tasks rely on more complex ‘literate’ (phrasal) grammatical characteristics. Interestingly, this pattern holds for both written and spoken task types.

Dimension 1 also predicts highly systematic differences across TOEFL iBT score levels, showing that raters are responsive to these same discourse characteristics. That is, within each task type, Level 4 responses are the most ‘literate’ in their Dimension 1 scores, while Level 1 and Level 2 responses are the most ‘oral’. Figure 1 shows that this pattern holds consistently for all four task types. Correlations between Dimension 1 values and TOEFL iBT score level, carried out separately for each task type, are all significant at p < .01 (see Table 7 ). At the same time, Table 7 shows that these correlations are not especially strong. This latter finding indicates that there is considerable variation in grammatical style among texts that have been assigned the same TOEFL iBT score, probably reflecting the fact that TOEFL raters must consider much more than grammatical complexity features when assigning holistic ratings (see discussion in Section 3 above). In ongoing research, we are utilizing cluster analysis to explore groupings of exam responses defined by their similar use of linguistic complexity features (see Staples and Biber to appear ).
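The correlations in Table 7 are simple Pearson correlations computed separately within each task type; a sketch of this step, with hypothetical file and column names, is shown below.

```python
# Sketch of the per-task-type correlations reported in Table 7
# (hypothetical file and column names).
import pandas as pd
from scipy.stats import pearsonr

data = pd.read_csv('toefl_dimension_scores.csv')   # hypothetical file

for (mode, task), group in data.groupby(['mode', 'task']):
    r, p = pearsonr(group['dim1'], group['score_level'])
    print(f'{mode}-{task}: r = {r:.2f}, p = {p:.3f}, r2 = {100 * r**2:.1f}%')
```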

Table 7:

Pearson correlations between Dimension 1 scores and TOEFL iBT score level, for each task type

Task type r p r2 (per cent)  
Spoken-independent .19 <.01 3.6 
Spoken-integrated .16 <.01 2.6 
Written-independent .39 <.01 15.2 
Written-integrated .13 <.01 1.7 
Table 8:

MANOVA results for mode, task, and score level for the 23 complexity features

 Wilks’ lambda value F value Num DF Den DF Pr > F 
Mode 0.35743631 183.68 23 2350 <.0001 
Task type 0.50377990 100.64 23 2350 <.0001 
Score Level 0.84527341 5.89 69 7021.4 <.0001 

In summary, the results of the MD analysis show that holistic measures determined through empirical corpus analysis—the dimensions—can be used to predict differences among task types and, to a lesser extent, exam score levels. These dimensions are motivated linguistically (based on a full set of linguistic complexity features) and functionally (interpreted by reference to previous research on spoken/written registers). As such, they provide an important alternative approach for assessing the patterns of linguistic variation in the context of standardized language exams.

5. DISCUSSION AND CONCLUSION

The present study has shown that task-type differences on standardized language exams—associated with both speech versus writing and with different communicative purposes—are systematically associated with linguistic differences, especially with the use of grammatical complexity features. These findings extend previous research on task-type variation in several respects.

First, they show that task-type differences are predictable, corresponding to the patterns of linguistic variation documented in previous research on spoken and written registers. However, they also show the importance of considering a wide range of linguistic complexity features, because different features function in different ways, and are therefore associated with different register patterns. In particular, phrasal noun modifiers (e.g. premodifying nouns, prepositional phrases, genitive of -phrases, and attributive adjectives) function very differently from clausal complexity features, and as a result, they are much more strongly associated with informational, written task types. In contrast, clausal complexity features (e.g. adverbial clauses and verb complement clauses) tend to be more strongly associated with personal, spoken task types.

These findings generally support the approach taken by the TOEFL iBT and other standardized language exams, which ask test takers to display their language abilities across multiple spoken and written tasks. The patterns of variation documented here for spoken/written independent/integrated tasks parallel the patterns found for spoken and written registers of university language ( Biber 2006 ), supporting the first two propositions of test validation listed in Enright and Tyson ( 2008 : Table 1 ): ‘The content of the test is relevant to and representative of the kinds of tasks and written and oral texts that students encounter in college and university settings’, and ‘tasks … are appropriate for obtaining evidence of test takers’ academic language abilities’. More generally, as Bachman and Palmer ( 1996 : 10) point out:

If we want to use the scores from a language test to make inferences about individuals’ language ability, and possibly to make various types of decisions, we must be able to demonstrate how performance on that language test is related to language use in specific situations other than the language test itself … That is, we need a framework that enables us to use the same characteristics to describe what we believe are the critical features of both language test performance and non-test language use.

The findings from the present study provide such a framework, showing that the patterns of linguistic variation documented for non-test registers can be used to predict important linguistic differences among the spoken/written task types on the TOEFL iBT.

From a linguistic perspective, the analyses here confirm previous research showing that different structural types of dependent clauses function in different ways, and thus have different distributions across task types. In particular, finite adverbial clauses and complement clauses are associated with personal, spoken task types, while non-finite relative clauses and phrasal noun modifiers are associated with informational, written registers. These findings strongly suggest that future research on grammatical complexity in applied linguistics should be modified in two respects:

  1. Phrasal complexity features should always be considered in addition to clausal complexity features, because phrasal devices are actually more indicative of informational written task types than dependent clause measures.

  2. Measures of complexity should differentiate among the structural types and syntactic functions of dependent clauses and phrases, because there is no theoretical or empirical evidence to support the use of a holistic measure based on overall length (i.e. one that disregards the distinctive discourse functions of the different types).

Finally, the analyses here provide strong support for the expectation that analyses based on linguistic co-occurrence patterns will be more robust than analyses based on individual linguistic features. This pattern holds both for the description of task types (spoken versus written, and independent versus integrated) and for exam score-level differences. With respect to task-type differences, the MD analysis identified highly systematic patterns of variation, with a sharp distinction between spoken and written responses, as well as systematic differences between independent and integrated tasks within each mode. The findings for proficiency differences were perhaps even more persuasive: while analyses based on individual complexity features generally failed to identify significant linguistic differences across score levels, Dimension 1 of the MD analysis identified a highly consistent pattern within each task type, with higher score levels using more of the ‘literate’ features associated with Dimension 1.

Based on these results, we are able to recommend the dimensions derived from MD analysis as a productive alternative to T-unit-based measures of complexity. The two approaches are similar in seeking a few holistic measures to represent the entire system of grammatical complexity. However, they differ radically in their linguistic and functional bases. Dimensions are based on analysis of individual linguistic complexity features, giving full weight to all structural and functional distinctions found in the grammatical system of English. These linguistic features are then grouped into a few holistic dimensions through empirical corpus-based analysis, which determines the sets of features that regularly co-occur in texts. In addition to having a more solid linguistic and empirical foundation than T-unit-based measures, the dimensions are also highly effective predictors of linguistic variation among both task types and score levels on standardized exams, as the analyses reported in Section 4 show. Taken together, these considerations provide a strong argument for the utilization of MD analyses in future investigations of task-type variation.

In sum, we have painted a complex portrait of intersecting influences on learner language production, including the production mode, the communicative purposes of the task, the score level of the exam responses, and the grammatical complexity characteristics focused on in the analysis. By considering all of these together, future research should provide much more insightful descriptions of complexity in learner discourse across task types than is possible with studies restricted to a small sector of this domain of use.

ACKNOWLEDGEMENTS

This project was funded by the TOEFL® Committee of Examiners and the TOEFL program at ETS.

Conflict of interest statement. None declared .

NOTES

1 Although Jarvis et al. (2003) advocate the importance of describing proficiency differences based on the co-occurrence of linguistic features, they do not actually carry out analyses of this type. Rather, they employ a cluster analysis to group together student essays with similar linguistic characteristics, and then carry out a post hoc analysis of the grammatical characteristics prevalent in each of those clusters. This approach allows inferences concerning the sets of linguistic features found in different writing styles, but there is no direct analysis of linguistic co-occurrence. That is, cluster analysis groups observations (i.e. texts or students), not variables. In contrast, factor analysis (the statistical technique used in MD analysis) groups variables (i.e. linguistic features that co-occur).
2 Interestingly, Hunt (1965) —who first proposed the use of T-unit measures—was well aware of the need for specific measures of grammatical complexity that are well-motivated from a linguistic perspective. Thus, Hunt included analyses of the depth of embedding for dependent clauses, as well as separate detailed analyses of complement clauses, relative clauses, and adverbial clauses (Chapter 5) as well as noun phrases and prepositional phrases as adverbials and noun modifiers (Chapter 6).
3 One spoken response for one test taker was missing from the data set. In most cases, the data set does not actually include complete exams (with all eight responses from a test taker), and thus there are 820 different test takers represented in our sample.
4 All texts for the Level 1 spoken tasks were excluded due to their short length.
5 As a preliminary step in the statistical analysis, MANOVAs were carried out, clearly rejecting the hypotheses of no overall effect for each independent variable predicting the 23 complexity features. See Table 8 on the next page for the results of the MANOVA.
6 This percentage is similar to the rates for other factor analyses of registers (e.g. the seven factors in Biber 1988 accounted for 51.9% of the shared variance).
7 The only exception to these patterns is the Dimension 1 value for Written Integrated Score Level 2 texts, which is slightly lower than the value for Written Integrated Score Level 1 texts.

REFERENCES

Bachman, L. F. and A. S. Palmer. 1996. Language Testing in Practice. Oxford University Press.
Banerjee, J., F. Franceschina, and A. M. Smith. 2007. ‘Documenting features of written language production typical at different IELTS band score levels,’ IELTS Research Reports 7: 249–309.
Beers, S. F. and W. E. Nagy. 2009. ‘Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre?,’ Reading and Writing 22: 185–200.
Biber, D. 1988. Variation across Speech and Writing. Cambridge University Press.
Biber, D. 1995. Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge University Press.
Biber, D. 2006. University Language: A Corpus-based Study of Spoken and Written Registers. John Benjamins.
Biber, D. 2009. ‘Are there linguistic consequences of literacy? Comparing the potentials of language use in speech and writing’ in D. R. Olson and N. Torrance (eds): Cambridge Handbook of Literacy. Cambridge University Press, pp. 75–91.
Biber, D. 2014. ‘Using multi-dimensional analysis to explore cross-linguistic universals of register variation,’ Languages in Contrast 14: 7–34.
Biber, D. and S. Conrad. 2009. Register, Genre, and Style. Cambridge University Press.
Biber, D. and B. Gray. 2010. ‘Challenging stereotypes about academic writing: Complexity, elaboration, explicitness,’ Journal of English for Academic Purposes 9: 2–20.
Biber, D. and B. Gray. 2013. Discourse Characteristics of Writing and Speaking Task Types on the TOEFL iBT Test: A Lexico-grammatical Analysis. TOEFL iBT Research Report (TOEFL iBT-19). Educational Testing Service.
Biber, D., B. Gray, and K. Poonpon. 2011. ‘Should we use the characteristics of conversation to measure grammatical complexity in L2 writing development?,’ TESOL Quarterly 45: 5–35.
Biber, D., B. Gray, and K. Poonpon. 2013. ‘Pay attention to the phrasal structures: Going beyond T-units—a response to WeiWei Yang,’ TESOL Quarterly 47: 192–201.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan. 1999. Longman Grammar of Spoken and Written English. Longman.
Brown, A., N. Iwashita, and T. McNamara. 2005. An Examination of Rater Orientations and Test-taker Performance on English-for-Academic-Purposes Speaking Tasks. TOEFL Monograph Series MS-29. Educational Testing Service.
Byrnes, H., H. Maxim, and J. Norris. 2010. ‘Realizing advanced L2 writing development in a collegiate curriculum: Curricular design, pedagogy, assessment,’ Modern Language Journal 94 (Monograph Issue): 1–221.
Chapelle, C. A., M. K. Enright, and J. M. Jamieson. 2008. Building a Validity Argument for the Test of English as a Foreign Language. Routledge.
Cumming, A., R. Kantor, K. Baba, U. Erdosy, K. Eouanzoui, and M. James. 2005. ‘Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL,’ Assessing Writing 10: 5–43.
Cumming, A., R. Kantor, K. Baba, U. Erdosy, K. Eouanzoui, and M. James. 2006. Analysis of Discourse Features and Verification of Scoring Levels for Independent and Integrated Tasks for the New TOEFL. TOEFL Monograph No. MS-30 (RM 05-13). Educational Testing Service.
Ellis, R. 2009. ‘The differential effects of three types of task planning on the fluency, complexity, and accuracy in L2 oral production,’ Applied Linguistics 30: 474–509.
Ellis, R. and F. Yuan. 2005. ‘The effects of careful within-task planning on oral and written task performance’ in R. Ellis (ed.): Planning and Task Performance in a Second Language. John Benjamins, pp. 168–92.
Enright, M. and E. Tyson. 2008. Validity Evidence Supporting the Interpretation and Use of TOEFL iBT Scores. TOEFL iBT Research Insight.
Ervin-Tripp, S. 1972. ‘On sociolinguistic rules: Alternation and co-occurrence’ in J. Gumperz and D. Hymes (eds): Directions in Sociolinguistics. Holt, pp. 213–50.
Friginal, E. 2013. ‘Twenty-five years of Biber’s multi-dimensional analysis: Introduction to the special issue and an interview with Douglas Biber,’ Corpora 8: 137–52.
Friginal, E., M. Li, and S. Weigle. 2014. ‘Revisiting multiple profiles of learner compositions: A comparison of highly rated NS and NNS essays,’ Journal of Second Language Writing 23: 1–16.
Halliday, M. A. K. 1989. Spoken and Written Language. Oxford University Press.
Halliday, M. A. K. 2004. The Language of Science. Continuum.
Housen, A., F. Kuiken, and I. Vedder. 2012. Dimensions of L2 Performance and Proficiency: Complexity, Accuracy and Fluency in SLA. John Benjamins.
Hunt, K. W. 1965. Grammatical Structures Written at Three Grade Levels. National Council of Teachers of English.
Jarvis, S., L. Grant, D. Bikowski, and D. Ferris. 2003. ‘Exploring multiple profiles of highly rated learner composition,’ Journal of Second Language Writing 12: 377–403.
Kormos, J. and A. Trebits. 2012. ‘The role of task complexity, modality, and aptitude in narrative task performance,’ Language Learning 62: 439–72.
Kuiken, F. and I. Vedder. 2012. ‘Syntactic complexity, lexical variation and accuracy as a function of task complexity and proficiency level in L2 writing and speaking’ in A. Housen, F. Kuiken, and I. Vedder (eds): Dimensions of L2 Performance and Proficiency: Complexity, Accuracy and Fluency in SLA. John Benjamins, pp. 143–70.
Lu, X. 2011. ‘A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writer’s language development,’ TESOL Quarterly 45: 36–61.
Nichols, J. 1984. ‘Functional theories of grammar,’ Annual Review of Anthropology 13: 97–117.
Norris, J. and L. Ortega. 2009. ‘Towards an organic approach to investigating CAF in SLA: The case of complexity,’ Applied Linguistics 30: 555–78.
Ortega, L. 2003. ‘Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing,’ Applied Linguistics 24: 492–518.
Parkinson, J. and J. Musgrave. 2014. ‘Development of noun phrase complexity in the writing of English for Academic Purposes students,’ Journal of English for Academic Purposes 14: 48–59.
Robinson, P. 2001. ‘Task complexity, task difficulty, and task production: Exploring interactions in a componential framework,’ Applied Linguistics 22: 27–57.
Robinson, P. 2007. ‘Task complexity, theory of mind, and intentional reasoning: Effects on L2 speech production, interaction, uptake and perceptions of task difficulty,’ International Review of Applied Linguistics 45: 193–213.
Skehan, P. 1998. A Cognitive Approach to Language Learning. Oxford University Press.
Staples, S. and D. Biber. To appear. ‘Cluster analysis’ in L. Plonsky (ed.): Advancing Quantitative Methods in Second Language Research. Routledge.
Taguchi, N., W. Crawford, and D. Z. Wetzel. 2013. ‘What linguistic features are indicative of writing quality? A case of argumentative essays in a college composition program,’ TESOL Quarterly 47: 420–30.
Tavakoli, P. and P. Foster. 2008. ‘Task design and second language performance: The effect of narrative type on learner output,’ Language Learning 58: 439–73.
Way, D. P., E. G. Joiner, and M. A. Seaman. 2000. ‘Writing in the secondary foreign language classroom: The effects of prompts and tasks on novice learners of French,’ Modern Language Journal 84: 171–84.
Zheng, Y. and J. H. A. L. De Jong. 2011. Establishing Construct and Concurrent Validity of Pearson Test of English Academic.