  • Formulaic Sequences

  • Language Learning and Language Teaching

    The LL&LT monograph series publishes monographs as well as edited volumes on applied and methodological issues in the field of language pedagogy. The focus of the series is on subjects such as classroom discourse and interaction; language diversity in educational settings; bilingual education; language testing and language assessment; teaching methods and teaching performance; learning trajectories in second language acquisition; and written language learning in educational settings.

    Series editors

    Birgit Harley, Ontario Institute for Studies in Education, University of Toronto

    Jan H. Hulstijn, Department of Second Language Acquisition, University of Amsterdam

    Volume 9

    Formulaic Sequences: Acquisition, processing and use
    Edited by Norbert Schmitt

  • Formulaic Sequences: Acquisition, processing and use

    Edited by

    Norbert Schmitt, University of Nottingham

    John Benjamins Publishing Company

    Amsterdam/Philadelphia

  • The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences - Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

    Library of Congress Cataloging-in-Publication Data

    Formulaic sequences : acquisition, processing and use / edited by Norbert Schmitt.

    p. cm. (Language Learning and Language Teaching, ISSN 1569-9471; v. 9)

    Includes bibliographical references and indexes.

    1. Language and languages--Study and teaching. 2. Lexicology. 3. Pattern perception. I. Schmitt, Norbert, 1956- II. Series.

    P53. F654 2004
    407-dc22 2004041065
    ISBN 90 272 1707 6 (Eur.) / 1 58811 499 6 (US) (Hb; alk. paper)
    ISBN 90 272 1708 4 (Eur.) / 1 58811 500 3 (US) (Pb; alk. paper)

    © 2004 John Benjamins B.V.
    No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.

    John Benjamins Publishing Co., P.O. Box 36224, 1020 ME Amsterdam, The Netherlands
    John Benjamins North America, P.O. Box 27519, Philadelphia PA 19118-0519, USA

  • Contents

    Preface

    Formulaic sequences in action: An introduction
    Norbert Schmitt and Ronald Carter

    Measurement of formulaic sequences
    John Read and Paul Nation

    Formulaic performance in conventionalised varieties of speech
    Koenraad Kuiper

    Knowledge and acquisition of formulaic sequences: A longitudinal study
    Norbert Schmitt, Zoltán Dörnyei, Svenja Adolphs, and Valerie Durow

    Individual differences and their effects on formulaic sequence acquisition
    Zoltán Dörnyei, Valerie Durow, and Khawla Zahran

    Social-cultural integration and the development of formulaic sequences
    Svenja Adolphs and Valerie Durow

    Are corpus-derived recurrent clusters psycholinguistically valid?
    Norbert Schmitt, Sarah Grandage, and Svenja Adolphs

    The eyes have it: An eye-movement study into the processing of formulaic sequences
    Geoffrey Underwood, Norbert Schmitt, and Adam Galpin

    Exploring the processing of formulaic sequences through a self-paced reading task
    Norbert Schmitt and Geoffrey Underwood

    Comparing knowledge of formulaic sequences across L1, L2, L3, and L4
    Carol Spöttl and Michael McCarthy

    The effect of typographic salience on the look up and comprehension of unknown formulaic sequences
    Hugh Bishop

    Here's one I prepared earlier: Formulaic language learning on television
    Alison Wray

    Facilitating the acquisition of formulaic sequences: An exploratory study in an EAP context
    Martha Jones and Sandra Haywood

    Index

  • To my colleagues at the University of Nottingham

  • Preface

    Lexical patterning is an increasingly important issue in applied linguistics as it becomes ever more apparent that such patterning pervades most language use. This is not a new insight, with numerous scholars referring to such patterning over the years. However, these scholars have used a wide range of terminology for the phenomenon, and the research has been scattered across various fields. This led to a quite limited awareness of lexical patterning in the applied linguistics field in general, and it was only relatively recently that the efforts of scholars like Nattinger and DeCarrico, Sinclair, Moon, Kuiper, Wray, and Biber have led to it becoming more widely known.

    A considerable amount of the research has attempted to describe the nature of various lexical patterns (idioms, collocations, sentence stems, etc.), often based on corpus evidence. Other research has looked at the role of formulaic patterns in the acquisition of first language. Beyond this, there is little research which has focused on lexical patterns in second language acquisition, or on the whole issue of how lexical patterns are processed in the mind. The time seemed ripe for research addressing these areas.

    A team at the Centre for Research in Applied Linguistics (CRAL) at the University of Nottingham was able to carry out a cycle of research into lexical patterning, and this volume reports on our findings. During our investigations, we became aware that other lexically-minded scholars around the world were concurrently carrying out studies in the same area, and some of their work is also included in this book. As a package, we feel that the studies in this volume are not only interesting in terms of their findings, but also in terms of variety of methodology used. We have included the full research instrumentation wherever possible for the interested reader.

    I would like to thank several people for making this volume possible. Zoltán Dörnyei, my co-director at CRAL, generated the grant that funded the whole process, and was there through all of the ups and downs of the research. Svenja Adolphs, Valerie Durow, Sarah Grandage and Khawla Zahran were the other core team members without whom nothing would have happened. Colleagues at the Centre for English Language Education (CELE) at the University of Nottingham allowed access to their students, and I would like to particularly thank Rebecca Hughes, Martha Jones, and Sandra Haywood. Geoffrey Underwood was a most helpful collaborator who helped open up exciting new methodologies in the study of formulaic sequences. I am grateful to non-CRAL colleagues who have contributed welcome additions to the book: Hugh Bishop, Koenraad Kuiper, Paul Nation, John Read, Carol Spöttl, and Alison Wray. In particular, I would like to thank Alison Wray and Koenraad Kuiper for their very insightful input, which improved the entire project immensely. Jan Hulstijn and Birgit Harley proved to be supportive and insightful series editors and it is a pleasure to have this volume in their series. Kees Vaes was a most friendly and efficient liaison at John Benjamins Publishing. The Economic and Social Research Council supported the research with Grant #R000239294.

    I have enjoyed being part of this research, and hope that you find much of interest in these studies. If you become interested in researching this area yourself, all the better. Many of these studies are innovative now, but it would be wonderful if we could look back in ten years and marvel at how much we had progressed.

    Norbert Schmitt
    University of Nottingham
    November 2003

  • Formulaic sequences in action: An introduction

    Norbert Schmitt and Ronald Carter
    University of Nottingham

    Introduction

    Formulaic sequences are ubiquitous in language use (Nattinger and DeCarrico, 1992: 66) and they make up a large proportion of any discourse. Erman and Warren (2000) calculated that formulaic sequences of various types constituted 58.6% of the spoken English discourse they analyzed and 52.3% of the written discourse. Using different criteria and procedures, Foster's raters judged that 32.3% of the unplanned native speech they analyzed was made up of formulaic language (Foster, 2001). If formulaic sequences are so widespread in English discourse, it follows that proficient English speakers must have knowledge and mastery of these sequences at some level. A number of scholars claim that this knowledge is extensive. For example, Pawley and Syder (1983: 213) suggest that the number of sentence-length expressions familiar to the ordinary, mature English speaker probably amounts, at least, to several hundreds of thousands. Jackendoff (1995) concludes from a small corpus study of spoken language in a TV quiz show that formulaic sequences may be of equal if not greater significance than the lexicon of single words, while Mel'čuk (1995), who uses the term phraseology, claims even greater overall significance for such sequences. The idea that proficient language users know numerous formulaic sequences is intuitive, but it must be said that the above claims are made by assertion, as there is little empirical work to substantiate them. However, they do fit well with Sinclair's (1991) view that language as a whole is organised according to two main structuring principles: an open choice principle and an idiom principle, with the latter involving the widespread use of formulaic stretches of words.¹ Furthermore, this store of formulaic sequences is dynamic and is constantly changing to meet the needs of the speaker (Wray, 2002: 101). Even if the above claims prove to be somewhat overstated, it is clear that lexical patterning does exist in English, and therefore must have some consequences in terms of how English is acquired, processed, and used.

    Some types of formulaic sequence have always been obvious in the form of idioms, proverbs, and sayings. These sequences noticeably operate as single units at some level, even though their form consists of multiple orthographic words. The fact that these multi-word units express a single meaning made them stand out. In the case of idioms, their meaning could not be derived from the sum of meanings of the component words and they did not always follow the rules of grammar. These multiword units were often relegated to a peripheral category by scholars; acknowledged, but dismissed as having only a minor role in language (see Wray, 2002). The advent of computerized corpus studies made additional patterning evident, and it soon became clear that lexical patterning was not limited to these obvious multiword units (e.g. Biber et al. 1999).²

    In fact, formulaic sequences seem to exist in so many forms that it is presently difficult to develop a comprehensive definition of the phenomenon. This lack of a clear definition remains one of the foremost problems in the area. Some commonly-used criteria come from the area of corpus linguistics, such as institutionalization, fixedness, and non-compositionality, which Moon (1997: 44) suggests are key characteristics of what she calls multi-word items. Another often-cited criterion is frequency of occurrence, on the assumption that if a sequence is frequent in a corpus, this indicates that it is conventionalised by the speech community, at least to some extent. In general, corpus definitions are concerned with identifying and describing formulaic sequences as they occur throughout a corpus.
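
    The frequency-of-occurrence criterion can be illustrated with a minimal sketch that counts recurrent contiguous word sequences in a text. This is only an illustration under stated assumptions: the function name recurrent_ngrams, the crude tokenizer, the sample text, and the threshold of two occurrences are all hypothetical choices made here, not procedures taken from the studies cited above.

```python
import re
from collections import Counter


def recurrent_ngrams(text, n=3, min_freq=2):
    """Return contiguous n-word sequences that occur at least min_freq times.

    Hypothetical helper: tokenization is a simple lowercase word match,
    which is far cruder than what a real corpus tool would use.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}


# Illustrative sample text (invented for this sketch, not corpus data).
sample = (
    "By the way, I know what you mean. "
    "I know what you mean, but by the way it is late."
)
result = recurrent_ngrams(sample, n=3, min_freq=2)
print(result)
```

    Note that raw recurrence also picks up fragments such as "i know what", which is one reason frequency alone is usually treated as evidence of conventionalisation rather than as a sufficient definition.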

    These criteria are useful, but are not the only possible way to view formulaic sequences. Psycholinguists and language acquisition specialists focus on criteria which determine whether sequences are known by individual participants, and whether these sequences are formulaic and stored as wholes in the participants' mental lexicon. Thus criteria are used such as whether a sequence of words is produced more than once by a participant (indicating that the sequence is known and not just a one-off imitation of a sequence heard by the participant) and whether it is produced with an intact intonation contour (suggesting the sequence is stored as a whole).

    Although linguistic and psycholinguistic criteria have been developed for different purposes, any satisfying description of formulaic sequences probably needs to draw on both perspectives. Thus the next section will utilize insights from both linguistic and psycholinguistic traditions as it explores some of the characteristics of formulaic sequences.

    Selected characteristics of formulaic sequences

    One of the reasons it is difficult to define formulaic sequences lies in their diversity. For example, formulaic sequences can be long (You can lead a horse to water, but you can't make him drink) or short (Oh no!), or anything in between. They are commonly used for different purposes. They can be used to express a message or idea (The early bird gets the worm = do not procrastinate), functions ([I'm] just looking [thanks] = declining an offer of assistance from a shopkeeper), social solidarity (I know what you mean = agreeing with an interlocutor), and to transact specific information in a precise and understandable way (Wind 28 at 7 = in aviation language this formula is used to state that the wind is 7 knots per hour from 280 degrees). They realize many other purposes as well, as formulaic sequences can be used for most things society requires of communication through language. These sequences can be totally fixed (Ladies and Gentlemen) or have a number of slots which can be filled with appropriate words or strings of words ([someone/thing, usually with authority] made it plain that [something as yet unrealised was intended or desired]). With this diversity in mind, it is little wonder that different researchers have looked at formulaic sequences and seen different things, resulting in a variety of terminology to express various perspectives. The range of this terminology is evident from the fact that Wray (2002: 9) found over fifty terms to describe the phenomenon of formulaic language. Below is a sample:

    chunks                    formulaic speech    multiword units
    collocations              formulas            prefabricated routines
    conventionalised forms    holophrases         ready-made utterances

    The scope of this list made it difficult to even decide on a cover term to use for the notion of formulaic language in this chapter. We have decided to use the term formulaic sequence based on a definition by Wray (2002: 9):

    a sequence, continuous or discontinuous, of words or other elements, which is, or appears to be, prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar.

    This term covers a wide range of formulaic language, and touches on two key criteria of the emphasis in this book: a) we are concerned with sequences of lexis and b) the mind handles, or appears to handle, these sequences at some level of representation as wholes. However, using this definition, Wray argues that even single words and morphemes can be seen as formulaic sequences. In this chapter we wish to focus primarily on multi-word sequences of lexis and so initially searched for other terms. The term formula is often used, but usually to mean a string of formulaic language with idiosyncratic conditions of use, and so is not really suitable for use as a cover term. Similarly, lexical phrase is used by Nattinger and DeCarrico (1992) to emphasize the relationship between formulaic language and functional language use. When we were considering the various possible terms, each with their own particular bias, Koenraad Kuiper was most helpful in pointing out that there are two underlying properties which define the language phenomenon we are trying to capture: a) the units of formulaic language are not merely any sequence of words, but phrases, and b) they are lexical items exactly like other lexical items such as words, and with the same properties as words would have if they were phrases. This line of reasoning leads to two obvious terms, phrasal lexical item and phrasal lexeme, and we considered carefully the adoption of such terms. However, even bearing in mind such distinctions, we settled in the end on formulaic sequence (FS) as the most comprehensive term for our investigations.³

    The term formulaic sequence is thus intentionally all-encompassing, covering a wide range of phraseology. Since there is so much diversity, it is difficult to identify absolute criteria which define formulaic sequences. Rather it is probably more useful to discuss characteristics which are typical of formulaic sequences, even though every example lexeme might not exhibit each characteristic. Wray and Perkins (2000, Figure 2) provide an extensive listing of these characteristics. Also, the interested reader will find Wray (2002), a book-length treatment of formulaic language to which much of this chapter is indebted, an excellent resource. Assuming that the reader is familiar with the basic conceptual background regarding formulaic sequences, in this section we will overview a few of the characteristics which we find particularly interesting.

    Formulaic sequences appear to be stored in the mind as holistic units, but they may not be acquired in an all-or-nothing manner.

    There is plenty of evidence to suggest that formulaic sequences are typically stored and processed as unitary wholes, even if this is not true in every case. Perhaps the most obvious evidence lies in semantically-opaque formulaic sequences, such as idioms, where the meaning of the sequence cannot be derived from knowledge of the component words. The only way to know the meaning of the idiom is to have learned it as a sequence. There is also evidence on the phonological front: formulaic sequences are typically spoken more fluently, with a coherent intonation contour, to the extent that this has been accepted as one criterion of formulaicity (e.g. van Lancker, Canter, and Terbeek, 1981; Peters, 1983, p. 10). Moreover, Pawley and Syder (1983) assert that formulaic sequences offer processing efficiency because single memorized units, even if made up of a sequence of words, are processed more quickly and easily than the same sequences of words which are generated creatively. This assertion is supported by evidence from Kuiper (1996, this volume) and his colleagues (Kuiper and Haggo, 1984), who show that smooth talkers (auctioneers, sportscasters) use formulaic language a great deal in order to fluently convey large amounts of information under severe time constraints. In addition to this productive advantage, there seems to be a receptive advantage as well. Underwood, Schmitt and Galpin (this volume) demonstrate that words, when they are part of formulaic sequences, are read more quickly than the same words when embedded in non-formulaic text.

    One might also assume that there is a processing-based reason behind the fact that the preferred realization of many functions (e.g. making apologies, requesting) is one or more formulaic sequences. For example, when shifting a topic, we commonly use a formulaic sequence like by the way, but create novel phrases such as It's time for a topic change much more rarely. If creatively-generated language was cognitively more efficient, we would not expect to find formulaic sequences realizing functional language usage nearly as frequently as we do in corpus evidence.

    Formulaic sequences generally appear to be processed as wholes and it is likely that many are also learned as wholes, especially short salient ones like Go Away! However, there are good arguments for why some formulaic sequences are not learned in an all-or-nothing manner. Some first language (L1) acquirers seem to acquire an initial phonological mapping of formulaic sequences proceeding from the whole to the individual parts, but with some elements still incompletely grasped, especially the unstressed phonemic constituents (Peters, 1977; Wray, 2002, Chapter 6). In these cases, the formulaic sequences are learned over time, with the later stages of acquisition consisting of filling in the gaps in the initial incomplete rendering of the sequence. Likewise, some of the component words in the formulaic sequence, as well as the syntactic structure, may not be known initially either. Peters (1983) suggests that these elements may be later extracted from the formulaic sequence through a process of segmentation. Another way formulaic sequences are learned over time involves the flexible slots many formulaic sequences have which can be filled with semantically-appropriate words or phrases. If the formulaic sequences are initially acquired with these slots as part of the structure, one might expect that it would take longer to learn the appropriate language insertions for these slots than to learn the fixed elements of the sequence. Alternatively, if the slots are created when paradigmatic variation is noticed at one location in a previously fully-fixed string, then this learning is also incremental in the sense that a fixed formulaic sequence must first be acquired before it is analyzed to form a formulaic sequence with slots. Moreover, shorter formulaic sequences can be combined together into longer and more complex formulaic sequences (Peters, 1983: 73), which means that the component formulaic sequences need to be learned as the initial step to acquiring the subsequent formulaic sequence.

    The transparency of formulaic sequences might also affect the learning burden. Formulaic sequences lie on a continuum of transparency/opaqueness, with idioms at the obscure end, but with many sequences being quite transparent at the other end (my point (here) is that _____). It may well be that transparent sequences are learned in a somewhat different manner than opaque sequences, perhaps even being generated online in the first instance through knowledge of the individual component words and knowledge of syntactical sequencing.

    The learning of one kind of lexeme (individual words) is incremental and produces different learning burdens (Schmitt, 2000; Nation, 1990), and there is no reason to believe that other types of lexeme (i.e. formulaic sequences) are any different in this respect. This would suggest that many formulaic sequences are partially known for a number of exposures until the point where they become mastered.

    The question of complete, holistic acquisition vs. incremental acquisition of formulaic sequences is an interesting one, because the answers may eventually determine which formulaic sequences are practical to teach to second language (L2) learners.

    Formulaic sequences can have slots to enable flexibility of use, but the slots typically have semantic constraints.

    We have mentioned that some formulaic sequences are completely fixed strings of words, while others have slots in addition to their fixed elements. There is no doubt that in some cases, fixedness is an advantage. For example, Watch Out! is an instantly recognizable warning, precisely because it is fixed, and little processing should be required to understand it. We could shout something like Watch the car coming behind you!, but if milliseconds count, then a shorter, more conventionalised warning is likely to be most effective. However, it is an advantage in much of language use to allow more flexibility of meaning. For example, if we wish to express the notion that some activity or achievement is unusual, unexpected, or exceptional, then we can use phrases like Diane thinks nothing of running 5 miles before breakfast or He thinks nothing of driving 100 miles per hour on the freeway. The underlying structure to these sentences is _____ thinks nothing of _____, which allows the flexibility to express the unexpected notion in a wide variety of situations. This scaffold can aid fluent language because some of the language is already preassembled and can be used in a variety of situations.

    The slots in this type of formulaic sequence are not always completely open, however; there are often semantic constraints which control which word or words can be used in the slots. In the example above, the second slot must capture the idea of something unusual or unexpected, precisely because that is the reason for using this particular formulaic sequence. Note how the sentence She thinks nothing of sleeping 8 hours per night sounds strange because sleeping that amount of time is usual. Conversely, She thinks nothing of sleeping 14 hours per night seems acceptably surprising.

    Our intuitions say that these flexible formulaic sequences are widely-used in discourse, simply because they are adaptable to a wide range of situations. We would expect this suggested broad usage to be evident in corpora. The evidence may well be in the data, but the problem is that flexible formulaic sequences are difficult to identify using current concordancing packages. Modern concordancers are good at identifying contiguous sequences, but we do not yet have software which can identify flexible formulaic sequences automatically from corpora. Once this software is developed, we may find that flexible formulaic sequences are even more prevalent than totally fixed ones.
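
    To make the idea of a fixed core with open slots concrete, the sketch below searches a text for one frame that is already known in advance, _____ thinks nothing of _____. It is a hand-written pattern for a single frame, not the automatic discovery of flexible sequences that the paragraph above describes as unavailable. The names FRAME and find_frame and the choice of verb forms are assumptions made for this illustration.

```python
import re

# Hypothetical pattern for one known frame: a capitalised subject slot,
# the fixed core "thinks/think/thought nothing of", and an activity slot
# that runs to the end of the sentence.
FRAME = re.compile(
    r"(?P<subject>\b[A-Z]\w*)\s+"
    r"(?:thinks|think|thought)\s+nothing\s+of\s+"
    r"(?P<activity>[^.!?]+)"
)


def find_frame(text):
    """Return (subject, activity) pairs for each match of the frame."""
    return [
        (m.group("subject"), m.group("activity").strip())
        for m in FRAME.finditer(text)
    ]


# The two example sentences come from the discussion above.
sample = (
    "Diane thinks nothing of running 5 miles before breakfast. "
    "He thinks nothing of driving 100 miles per hour on the freeway."
)
matches = find_frame(sample)
for subject, activity in matches:
    print(f"{subject!r:10} -> {activity!r}")
```

    A pattern of this kind captures only the form of the frame. It would match She thinks nothing of sleeping 8 hours per night just as readily as the 14-hour version, so the semantic constraint on the second slot still has to be judged by a human reader.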

    Formulaic sequences can have semantic prosody.

    Individual words (other than technical vocabulary) usually have a relatively wide range of usage. For example, the noun form of the word border can mean a political boundary, a geophysical boundary, the edge of something like a piece of fabric, and the verb form can mean being adjacent to such a boundary. However, once the word border is used syntagmatically with other words (e.g. bordering on), its usage can become constrained. Consider the following concordance lines from the British National Corpus (BNC):

    managers with an abandon bordering on carelessness.
    demonstrated an intransigence bordering on arrogance.
    been consumed, struck me as bordering on the ill-mannered.
    class were treated with distrust bordering on disdain.
    sat in a state of sullenness bordering on rage or had conspicuously
    fundamentally disturbed, and bordering on the deeply neurotic or worse.
    area to the south-east of Cumbria, bordering on Lancashire.
    drawn up to which all states bordering on its coasts should adhere.
    or emerging from property bordering on a road, give way to pedestrians
    Choose a good hotel, even bordering on the luxurious if you can.

    Of the 100 instances of bordering on in the BNC, 27 do refer to a physical location, but by far the most frequent usage (57 instances) carries the meaning of approaching an undesirable state (of mind). This majority usage entails a negative evaluation of the situation which is key to the meaning sense it imparts.⁴ This type of evaluation has been referred to as semantic prosody (Sinclair, 2004), and is a feature of a number of formulaic sequences.⁵ Sinclair illustrates how rife behaves similarly:

    Male chauvinism was rife in medicine in those days.
    Fears are now rife that the price could plunge well below 30p by the end of the year.

    Proficient language users know that rife is used to express the meaning something undesirable is too common, and that the formulaic sequence in which rife is embedded typically has the following structure:

    SOMETHING UNDESIRABLE is/are rife in LOCATION/TIME.

    To project the formulaic sequence's meaning, one slot has the semantic constraint something nasty or undesirable. Likewise, the sequence inevitably carries a negative connotation, because that is the primary reason this sequence is used. Knowledge of this allows the correct interpretation of the following as an assertion that there are too many artists in the panel system, even though this is not explicitly stated.

    The panel system is rife with artists.

    Thus, just as single words can carry register/appropriacy marking (skinny has a more pejorative marking than thin), formulaic sequences can carry semantic prosody, and it often is a key element of the sequence's meaning. So it seems clear that formulaic sequences can carry semantic prosody, but to our knowledge no one has done research into how many do and how many do not. This merely reinforces our impression that there is still a lack of research into many important aspects of formulaic sequences.
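
    Concordance lines like those for bordering on in this section are the output of a key-word-in-context (KWIC) search. The following is a minimal sketch of such a search, assuming a plain string as the corpus; the function name kwic, the bracketed display of the node, and the context width are illustrative choices rather than features of any particular concordancing package, and the two-sentence sample is adapted from the fragments quoted in this section rather than drawn from the BNC itself.

```python
import re


def kwic(text, node, width=30):
    """Return key-word-in-context lines for every occurrence of node.

    Hypothetical helper: each line shows up to `width` characters of
    left and right context around the node, which is shown in brackets.
    """
    lines = []
    for m in re.finditer(re.escape(node), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


# Illustrative text adapted from two of the concordance fragments.
corpus = (
    "He demonstrated an intransigence bordering on arrogance. "
    "The area lies to the south-east of Cumbria, bordering on Lancashire."
)
lines = kwic(corpus, "bordering on", width=25)
for line in lines:
    print(line)
```

    Deciding whether each right-hand context names an undesirable state or a physical location, which is the basis for the counts reported above, remains a manual classification step that this sketch does not attempt.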

    Formulaic sequences are often tied to particular conditions of use.

    The term formulaic sequence is deliberately inclusive, and contains a number of different kinds of patterned language. As mentioned earlier, some formulaic sequences are relatively obvious in terms of opacity of meaning and/or fixedness of form and so have been defined and discussed for quite some time: e.g. phrasal verbs, idioms, proverbs, and fixed binomials/trinomials. However, even with these established categories of patterned language, definitions depending solely on descriptions of form and meaning are sometimes not completely clear. For example, most proverbs are semantically opaque, and would be classified as idioms on the basis of that, so what is the difference between them? One way of differentiating the two is their conditions of use. Idioms are typically used to express a concept (put someone out to pasture = retire someone because they are getting old), while proverbs typically state some commonly believed truth or advice (The longest journey begins with the first step = a suggestion not to procrastinate, but to begin a long process by taking the first necessary steps).

    In addition to these traditionally-recognized categories, we would argue that conditions of use can also be used to fruitfully discuss a broader range of formulaic sequences. Wray (2002, Chapters 4-7) offers a comprehensive exploration of the roles that formulaic sequences have in children and adults, but here we can highlight only a few key reasons why formulaic sequences are used in communication.

    It has been found that recurring situations in the social world require certain responses from people. These are often described as functions, and include such (speech) acts as apologizing, making requests, giving directions, and complaining. These functions typically have conventionalized language attached to them, such as I'm (very) sorry to hear about ____ to express sympathy and I'd be happy/glad to _______ to comply with a request (Nattinger and DeCarrico, 1992: 62-63). Because members of a speech community know these expressions, they serve as a quick and reliable way to achieve the related speech act. Nattinger and DeCarrico suggest that the use of formulaic sequences for functional purposes is widespread, and we are inclined to agree, but believe that the research is too thin on the ground to truly know the extent of their use.

    One common type of function which is often realized by formulaic sequences is maintaining social interaction. People the world over engage in light conversation for pleasure or to pass the time of day. In these cases, the purpose of communication is unlikely to be serious attempts to exchange information or to get someone to do something. Rather, the content is less important than the fact that there is a semblance of communication. In these cases, people rely on a set of conventionalised phatic phrases which are non-threatening and help keep the conversation flowing. Examples include comments about the weather (Nice weather today; Cold isn't it), agreeing with your interlocutor (Oh, I see what you mean; OK, I've got it), providing backchannels and positive feedback to another speaker (Did you really?; How interesting). As Kecskes (2003) points out in a study of what he terms situation-bound utterances, such sequences have the purpose both of acting as social lubrication and of actively co-constructing interpersonal communication.

    Another specific function formulaic sequences realize is that of discourse organization. This is well known to EAP specialists, who commonly teach various discourse markers in writing classes (in other words, in conclusion). Spoken discourse is also rich in these organizing phrases, for example: on the other hand (expressing an alternative viewpoint), to put it another way (re-phrasing), as I was saying, speaking of which (providing links to previous utterances).

    Sometimes the purpose of using formulaic sequences is to transact information in a precise and efficient manner. Technical words in a field realize this purpose (scalpel is a specific type of knife used in medicine), but technical vocabulary does not have to be limited to single words. Indeed, in many fields exact phraseology is stipulated to avoid any possible misunderstanding. In aviation language, the phrase "Cleared to land" gives the pilot very specific rights and responsibilities. Likewise, the conventionalised way of reporting blood pressure is "blood pressure is 140 over 60", and everyone in the medical field knows to place the higher pressure figure first. This specific type of technical formulaic sequence is likely to be quite prevalent in technically-based discourse, but again, nobody has yet researched its true extent.

    There are other purposes which formulaic sequences carry out as well, as illustrated in Wray (2002). Additional ones are likely to emerge with further research. Because formulaic sequences have so many important and recurrent uses, it should not be surprising that such patterns are frequent in language. Moreover, because particular sequences are tightly linked to particular language functions or information, our interlocutors expect them, and they are the preferred choice. Thus formulaic sequences are not only useful for efficient language usage; they are essential for appropriate language use.


    The acquisition of formulaic sequences

    For about two decades, there has been a steadily increasing amount of research being done on vocabulary in general (see Meara, 1987, 1992, 2003), and with it we are also starting to see more interest in formulaic language. Corpus-based research has informed the field by identifying formulaic language and describing how it is used in discourse. The body of continental work has largely focused on such issues as lexicography, the phraseology of regional dialects, and text linguistics (Kon Kuiper, personal communication). However, it is probably fair to say that the amount of research into the acquisition of formulaic sequences has been fairly modest in comparison (see Wray, 2002, for the most comprehensive overview; also Weinert, 1995).

    There is a consensus that some L1 acquirers do learn and use formulaic sequences before they have mastered the sequences' internal makeup. Moreover, the acquisition of formulaic sequences might depend to some extent on whether children are referential or expressive learners, that is, whether they are system learners more than they are item-learners (Cruttenden, 1981) (see also Brown, 1973 and Peters, 1983). Nelson (1973) found that children who had referential preferences (naming things or activities and dealing with individual word items) usually learned more single words, particularly nouns. Conversely, children who had more expressive tendencies (having interactional goals; focusing on the social domain) were more likely to learn whole expressions which were not segmented. The reason for these preferences may be psycholinguistic in nature (Bates and MacWhinney, 1987), or may only reflect what the child supposes the language to be useful for: predominantly naming things in the world or engaging in social interaction (Nelson, 1981: 186). It may also reflect the input a child receives: games for naming things in the world or social control clumps such as "D'ya wanna go out?" (Nelson, 1981). Regardless of the underlying reason, there seems to be a link between the need and desire to interact and the use of formulaic sequences.

    In L2 acquisition, formulaic sequences are also relied on initially as a quick means to be communicative, albeit in a limited way. This can lead to quicker integration into a peer group, which can result in increased language input. Wong Fillmore (1976) found this was the case with five young Mexican children trying to integrate into an English-medium school environment. She identified eight strategies the children used, and at least three of them directly involved formulaic language:


    Give the impression, with a few well-chosen words (phrases), that you speak the language.

    Get some expressions you understand, and start talking.

    Look for recurring parts in the formulas you know.

    The use of formulaic sequences enabled the realization of these strategies even though the children's language capabilities were quite limited. Furthermore, the use of formulaic sequences to facilitate language production is not restricted to L2 children. Schmidt's (1983) study of Wes is a good example of the phenomenon in L2 adults; Wes's speech is filled with formulaic language as a means of fulfilling his desire to be communicative, but not necessarily accurate.

    But formulaic sequences may provide language learners with more than an expedient way to communicate; they might also facilitate further language learning. For L1 learners, it has been proposed that unanalysed sequences provide the raw material for language development, as they are segmented into smaller components and grammar (see Peters, 1983). If so, it is possible that they serve the same purpose for L2 learners (e.g. Bardovi-Harlig, 2002). However, even if this proves not to be the case, there is little doubt that the automatic use of acquired formulaic sequences allows chunking, freeing up memory and processing resources (Kuiper, 1996, and Ellis, 1996, who explores the interaction between short-term and long-term phonological memory systems). These can then be utilized to deal with conceptualising and meaning, which must surely aid language learning. Wood (2002: 5) nicely summarizes the possible double role of formulaic sequences in language acquisition:

    They are acquired and retained in and of themselves, linked to pragmatic competence and expanded as this aspect of communicative ability and awareness develops. At the same time, they are segmented and analyzed, broken down, and combined as cognitive skills of analysis and synthesis grow. Both the original formulas and the pieces and rules that come from analysis are retained.

    So sequence-based learning seems to have a part to play in language acquisition. A key question is how large a part it plays compared to grammar-based acquisition. Wray and Perkins (2000) and Wray (2002) argue that the balance of sequence-based versus grammatically-generated language varies during an L1 child's development. During Phase 1 (birth to around 20 months), the child will mainly use memorized vocabulary for communication, largely learned through imitation. Some of this vocabulary will be single words, and some will consist of sequences. At the start of Phase 2 (until about age 8), the child's grammatical awareness begins, and the proportion of analytic language compared to holistic language increases, although with overall language developing quickly in this phase, the amount of holistically-processed language is still increasing in real terms. During Phase 3 (until about age 18), the analytic grammar is fully in place, but formulaic language again becomes more prominent. During this phase, language production increasingly becomes a top-down process of formula blending as opposed to a bottom-up process of combining single lexical items in accordance with the specification of the grammar (Wray and Perkins, 2000: 21). By Phase 4 (age 18 and above), the balance of holistic to analytic language has developed into adult patterns.

    The course of formulaic sequence development is more difficult to chart in L2 learners. Typically there is early use of formulaic sequences, often after a silent period. As learners' proficiency improves, there is the reasonable expectation of language which is more accurate and appropriate. In natives, this is achieved to a large extent through the use of formulaic sequences. Unfortunately, the formulaic language of L2 learners tends to lag behind other linguistic aspects (Irujo, 1993). This may be partly due to a lack of rich input: Irujo (1986) suggests that idioms are often left out of speech addressed to L2 learners. Learners also seem to avoid the use of idiomatic language (Kellerman, 1978), although this may have more to do with the degree of L1–L2 similarity than any intrinsic difficulty (Laufer and Eliasson, 1993; Laufer, 2000; Vihman, 1982: 272). There is also the tendency to stick with familiar and safe sequences which the learners feel confident in using (Granger, 1998), although De Cock (2000) found that some formulaic sequences were overused, some underused, and others simply misused by nonnatives when compared to native norms. These tendencies have been noted by researchers, but overshadowing all of these results is the great variation in L2 use of formulaic sequences, which must at least partially stem from the fact that L2 learners are a diverse group in terms of age, manner of acquisition, L1, social environment, etc. (Wray, 2002: 144). There may well be an underlying systematicity to the acquisition and use of L2 formulaic language, but there is simply not enough focused research at present to say very much with conviction.

    One interesting development is the emergence of pattern-based models of acquisition, which posit that the human facility for language learning is based on the ability to extract patterns from input, rather than being under the guidance of innate principles and parameters which determine what aspects of grammar can and cannot be acquired (see Ellis, 1996, 2002, SSLA 24). This line of thinking suggests that we learn the letter sequences which are acceptable in a language (the consonant cluster sp can be word-initial in English, but hg cannot) simply by repeatedly seeing sp at the beginning of words, but not hg. This learning is implicit, and may not be amenable to conscious metalinguistic explanation. Of course, learners may eventually reach the point where they can declare a rule for this consonant clustering, but the rule is an artefact of the pattern-based learning, rather than the underlying source of learning. This pattern-based learning also works for larger linguistic units, such as how sequences of morphemes can combine to form words (un-question-able, un-reli-able, un-fathom-able). Moving to words, we gain intuitions about which words collocate together and which do not (blonde hair, *blonde paint; auburn hair but only for women, not men). Many of these collocations must be based solely on pattern recognition, because there is often no semantic reasoning behind acceptable/nonacceptable pairings (*blonde paint makes perfect logical sense). Neither are collocations likely to be learned explicitly, because they are not normally taught, and even if they are, only possible cases are illustrated, not inappropriate combinations. Longer formulaic strings, which are also based on patterns rather than rules, seem to fit very nicely with such sequence-based models of acquisition as well. Time will tell whether this kind of model best captures the mechanics of formulaic sequence acquisition (and that of language in general), but one thing seems certain. Given the increasingly evident importance of formulaic sequences in language use, convincing explanations of the mechanics of their acquisition must become an essential feature of any model of language acquisition.
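    The sp/hg example can be illustrated with a very small sketch. It is only a toy illustration of counting word-initial letter pairs in the input a learner has seen, not a model proposed in the chapter or in Ellis's work; the word list and the function name are hypothetical.

```python
from collections import Counter


def initial_cluster_counts(words, size=2):
    """Count how often each letter sequence of length `size` begins a word."""
    counts = Counter()
    for word in words:
        word = word.lower()
        if len(word) >= size:
            counts[word[:size]] += 1
    return counts


# Hypothetical toy input standing in for the words a learner has encountered.
TOY_INPUT = ["spin", "spot", "speak", "spare", "hair", "high", "laugh", "sight", "paint"]

counts = initial_cluster_counts(TOY_INPUT)
sp_count = counts["sp"]  # word-initial "sp" is attested in the input
hg_count = counts["hg"]  # word-initial "hg" never occurs, so the count is zero
```

    On this toy input the count for sp is positive and the count for hg is zero, which is the kind of distributional evidence a pattern-based account assumes learners accumulate without being told any rule.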

    Issues explored in this volume

    This volume has two main purposes. It reports on some of the first sustained research into the acquisition, processing, and use of formulaic sequences. Equally important, it utilizes a wide range of methodologies to explore formulaic sequences, some of them used for the first time. As such, the volume models methodological directions for future research in this area, and illustrates how innovative research methods can be fruitfully applied.

    It is difficult to fit the chapters in this volume into neat categories, but some logical grouping was possible. The first three chapters provide backgrounding for the studies to follow. Chapters 4–6 report on the acquisition-based CRAL studies. Chapters 7–9 report on the CRAL studies focusing on the processing of formulaic sequences. The next two chapters do not fit into any particular category, but Chapters 12 and 13 have a definite pedagogic element. The rest of this section provides brief overviews of the volume chapters.


    It should be clear from the brief overview in this chapter that numerous issues need to be explored concerning how formulaic sequences are acquired, processed, and used. This requires research, and most of this research will be empirical. This means that valid and reliable measures of formulaic sequences need to be developed or refined. Read and Nation consider measurement methodology in Chapter 2, providing an overview of issues which need to be considered when tapping formulaic sequence knowledge.

    Much of everyday language is conventionalized, and this conventionalization is realized by various types of formulaic sequence. However, there are some kinds of language which are exceptionally conventionalised. Some examples of this are language which routinely covers the same topics over and over again (weather reporting, oral heroic poems), language where speed is important (auctioneering, sports reporting), and language where very precise formulations are required (air traffic control). Exploration of how formulaicity is involved in this kind of language use can provide insights into how it is used in more general circumstances. In Chapter 3, Kuiper reviews his and other research into highly conventionalized language and highlights the advantages of formulaic sequences in this language, as well as showing how the acquisition of situation-specific formulaic sequences (and the attending cultural knowledge) requires a long-term learning process. The reader should be aware, however, that Kuiper uses somewhat different terminology and definitions concerning formulaic language than most of the other chapters in this volume.

    Corpus evidence shows that formulaic sequences are widespread in native language. However, some research indicates that nonnatives have limited mastery of a limited number of formulaic sequences. Schmitt et al. address this issue directly in Chapter 4. The research team measured the productive and receptive knowledge of academically-based formulaic sequences in EAP students studying to enter British universities. They found that the students knew a surprising number of the formulaic sequences even before they entered the program, and knew most of them after the program finished, indicating that learning had taken place. Somewhat surprisingly, though, the attitude/motivation and aptitude factors measured as part of the study did not predict this improvement.

    Even though the participants in the above study were able to improve their knowledge of formulaic sequences as a group, obviously some learners improved more than others. Using the classic good learner/poor learner design, in Chapter 5 Dörnyei, Durow, and Zahran explore four successful and three unsuccessful learners in detail using a series of extended interviews. From this rich one-on-one data, they find that success in acquiring formulaic sequences seems to be strongly related to the participants' active involvement in the English-speaking social community. Unfortunately, some of the international students in this study found it extremely difficult to join host-national networks. The study suggests that if sociocultural adaptation is absent, only a combination of particularly high levels of language aptitude and motivation can compensate for this lack.

    The theme of socio-cultural integration is investigated in depth in Chapter 6. Adolphs and Durow analyze the spoken output of one high-integration student and one low-integration student to track their use of formulaic sequences over seven months at a British university. In the first analysis, the participants' production of 3-word formulaic sequences is tallied, and only the high-integration student seems to show any real progress. However, this tally shows only the number of sequences produced, not their quality. The authors carry out a second analysis in which they first compile a list of the most frequent 15 words in the participants' output, and then run a sequence analysis to identify the sequences which form around these words (e.g. know → I don't know). The sequences from the participants' production are subsequently compared to CANCODE norming data. Based on this analysis, the high-integration student clearly outperforms the low-integration student, providing additional evidence for the importance of socio-cultural integration in the acquisition and use of formulaic sequences.
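    The general shape of such a two-step analysis (find the most frequent words, then collect the sequences that form around them) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure Adolphs and Durow actually used: the tokenizer, the function names, and the toy transcript are hypothetical, and the comparison with CANCODE norming data is not shown.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n):
    """Return the n most frequent word types in the token list."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def sequences_around(tokens, key_word, length=3):
    """Count every run of `length` consecutive tokens that contains key_word."""
    counts = Counter()
    for i in range(len(tokens) - length + 1):
        window = tokens[i:i + length]
        if key_word in window:
            counts[" ".join(window)] += 1
    return counts


# Hypothetical toy transcript; a real study would use hours of recorded speech.
TOY_TRANSCRIPT = "i don't know what you mean but i don't know if you know what i mean"

tokens = tokenize(TOY_TRANSCRIPT)
frequent = top_words(tokens, 2)
know_sequences = sequences_around(tokens, "know")
most_common_sequence = know_sequences.most_common(1)[0]
```

    In this toy transcript, "know" is among the two most frequent words, and the 3-word sequence that recurs around it is "i don't know". A tally of this kind says nothing by itself about whether a sequence is nativelike, which is why the chapter compares the learners' sequences against corpus norms.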

    Corpus analysis has shown that there are a great number of word clusters which recur at varying degrees of frequency within a corpus. However, what does the existence of recurrent clusters in corpora tell us about how those clusters are stored and processed by the human mind? In Chapter 7, Schmitt, Grandage, and Adolphs embed a variety of recurrent clusters drawn from corpus analysis into a psycholinguistic dictation task to see how natives and nonnatives are able to reproduce those clusters. The results show that, for the natives, although some of those clusters are likely to be stored holistically in the mind, a large number are not. The nonnative performance suggests that very few of the clusters are holistically stored in a way that would facilitate accessible retrieval and fluent use. The authors conclude that it cannot be assumed that recurrent clusters identified through corpus techniques are necessarily stored in the mind in a holistic manner.

    The next two chapters explore how formulaic sequences are processed, using techniques borrowed from psychology. In Chapter 8, apparatus is employed which tracks the eye movements of participants as they read passages in which formulaic sequences are embedded. Underwood, Schmitt, and Galpin find that both natives and nonnatives have fewer eye fixations on words which are part of a formulaic sequence than on the same words when they are part of non-formulaic text. The natives also focus on the formulaic sequence words for shorter durations, although the gaze periods for nonnatives do not differ between formulaic and nonformulaic words. The overall results indicate that there is a processing advantage for formulaic sequences, at least in terms of reading.

    In Chapter 9, Schmitt and Underwood use the same passages with embedded formulaic sequences, but this time the task for participants is to read the passage one word at a time within a self-paced reading paradigm. The participants tap a button to bring up each subsequent word in a passage, and the time between taps is measured. In contrast to the above study, this technique shows no difference in recognition speed between the words in their formulaic vs. non-formulaic environments. However, for the nonnative participants, words appearing in formulaic sequences that were known are recognized faster than words in unknown formulaic sequences. This may well reflect the difficulty the nonnatives have with the unknown formulaic sequences. Overall, the results are less than clear, and the authors suggest that the self-paced reading technique needs to be refined for further investigations.

    Formulaic sequences seem to be a common feature across languages. Thus knowing a formulaic sequence in one language may affect the way it is learned in another. Spöttl and McCarthy (Chapter 10) examine participants who knew, or were learning, three or more languages and compare their knowledge of formulaic sequences across those languages. A think-aloud protocol analysis found that participants move between formulaic sequences among their various languages in mainly three ways: 1) the formulaic sequence is translated between languages holistically, without hesitation, repetition, or evaluation, 2) when the initial attempt at translation fails, the formulaic sequence itself is repeated and various possibilities are evaluated, and 3) when the initial attempt at translation fails, the individual words of the formulaic sequence are repeated (but not the whole sequence), and a search process is initiated which focuses on those words or the grammar of the language. The second approach is found to be most common, and a number of strategies are identified within this approach. The authors also find that their participants are not particularly good at assessing their true knowledge of target formulaic sequences.

    A perpetual question in pedagogy is how to present target items to learners. Presumably anything that makes those items more salient or noticeable is beneficial for learning. In Chapter 11, Bishop explores whether the use of typographical highlighting (underlining and red font) of words and formulaic sequences encourages nonnative learners to click on those items for glosses. Participants look up more glosses for unknown words than unknown formulaic sequences for unhighlighted items, but for highlighted items, this result is reversed. This indicates that such highlighting can make formulaic sequences more noticeable. It has been claimed that formulaic sequences are less easily recognizable as holistic entities than words, because unlike words with spaces around them to indicate their boundaries, it is not clear where the boundaries of unknown formulaic sequences lie. If this is true, then highlighting the form of formulaic sequences can make their wholeness apparent, which may facilitate learning.

    It has often been assumed that formulaic sequences take a long time to acquire. However, what would happen if they were taught intensively over as short a period as five days? Wray (Chapter 12) reports on a learner taking part in the British television program Welsh in a Week. The participant studies formulaic sequences with the purpose of becoming sufficiently fluent with a limited amount of Welsh in order to meet the challenge of a public presentation. However, although the learner understands that she would be most successful if she simply memorized the material given to her, by five months after her performance she had introduced typical learner errors into what she remembered of the original material. This suggests that the adult learner's need to analyze linguistic material is unavoidable, and implies that the teaching of formulaic material to post-pubescent learners may be an uphill struggle.

    Jones and Haywood also take a pedagogical approach in Chapter 13, but this time in a traditional EAP classroom. They report on their efforts to develop materials for and to teach formulaic sequences to their students over a period of ten weeks. The students are initially sceptical about the value of focusing on formulaic sequences, but seem to eventually realize their importance. The authors carefully track their students and find some evidence of modest gains in formulaic sequence knowledge on a test by the end of the study, although there is no substantial evidence of this in the students' writing. However, there is clear evidence that the students had increased their awareness of formulaic sequences in general.

    Other lines of research into formulaic sequences

    This volume reports on research specifically into the acquisition, processing, and use of formulaic sequences. But in the end it is only one book and cannot hope to cover the many diverse questions which beg for answers. A few of these questions are listed here as intriguing prompts for any researcher who might want to pursue studies in this important developing area.

    1. Once learned, are formulaic sequences overused or underused in terms of the norms of stylistic appropriacy of the speech community, in the same way individual words can be over- or underused?

    2. How are formulaic sequences acquired in naturalistic and formal settings? What is the same/different about learning formulaic sequences in these settings? What is the best way to teach formulaic sequences? Can they be taught at all?

    3. What is the relationship between knowledge of formulaic sequences and knowledge of their individual component words?

    4. How many exposures are necessary to learn formulaic sequences with various kinds of input? Is it the same as for individual words?

    5. What is the nature of attrition of formulaic sequences? Are some elements retained better than others, or is the whole chunk either retained or forgotten?

    6. Which elements of a formulaic sequence are most salient? Do formulaic sequences cluster around a key word or core collocation?

    7. Are formulaic sequences learned in an all or nothing manner?

    8. Does giving attention to formulaic sequences increase the chances of their acquisition?

    There are numerous other questions and we hope that this volume will be followed by many exploring this area. If it is accepted that formulaic sequences play an important part in language use, then any further research can only add to our knowledge of second language acquisition, linguistic theory, and many other applied linguistic areas.

    Notes

    1. Sinclair illustrates how both principles are essential but that attention has, especially within the Chomskyan tradition, normally been devoted mainly to the former principle.

    2. It should be noted that continental researchers have treated multiword units as an important feature of language for decades. However, they often published in German and Russian, and so their impact was not as great as it might have been in the Anglophone world. For entry into some of this research, see Zgusta (1971), Aisenstadt (1981), Mel'čuk (1981), Howarth (1996), Cowie (1998), and Burger (2003).


    3. Some authors in this book have chosen to use other terms for various reasons, but formulaic sequence will be the cover term used in most chapters.

    4. Bordering on is also used to express positive evaluation, as in the hotel example, in a minority of cases (9 instances out of the 100).

    5. Stubbs (1995) describes the same phenomenon, referring to it as collocational prosody. Also, see Stubbs (2002) for a range of corpus-based studies of formulaic sequences.

    Acknowledgements

    Our deepest appreciation goes to Alison Wray and Kon Kuiper who gave us detailed feedback on an earlier draft of this chapter. Their comments were invaluable in helping us to sharpen our thinking and much of what is good in the chapter draws heavily upon those comments.

    References

    Aisenstadt, E. 1981. Restricted collocations in English lexicology and lexicography. ITL 53: 53–61.

    Bardovi-Harlig, K. 2002. A new starting point? Investigating formulaic use and input in future expression. Studies in Second Language Acquisition 24: 189–198.

    Bates, E. and MacWhinney, B. 1987. Competition, variation, and language learning. In Mechanisms of Language Acquisition, B. MacWhinney (ed.), 157–193. Hillsdale NJ: Lawrence Erlbaum.

    Biber, D., Johansson, S., Leech, G., Conrad, S., and Finegan, E. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.

    Brown, R. 1973. A First Language. London: Allen and Unwin.

    Burger, H. 2003 (2nd ed.). Phraseologie: Eine Einführung am Beispiel des Deutschen. Berlin: Erich Schmidt Verlag.

    Cowie, A. P. 1998. Phraseological dictionaries: Some East-West comparisons. In Phraseology: Theory, Analysis, and Applications, A. P. Cowie (ed.), 209–228. Oxford: OUP.

    Cruttenden, A. 1981. Item-learning and system-learning. Journal of Psycholinguistic Research 10: 79–88.

    de Cock, S. 2000. Repetitive phrasal chunkiness and advanced EFL speech and writing. In Corpus Linguistics and Linguistic Theory, C. Mair and M. Hundt (eds), 51–68. Amsterdam: Rodopi.

    Ellis, N. C. 1996. Sequencing in SLA: Phonological memory, chunking, and points of order. Studies in Second Language Acquisition 18: 91–126.

    Ellis, N. C. 2002. Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition 24: 143–188.

    Erman, B. and Warren, B. 2000. The idiom principle and the open-choice principle. Text 20: 29–62.


    Foster, P. 2001. Rules and routines: A consideration of their role in the task-based language production of native and non-native speakers. In Researching Pedagogic Tasks: Second Language Learning, Teaching, and Testing, M. Bygate, P. Skehan, and M. Swain (eds), 75–93. Harlow: Longman.

    Granger, S. 1998. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis and Applications, A. P. Cowie (ed.), 145–160. Oxford: OUP.

    Howarth, P. 1996. Phraseology in English Academic Writing: Some Implications for Language Learning and Dictionary Making. Tübingen: Max Niemeyer.

    Irujo, S. 1986. A piece of cake: Learning and teaching idioms. ELT Journal 40: 236–242.

    Irujo, S. 1993. Steering clear: Avoidance in the production of idioms. International Review of Applied Linguistics in Language Teaching 31: 205–219.

    Jackendoff, R. 1995. The boundaries of the lexicon. In Idioms: Structural and Psychological Perspectives, M. Everaert, E. van der Linden, A. Schenk, and R. Schreuder (eds), 133–166. Hillsdale NJ: Erlbaum.

    Kecskes, I. 2003. Situation-Bound Utterances in L1 and L2. Berlin: Mouton de Gruyter.

    Kellerman, E. 1978. Giving learners a break: Native language intuitions as a source of predictions about transferability. Working Papers in Bilingualism 15: 309–315.

    Kuiper, K. 1996. Smooth Talkers: The Linguistic Performance of Auctioneers and Sportscasters. Mahwah NJ: Lawrence Erlbaum.

    Kuiper, K. and Haggo, D. 1984. Livestock auctions, oral poetry, and ordinary language. Language in Society 13: 205–234.

    Laufer, B. 2000. Avoidance of idioms in a second language: The effect of L1-L2 degree of similarity. Studia Linguistica 54: 186–196.

    Laufer, B. and Eliasson, S. 1993. What causes avoidance in L2 learning: L1-L2 difference, L1-L2 similarity, or L2 complexity? Studies in Second Language Acquisition 15: 35–48.

    Meara, P. 1987. Vocabulary in a Second Language: Vol. 2. London: Centre for Information on Language Teaching and Research (CILT).

    Meara, P. 1992. Vocabulary in a second language. Volume III 1986–1990. Reading in a Foreign Language 9: 761–837.

    Meara, P. The Vocabulary Acquisition Research Group Archive (VARGA). Internet resource: http://www.swan.ac.uk/cals/calsres/varga/index.htm. Accessed June 21, 2003.

    Mel'čuk, I. 1981. Meaning–text models: A recent trend in Soviet linguistics. Annual Review of Anthropology 10: 27–62.

    Mel'čuk, I. 1995. Phrasemes in language and phraseology in linguistics. In Idioms: Structural and Psychological Perspectives, M. Everaert, E. van der Linden, A. Schenk and R. Schreuder (eds), 167–232. Hillsdale NJ: Erlbaum.

    Moon, R. 1997. Vocabulary connections: Multi-word items in English. In Vocabulary: Description, Acquisition and Pedagogy, N. Schmitt and M. McCarthy (eds), 40–63. Cambridge: CUP.

    Nation, I. S. P. 1990. Teaching and Learning Vocabulary. New York: Heinle and Heinle.

    Nattinger, J. R. and DeCarrico, J. S. 1992. Lexical Phrases and Language Teaching. Oxford: OUP.

    Nelson, K. 1973. Structure and Strategy in Learning to Talk. Monographs of the Society for Research in Child Development, Serial no. 149, nos 1–2.

    Nelson, K. 1981. Individual differences in language development: Implications for development and language. Developmental Psychology 17: 170–187.

  • 22 Norbert Schmitt and Ronald Carter

    Pawley, A. and Syder, F. H. 1983. Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In Language and Communication, J. C. Richards and R. W. Schmidt (eds), 191–225. London: Longman.

    Peters, A. M. 1977. Language learning strategies: Does the whole equal the sum of the parts? Language 53: 560–573.

    Peters, A. 1983. The Units of Language Acquisition. Cambridge: CUP.

    Schmidt, R. W. 1983. Interaction, acculturation, and the acquisition of communicative competence: A case study of an adult. In Sociolinguistics and Language Acquisition, N. Wolfson and E. Judd (eds), 137–174. Rowley MA: Newbury House.

    Schmitt, N. 2000. Vocabulary in Language Teaching. Cambridge: CUP.

    Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: OUP.

    Sinclair, J. 2004. Trust the Text: Lexis, Corpus, Discourse. London: Routledge.

    Stubbs, M. 1995. Collocations and semantic profiles: On the cause of trouble with quantitative studies. Functions of Language 2: 1–33.

    Stubbs, M. 2002. Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.

    van Lancker, D., Canter, G. J., and Terbeek, D. 1981. Disambiguation of ditropic sentences: Acoustic and phonetic cues. Journal of Speech and Hearing Research 24: 330–335.

    Vihman, M. M. 1982. Formulas in first and second language acquisition. In Exceptional Language and Linguistics, L. K. Obler and L. Menn (eds), 261–284. New York: Academic Press.

    Weinert, R. 1995. The role of formulaic language in second language acquisition: A review. Applied Linguistics 16: 180–205.

    Wong Fillmore, L. 1976. The Second Time Around: Cognitive and Social Strategies in Second Language Acquisition. Unpublished PhD thesis, Stanford University.

    Wood, D. 2002. Formulaic language in acquisition and production: Implications for teaching. TESL Canada Journal 20: 1–15.

    Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: CUP.

    Wray, A. and Perkins, M. R. 2000. The functions of formulaic language: An integrated model. Language and Communication 20: 1–28.

    Zgusta, L. (ed.). 1971. Manual of Lexicography. The Hague: Mouton.

  • Measurement of formulaic sequences

    John Read and Paul Nation
    Victoria University of Wellington

    Introduction

    Most of the research on formulaic sequences until now, particularly that done before the advent of computers and the field of corpus linguistics, has primarily involved descriptive work to exemplify and classify multiword units which scholars have considered to function lexically rather than grammatically in the language. However, if work in this area is to advance and to move into the mainstream of applied linguistic research, it is necessary to address some important methodological issues that arise in the investigation of these lexical units. This chapter draws on insights from research methodology and language testing to identify particular problems of measurement in dealing with formulaic language and propose how they might be solved. We will illustrate some of our points by reference to the work reported in other chapters of this volume.

    One of the exciting developments in recent years is the realisation that formulaic sequences have been of long-standing interest to scholars in a whole variety of disciplines both inside and outside applied linguistics. Thus, in a sense, we are currently in a phase of surveying and attempting to integrate the insights that have been gained by researchers working in different fields all around the world without necessarily being aware of what others were doing. This is well illustrated by Wray's (2002) excellent book, which draws together work in general linguistics, phraseology, lexicography, corpus linguistics, first and second language acquisition, language teaching, neurolinguistics and other disciplines. It is important to note that scholars in these various fields not only bring their own theoretical perspectives to bear on the study of formulaic language but also have distinctive methodological approaches to their work. This of course is a familiar situation in an interdisciplinary field like applied linguistics, but what it means is that it would be unrealistic for us to attempt to impose a single research paradigm on the study of formulaic sequences. Thus, in this chapter we will attempt to focus on general principles and issues of measurement that need to be taken into account regardless of the particular research paradigm that the investigator is working within.

    Use of the term "measurement" may suggest that we favour quantitative or statistically based methods of investigation rather than qualitative ones. However, we are adopting a broad definition of measurement which includes criteria for the identification of multiword units as formulaic sequences and for classifying them into categories, even if no further counting of relative frequencies or any other form of statistical analysis is then applied. In addition, we argue that an adequate account of formulaic units as they function in language acquisition and language use can come only from a combination of quantitative and qualitative analyses. The same already applies, of course, in word-based vocabulary studies. Although it may seem quite straightforward to the naïve observer to identify and count words, linguists and vocabulary researchers are well aware of the problematic nature of the word as a linguistic concept. A purely formal definition of a word as word form is of limited value in itself, as illustrated by one of the early computer-based word frequency counts (Carroll, Davies and Richman, 1971), where "people", "People", "peoples", "Peoples", "peopled", "people's" and "People's" are all listed as separate items. Thus, vocabulary scholars have developed more meaningful conceptual units, such as the lemma, homonym, word family, lexeme or lexical unit, and the raw output of a frequency count needs to be classified at least partially by means of human judgement into one or more of these categories in order to be usable for further analysis. Some of these categories already involve units consisting of more than one word form, such as compound nouns, phrasal verbs and idiomatic expressions. Once we shift the attention to the whole range of multiword units, the basic elements are rather more difficult to identify than individual word forms are, and so more sophisticated procedures, both quantitative and qualitative, are required to locate and classify them.
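    The contrast between a raw word-form count and a count grouped into larger units can be made concrete with a minimal sketch. It is an illustration only, not the procedure used in the frequency count cited above: the sample sentence, the function names and the hand-made `family_of` mapping are all assumptions, and the mapping stands in for the human judgement that the text says is needed.

```python
from collections import Counter
import re


def count_word_forms(text):
    """Count every distinct word form exactly as spelled (case-sensitive)."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    return Counter(tokens)


def group_by_family(form_counts, family_of):
    """Merge word-form counts using a hand-made form -> family mapping.

    Forms missing from the mapping are kept as their own family.
    """
    families = Counter()
    for form, n in form_counts.items():
        families[family_of.get(form, form)] += n
    return families


# Hypothetical sample text and mapping, for illustration only.
text = "People said the peoples of the region were people too; the people's choice."
family_of = {
    "People": "people",
    "peoples": "people",
    "people's": "people",
    "people": "people",
}

forms = count_word_forms(text)
families = group_by_family(forms, family_of)

print(forms["People"], forms["people"], forms["peoples"], forms["people's"])
print(families["people"])
```

    In the raw count each spelling is a separate item with a frequency of one; only after the mapping is applied do the four forms contribute to a single unit.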

    In this chapter we first consider a definition of formulaic sequences and then look at reliability and validity issues in their identification, eventually focusing on the importance of triangulation. Finally, we consider the procedures used in several of the studies included in this volume.

    Definition of the construct

    In modern validity theory in educational measurement, a crucial initial step is to define the construct at a conceptual level. This then provides a basis for judging the adequacy of operational measures of the construct. In the case of formulaic sequences, Wray (2002: 9) has proposed a definition which is likely to be very influential but which also needs to be subject to critical scrutiny. If her definition is adopted, then the ultimate goal of an analysis will be to identify sequences that are stored and retrieved whole from memory at the time of use. This is a challenging goal because the means of storage and retrieval of the same sequence can differ from one individual to another, and can differ from one time to another for the same individual depending on a wide range of factors such as changes in proficiency, changes in processing demands, and changes in communicative purpose.

    There is some evidence for this variability from the study of idioms. Grant (2003) did an exhaustive study of what she called core idioms, which are non-compositional (the meaning of the parts does not give the meaning of the whole) and non-figurative (the image created by the unit does not relate to the meaning of the unit). They must also consist of words that can occur in other places. Grant found that English has about 104 core idioms. About 25% are frozen, and only 10 had a literal equivalent in the British National Corpus. Even among such a narrowly defined group of items, where we would expect to find extreme formulaicity, the norm seems to be that there is considerable variation. Here are some of the variants of the core idiom "pull someone's leg":

    pull my blue leg, somebody's leg was being pulled, having his leg pulled, leg pulling, a leg pull, a leg puller, tugged my leg, yank somebody's leg, leg tugged/yanked.

    There is a similar set of variants for put your foot in your mouth:

    put your foot in it, putting his foot in his mouth to the kneecap, put his foot well and truly in his mouth, with her foot in his mouth, foot and mouth, foot-in-mouth moments, foot-and-mouth soldiers, put your feet in your mouth.

    Most of these are low in frequency but there is a lot of variation, even without considering the numerous versions of the object or verb form. This variability does not, however, prove that every use of the idiom is non-formulaic. It is clear that some of the variations are deliberate attempts to add humour by playing with something that is typically fixed. The evidence from the study of core idioms suggests that there are probably very few sequences, if any, that are always formulaic, and thus the most valid criteria for deciding formulaicity will be those that take account of features that are present in each particular use of a possible sequence.

    Wray's (2002: 9) definition of formulaic sequences is deliberately inclusive. It goes only a short way towards specifying the form in which a sequence is stored and it states explicitly that the sequence need not be continuous. That is, there may be insertions in it, such as when "right bloody" is inserted into "came a cropper": "came a right bloody cropper". The definition also seems to exclude substitution of items within a sequence, such as the following variations within the "pull" and person components of "pulling my leg":

    pull         his
    pulled       her
    pulls        my
    pulling      your             leg
    yank etc.    our
    tug etc.     someone's
                 his sister's

    Similarly, transformations of a sequence would not be included: chew the fat, fat-chewing, fat-chewers. These substitutions and transformations would be excluded because they would involve generation or analysis by the language grammar (Wray 2002: 9).
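    The difference between substitution within slots and transformation of the whole sequence can be sketched as a search pattern. The sketch below is an assumption made for illustration, not a procedure from Wray or Grant: the verb and possessor alternatives are taken from the variants quoted above, and the names `VERBS`, `POSSESSORS` and `find_variants` are hypothetical. A slot pattern of this kind finds substituted variants but misses transformed ones.

```python
import re

# Verb slot: pull/yank/tug with common inflections (assumed list).
VERBS = r"(?:pull(?:s|ed|ing)?|yank(?:s|ed|ing)?|tug(?:s|ged|ging)?)"

# Person slot: a possessive pronoun, or a two-word possessive such as "his sister's".
POSSESSORS = r"(?:my|your|his|her|our|their|someone's|somebody's|\w+ \w+'s)"

PATTERN = re.compile(rf"\b{VERBS} {POSSESSORS} leg\b", re.IGNORECASE)


def find_variants(text):
    """Return every contiguous 'VERB POSSESSOR leg' string found in text."""
    return [m.group(0) for m in PATTERN.finditer(text)]


# Substitutions in either slot are matched.
substituted = find_variants("He pulled his sister's leg and she tugged my leg.")

# A passive transformation reorders the parts, so the slot pattern misses it.
transformed = find_variants("He hates having his leg pulled.")

print(substituted)
print(transformed)
```

    The first call returns two matches and the second returns none, which mirrors the point that each additional kind of variation requires either a more abstract pattern or human inspection.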

    The definition does not specify the form of the items in storage. If it is verbatim storage, where the actual words of the sequence are stored without the possibility of substitution or transformation, then Grant's (2003) research suggests we are dealing with only a small number of sequences that are rather infrequent. This definition of a formulaic sequence is one that Kuiper (this volume) seems to follow. It is relatively easy to identify such sequences because of their fixed form, and most researchers would readily consider them formulaic. However, much further along a possible scale of formulaicity are the numerous examples of collocational prosody such as "bordering on", where the formula is at a rather abstract level. These sequences allow insertion, inflection, substitution, deletion, and transformation, which all involve generation or analysis by the language grammar. The term "formulaic sequence" could not be sensibly applied to such patterns.

    Thus, Grant's (2003) findings challenge the adequacy of Wray's definition of the construct. The interest in formulaic sequences is partly a reaction to the lack of description of semantic patterning in previous descriptions of language. However, semantic patterning and formulaic sequences are not the same thing and so the definition needs to take account of this distinction if it is to be comprehensive enough to cover the phenomena to be investigated. Given the variability in formulaic language that we noted above, the definition of these sequences may need to be tailored to some degree to the specific objectives of each research study.

    Sources of evidence

    Once conceptual issues have been addressed, an essential requirement for the identification of formulaic sequences is to have a source of examples of multiword units for analysis. From a measurement perspective, the key issue in choosing a suitable source is one of sampling: how to ensure that there are sufficient examples to allow reliable generalisations to be made and, where applicable, that the sample is representative enough to provide the basis for a valid classification system.

    There is a long-standing practice among grammarians and linguists of building up a collection of examples of idioms or other formulaic sequences, based on their own introspective knowledge of the language plus instances that they encounter through their reading, conversational interaction and other communicative activities in the language. Some scholars such as Pawley and Syder (1983) and Nattinger and DeCarrico (1992) adopted a more structured approach, drawing on transcriptions of spoken discourse and/or written texts of various kinds but without giving specific details of the scope of the source material. Their work has proved to be very important in applied linguistics in drawing attention to the pervasiveness of formulaic sequences and highlighting the variety in both the forms they take and the functions they perform. However, in sampling terms, this general approach will typically create a convenience sample, which is subject to uncontrolled bias. For work in this area to advance, it is necessary to complement such informal collections of examples with more systematic data-gathering procedures that can challenge the perceptions of individual investigators.

    The obvious source of more systematic evidence is some kind of text database. These now commonly take the form of computer corpora, providing very large samples of language, which can then be searched in an efficient manner. Corpus software generates frequency counts and a whole variety of other quantitative measures. In addition, it can supply lists of words and word strings that meet particular specifications as the basis for qualitative analyses of idiomaticity, semantic transparency, semantic vs. pragmatic meaning, and so on.

    There are a number of options when it comes to the choice of a corpus for the analysis of formulaic sequences.

    Large general corpora

    Mega-corpora such as the Bank of English and the British National Corpus lend themselves well to certain kinds of research on formulaic sequences, for similar reasons to the enormous contributions they have made to lexicography, word-based vocabulary studies, and descriptive grammars, among others. However, depending on the particular focus of the research, they also have some limitations.

    There is bias in the sample of texts they include. The most obvious one is that spoken language is underrepresented, but there is also bias in style (overrepresentation of formal, informative prose) and genre (journalistic texts in the Bank of English).

    Even in such large corpora, particular kinds of formulaic sequence may have quite low frequency, as Moon (1998) found in her research on idioms, proverbs and similes.

    Although corpus software is getting more sophisticated all the time, there are still limits on what it can find in a large corpus.

    The particular kinds of text that are of interest (e.g. learner language; storytelling to schoolchildren) may not be in the corpus at all.

    Specialized corpora

    There is a fast-growing number of more specialized corpora which offer opportunities to investigate formulaic sequences in more particular varieties of language. These include corpora of spoken language (the London-Lund Corpus; the Cambridge and Nottingham Corpus of Discourse in English, CANCODE), learner language (the International Corpus of Learner English, ICLE), child language (the Child Language Data Exchange System, CHILDES), regional varieties (the International Corpus of English, or ICE, corpora; the Brown corpus of American English and the various parallel corpora of other national varieties), and discipline-specific corpora.

    The issues involved in selecting a particular corpus include considering whether the corpus fits the particular requirements of a proposed formulaic sequence study, whether it is accessible to researchers other than the original compilers, whether the corpus is large enough to satisfy reliability requirements, and whether certain crucial kinds of information about the texts are available in the corpus, for example, the specific sources of written texts or particular phonological notation for oral texts. Given the pragmatic dimension to the meaning of many formulaic sequences, especially in oral language use, the researcher may require richer contextual information than the corpus provides.

    A further category includes collections of written or oral texts that may not be thought of as constituting a corpus, such as the transcripts from the Skehan and Foster research on task-based language learning, which were reanalysed by Foster (2001).

    Purpose-built databases

    If existing corpora do not meet the research requirements, it will be necessary to build a set of data from scratch. This does not necessarily involve compiling a whole corpus (whatever the minimum dimensions of that might be). It may simply be the kind of data-gathering that sociolinguists, discourse analysts and others routinely engage in to collect samples of language use, either by unobtrusive recording of natural speech events or by elicitation procedures. Kuiper's studies of race callers, auctioneers and checkout operators are good examples of these (see Chapter 3).

    Procedures for identification and classification

    As previously indicated, in its present stage of development the study of formulaic sequences still faces fundamental problems in identifying the units of analysis within a database or corpus. Wray (2002: Chap. 2) gives a comprehensive discussion of the criteria that have been proposed or applied in previous research. We will summarize the criteria here and explore the measurement issues.

    Intuition

    The status of the intuition of an individual investigator is dubious from a modern scientific perspective. The exercise of this kind of subjective judgement is likely to be more acceptable if one or more of the following conditions apply:

    a definition of what is meant by a formulaic sequence is carefully formulated in advance, as previously discussed;

    the investigator communicates the definition to a second person, who then attempts to replicate the investigator's identification of the formulaic units;

    instead of relying on the researcher's judgement, a panel of judges is formed to analyse the database and a multiword unit is accepted as formulaic only when most, if not all, the judges identify it as such.

    In other words, what is required is intersubjectivity or, in measurement terms, a high degree of inter-rater reliability.
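    One common way to put a number on agreement between two judges is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The chapter does not name a particular statistic, so the choice of kappa here is an assumption, as are the function name and the invented ratings (1 = judged formulaic, 0 = judged not formulaic) for ten candidate sequences.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges rating the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both judges must rate the same non-empty set of items")

    n = len(ratings_a)
    # Proportion of items on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected if each judge assigned labels independently
    # at his or her own observed rates.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )

    if expected == 1.0:
        # Both judges used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented judgements on ten candidate sequences, for illustration only.
judge_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
judge_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

kappa = cohens_kappa(judge_a, judge_b)
print(round(kappa, 3))
```

    Here the judges agree on 8 of 10 items (0.80), chance agreement is 0.48, and kappa is about 0.615, which shows how raw percentage agreement can overstate reliability.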

    Nevertheless, as Wray (2002: 20–25) points out, even meeting these basic conditions is not straightforward in the case of formulaic language. Corpus linguists such as Sinclair (1991) argue that their research reveals intuition to be a very fallible means of investigating the facts of language use, with regard to the relative frequency of linguistic features, typical meanings of lexical items, characteristic patterns of collocation, and so on. Secondly, in the context of second language acquisition research, the native speaker intuitions of the researcher are often brought to bear to account for the language production of learners, who may or may not have an intuitive basis for what they say or write in the second language. This means that the formulaic status of sequences in learner language is even more difficult to establish by means of intuition than in the case of native speaker production. A third difficulty identified by Wray is that recognition of formulaic language may depend on the shared knowledge which comes from membership of a particular speech community rather than being universal among users of the language concerned. This represents just one more limitation on the value of intuition as an investigative procedure.

    Corpus analysis

    Computer corpus analysis has added a powerful new tool to the range of procedures available for the study of formulaic sequences. Moving beyond the concept of locating and counting individual word forms, corpus software can search for specified headwords, combinations of words and even discontinuous sequences of words. Thus, if the investigator can specify particular words or word strings that are potentially formulaic (or known to be so on the basis of other evidence), the software can instantly assemble all of the examples in the corpus for inspection and further analysis. An alternative approach is a purely statistical procedure that identifies sequences of two, three or more words that regularly co-occur throughout the corpus beyond a threshold level of probability. This second approach has produced a great deal of data that turns out not to be formulaic, depending on the definition of formulaic language adopted, but on the other hand it has shown its potential to give new insights into multiword units that are not available through intuition. In both cases, the quantitative evidence supplied by the software needs to be evaluated by the application of human judgement to determine which of the word sequences are formulaic and, if a classification system is involved, which ones fit in which categories.

    Concordance software such as that included in WordSmith Tools and SARA can be used to find collocational clusters in corpus data. The most flexible software allows the researcher to specify a search word or words and to gather and count the occurrences of collocates for several positions on either side of the search node. Such software is an extremely valuable tool for research on formulaic language. However, it is essential for the researcher to examine each instance of the data to make sure that it is relevant. One way to demonstrate this point is by means of a training exercise employing the SARA software on the British National Corpus. The task is to use corpus data to answer the question, "Are men beautiful?" That is, do "men" and "beautiful" collocate? A corpus search with "men" as the node and "beautiful" as the collocate, using a span of 6 to the left and 6 to the right, found 38 instances. In only five of these were they really collocates. A more limited search of the same corpus using 3 to the left and right produced ten instances, of which only four were collocates. Excluding right-hand occurrences of "beautiful" would not change the result substantially. Here are the ten instances.

    to see if she were as beautiful as men told
    who felt the need to dress up and be beautiful for their men
    made love to the most brilliant and beautiful men of your generation
    Next to him were two brothers, tall beautiful men with liquid eyes
    There are some beautiful men's clothes around
    You are so beautiful that men would die for you
    stunningly beautiful to boot. Men would
    Men and beautiful women also join in.
    If you were in Prague, two beautiful men like you,
    There are some very beautiful young men there.

    Clearly, valid cluster analysis requires manual checking of the data.

    Another limitation of concordance software is that it can automatically locate only contiguous sequences. In order to locate non-contiguous ones, it is necessary for the researcher to enter in the search request either a contiguous subpart of the whole sequence or at least one key lexical component of it. This of course assumes the whole sequence is already known to be formulaic. It is very likely that a substantial proportion of the formulaic language in English remains to be discovered; the non-contiguous nature of the sequences involved means that they fall below the threshold of recognition, whether it be by human intuition or automated computer search.
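    The span-based search described in the "men"/"beautiful" exercise can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the behaviour of SARA or WordSmith Tools: the function names are hypothetical, each concordance line is treated as its own text so that the window does not cross line boundaries, and the three sample lines are taken from the ten instances quoted above.

```python
import re


def tokenize(text):
    """Lower-case a line and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def collocate_hits(tokens, node, collocate, span):
    """Return (node_index, collocate_index) pairs where the collocate
    occurs within `span` tokens to the left or right of the node."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == collocate:
                hits.append((i, j))
    return hits


def lines_with_hits(lines, node, collocate, span):
    """Return the lines in which node and collocate fall inside the span."""
    return [
        line for line in lines
        if collocate_hits(tokenize(line), node, collocate, span)
    ]


sample_lines = [
    "You are so beautiful that men would die for you",
    "There are some very beautiful young men there.",
    "Men and beautiful women also join in.",
]

wide = lines_with_hits(sample_lines, "men", "beautiful", span=3)
narrow = lines_with_hits(sample_lines, "men", "beautiful", span=1)

print(len(wide), len(narrow))
```

    With a span of 3 all three lines are returned, although arguably only the second has "beautiful" describing "men"; with a span of 1 none is returned. Proximity alone therefore both over-collects and under-collects, which is why each retrieved instance still has to be read by the researcher.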

    In addition to the limitations of corpus anal