triiii


  • 8/22/2019 triiii


    1 Methodology
    1.1 Corpus
    1.2 Lexical unit
    1.3 Pedagogy
    2 Effects of word frequency
    3 Languages
    3.1 English
    3.1.1 Traditional lists
    3.2 French
    3.3 Spanish
    3.4 Chinese
    4 References
    5 Sources

    Methodology

    Nation (Nation 1997) noted how much computing power has helped, making corpus analysis much easier. He cited several key issues which influence the construction of frequency lists:

    corpus representativeness
    word frequency and range
    treatment of word families
    treatment of idioms and fixed expressions
    range of information
    various other criteria
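
    Two of the criteria above, word frequency and range, can be made concrete with a short sketch. The Python example below is a minimal illustration, not the procedure used by any of the cited studies. It assumes the corpus is a list of text strings, uses a deliberately naive tokenizer (`tokenize`, a hypothetical helper), and takes "range" to mean the number of texts in which a word occurs at least once.

```python
from collections import Counter
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_and_range(texts):
    """Return two Counters for a list of texts.

    frequency[w]  = total occurrences of w across all texts
    text_range[w] = number of texts in which w occurs at least once
    """
    frequency = Counter()
    text_range = Counter()
    for text in texts:
        tokens = tokenize(text)
        frequency.update(tokens)        # every occurrence counts
        text_range.update(set(tokens))  # each text counts once per word
    return frequency, text_range


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked",
]
frequency, text_range = frequency_and_range(corpus)
```

    In this toy corpus "the" is frequent but appears in only two of the three texts, which is why a list builder may want both numbers rather than frequency alone.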

    Corpus

    Traditional written corpus

    Main article: Text corpus

    Most currently available studies are based on written texts.

    SUBTLEX movement

    However, New et al. 2007 proposed tapping into the large number of subtitles available online to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach and supported a move toward analysing speech and film subtitles available online. This has recently been followed by a handful of follow-up studies, providing valuable frequency counts for various languages. Indeed, within five years the SUBTLEX movement completed full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al.), and Vietnamese (Pham, Bolger & Baayen 2011).

    Lexical unit


    In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus, possible, impossible, possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, all these words are summed up under the base word form *possib*, allowing the ranking of a concept and form occurrence. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given chain of several characters can be interpreted either as a phrase of single-character words or as a single multi-character word.

    Pedagogy

    These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's modern language teaching summary encourages first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).

    Effects of word frequency

    Word frequency is known to have various effects (Brysbaert et al.; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is subject to more exposures (Laufer 1997). Lexical access is positively influenced by high word frequency (Segui, Mehler, Frauenfelder & Morton 1982).

    Languages

    Below is a review of available resources.

    English

    Word counting dates back to Hellenistic times. Thorndike & Lorge, assisted by their colleagues, counted 18,000,000 running words to provide the first large-scale frequency list in 1944, before modern computers made such projects far easier (Nation 1997).

    Traditional lists

    The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)

    The TWB contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18,000,000 written words was analysed by hand. The size of its source corpus increased its usefulness, but its age and subsequent language change have reduced its applicability (Nation 1997).

    The General Service List (West, 1953)

    The GSL contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5,000,000 written words was analysed in the 1940s. The rate of occurrence (%) is provided for the different meanings and parts of speech of each headword, and the list was compiled with careful application of criteria other than frequency and range. Thus, despite its age, some errors, and its solely written base, it is still an excellent database (word frequency, frequency of meanings, reduction of noise) (Nation 1997).

    The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)


    This list is based on a corpus of 5,000,000 running words from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials and in its tagging of words, namely the frequency of each word in each of the school grade levels and in each of the subject areas (Nation 1997).

    The Brown (Francis and Kucera, 1982), LOB and related corpora

    These now contain 1,000,000 words from written corpora representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).

    French

    Traditional datasets

    A review has been made by New & Pallier 3.01. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules.[1] It is claimed that 70 grammatical words constitute 50% of communicative sentences,[2] while 3,680 words give about 95–98% coverage.[3] A list of 3,000 frequent words is available.[4]
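
    Coverage figures such as those above are the cumulative share of running words accounted for by the top-ranked entries of a frequency list. As a hedged illustration only, the sketch below computes that share in Python. The function name `coverage` and the toy counts are assumptions made up for this example, not data from the Français fondamental.

```python
from collections import Counter


def coverage(frequency, n):
    """Share of all running words accounted for by the n most frequent words."""
    total = sum(frequency.values())
    if total == 0:
        return 0.0
    top = frequency.most_common(n)  # list of (word, count), highest count first
    return sum(count for _, count in top) / total


# Toy counts, invented for illustration only (not real French data).
toy = Counter({"de": 50, "la": 30, "et": 10, "chat": 6, "maison": 4})

share_top2 = coverage(toy, 2)  # (50 + 30) / 100
share_all = coverage(toy, 5)   # every word included
```

    Applied to a real frequency list, the same calculation shows how quickly coverage rises over the first few dozen grammatical words and how slowly it rises afterwards.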

    The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[5] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[6]

    More recently, the Lexique 3 project provided a list of 135,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number, frequency, associated lexemes, etc., available under an open-source license.[7]

    Subtlex

    New et al. 2007 made a completely new count based on online film subtitles.

    Spanish

    There have been several studies of Spanish word frequency (Cuetos et al. 2011).[8]

    Chinese

    As frequency toolkits, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high and medium frequency words in the People's Republic of China, and the Republic of China (Taiwan)'s TOP list of about 8,600 common traditional Chinese words, are two other lists of common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 recently made a rich study of Chinese word and character frequencies.

    References