https://lextutor.ca/nancy_present.pdf
Word families vs. lemmas as the counting unit in text coverage research – summary of the debate and resolution
• The acquisition of vocabulary is primary in all aspects of language learning, and vocabulary is only manageable through computational analysis of spoken or written texts or corpora. This presentation will look at some issues in analysing vocabulary and learning to read in English.
• Text analysis requires some sort of grouping of words, the two main grouping principles being the word family and the lemma. Families include inflected and derived forms (analyse and analysis); lemmas include only inflected forms (analyse and analyses). The word family is a pedagogical extension of the lemma, and has been used extensively in testing and coverage research into the amount of vocabulary that must be known for various kinds of reading.
• Lately, however, this research has been challenged by supporters of the lemma, on the grounds that learners cannot be assumed to know all the derived forms of a typical word family. But do they have to? Are there enough derived forms in use in typical texts to affect the coverage research results? Morpholex is a text analysis program that was developed to answer these questions.
• The presentation will be mainly in and about English, but connections to French and learning French will be elaborated.
Word families vs. lemmas as the counting unit in text coverage research: summary of the debate and resolution
• The acquisition of vocabulary is primary in all aspects of language learning, and vocabulary is only manageable through computational analysis of spoken or written texts or corpora. This presentation will address some problems in analysing vocabulary and in learning to read in English.
• Text analysis requires some sort of grouping of words, the two main grouping principles being the word family and the lemma. Families include inflected and derived forms (analyse and analysis), lemmas only inflected forms (analyse and analyses).
• The family is a pedagogical extension of the lemma and has been widely used in testing and in coverage research into the amount of vocabulary that must be known for different kinds of reading.
• Recently, however, this research has been challenged by supporters of the lemma, on the grounds that learners cannot be assumed to know all the derived forms of a typical word family. But do they have to? Are there enough derived forms in use in typical texts to affect the coverage research results? Morpholex is a text analysis program developed to answer these questions.
• The presentation will be mainly in and about English, but connections to French and the learning of French will be developed.
Outline of presentation
Definitions
• Family
• Lemma
• Coverage
Family’s role in coverage research
The lemmatizers’ challenge
• Their coverage methodology
• Retooled
Methodology
• Development of Morpholex
• Making of a mini-corpus
• Typical reading materials at 4 levels
Results
• What proportion of these texts are derived forms?
• How many individual derived forms are involved?
Next chapter
• Nuclear lists
Background issues
Interpreting corpora in language pedagogy
Complementarity of corpus + empirical findings
Language teachability
• And raising it through text analysis
Coding as research
Two kinds of groupings ~ Family v. lemma
…expanded into lists of 1,000 (families or lemmas)
1k by family: a, an, able, ability, abler, ablest, ably, abilities, unable, inability, about, absolute, absolutely, absolutist, absolutists, accept, acceptability, acceptable, acceptably, unacceptable, acceptance, accepted, accepting, accepts, unacceptably, account, accounted, accounting
1k by lemma: a, an, able, abler, ablest, about, absolute, absolutes, absolutest, accept, accepted, accepting, accepts, account, accounts, accounted, accounting, achieve, achieved, achieving, achieves, across, act, acts, acted, acting, active, actives
Lemma is
• Base (or head) word
• + Inflections
Family is
• Base (or head) word
• + Inflections
• + Derivations
Derivation can involve
• Change of POS (able → ability)
• Change of meaning (able → unable)
• Big change to base word (able → ability)
• Few ‘rules’
Inflection involves none of these (able → abler)
• Hence ‘easier’
Where do all these variant forms come from?
• A corpus
• Frequency > x
And in French ~
French still uses lemmas (though maybe not for long)
• âgée, though similar, is at a different k-level
• école does not include écolier
• économie and économique are both 1k, but count as 2 items
Why do we need groupings?
Computationally
• A modern corpus is millions of individual words (‘tokens’)
• Impossible to manage individually
Pedagogically
• If a learner knows “cat” there is no reason to treat “cats” as a new word
• But: “catty”?
  • Syntax: different part of speech
  • Semantics: rarely applied to cats themselves
  • Students may know cat and not really know catty
• It is a question of how much to include in the groupings
Why do we need groupings?
• To profile the frequency of words in texts
  • in a clear and useful manner
• And discover the lexical challenge of different texts
  • Especially in conjunction with vocabulary tests employing the same measure
• What do you notice in this profile?
Coverage – the magic numbers 95 and 98
• Coverage = the percentage of the word tokens in a text or corpus that a given word list ‘covers’ (accounts for, contains)
• The coverage points of pedagogical interest have been determined by empirical (not computational) research:
  • Texts can be comprehended with resources when 95% of individual word tokens are known
    • 95% typically corresponds to 5,000 families known
  • Texts can be comprehended independently when 98% of individual word tokens are known
    • 98% typically corresponds to 8,000 families known
• Those are the words learners need
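As a rough illustration of the coverage definition (a minimal sketch, not Lextutor's own code): coverage is the share of a text's tokens that appear in a word list. The function name `coverage`, the toy sentence and the toy word list are all assumptions made for this example.

```python
def coverage(tokens, known_words):
    """Return the share of tokens (0.0 to 1.0) that appear in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_words)
    return known / len(tokens)


# Toy data, invented for illustration only.
text = "the cat sat on the mat and the cats purred".split()
word_list = {"the", "cat", "cats", "sat", "on", "and"}

print(f"{coverage(text, word_list):.0%}")  # 8 of 10 tokens are in the list
```

Whether "cats" counts as known once "cat" is known is exactly the grouping question: here both forms are simply listed.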
Average text
• 95% = c. 5,000 word families
• 98% = c. 8,000 word families
So is this a ‘difficult’ or ‘easy’ text?
How many words do learners know?
For this we use receptive, family-based, k-level testing
Typically, % score at a level × 1,000 families = learner’s receptive knowledge at that level
Now we can match ‘knows’ with ‘needs to know’ for particular texts (types)
For example
Suppose a learner’s score on the VST is this:
• 1k = 80%
• 2k = 70%
• 3k = 0%
And the text he is reading profiles like this:
• 80% at 1k
• 10% at 2k
• 10% at 3k
(A typical situation)
Then his 1k knowledge gives him 80% × 80% = 64% of the words in the text
And his 2k knowledge gives him 70% × 10% = 7% of the words in the text
So this learner is reading a text with 64% + 7% = 71% of its words known
• While research suggests 95% is minimally needed
• What does it feel like to read a text at 71% coverage? →
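The arithmetic of this example can be written out as a short sketch. The function name `known_coverage` and the dictionary layout are assumptions made for illustration; the numbers are the ones on the slide.

```python
def known_coverage(test_scores, text_profile):
    """Sum, over k-levels, of (share of the level known) x (share of the text at that level)."""
    return sum(
        test_scores.get(level, 0.0) * share
        for level, share in text_profile.items()
    )


scores = {"1k": 0.80, "2k": 0.70, "3k": 0.0}    # learner's test score per level
profile = {"1k": 0.80, "2k": 0.10, "3k": 0.10}  # share of text tokens per level

print(round(known_coverage(scores, profile), 2))  # 0.71
```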
71% coverage is far from 95%
This research, however, is based on the family as the unit of word-counting
• It would not be the same if we used the lemma as the counting unit
• Look at 2 profiles for the same text (Rex M.) →
• The difference being (presumably) ± the derived word forms
As in text, so in a corpus
Fams & lems: Peda-Pros & Cons
FAMILY
• Whole family is together in one place
• Lemma will put very similar forms in widely separated lists
  • ADAPT k=3
  • ADAPTATION k=6
  • ADAPTABLE k=12
• Seems inefficient
• Lemma underestimates the learner
FAMILY
• A small number of family lists cover any text or even corpus
• Exhaustive - every word will get classified by the profiler
• While lemma needs many lists to profile a text
  • 50% more
  • With lots of redundancy
  • Probably unusable in practice
• But family overestimates the learner?
The issue: 100 of the best L2 reading studies in recent years…
• Are based on the word family
  • E.g., Nation, Laufer, Schmitt, Grabe, …
• But is it a convenience to use this unit or a principled decision?
  • Family = smaller set of lists; tidier computer output; easier for practitioners to understand; matches common sense (1k = speech, >3k+ = text, etc.)
A group of researchers, primarily in Japan…
• Working with learners who have little contact with English outside the classroom
  • And a very examination-driven approach to language learning
• These researchers strongly dispute the use of the word family
• They argue that no knowledge beyond the lemma can be assumed in their learners
  • run, runs, running
  • runner ×, a run ×
• And therefore the coverage research, based on assumed word-family knowledge, does not describe most Japanese learners
  • Or many other learners worldwide
→ A serious problem
Which has occupied vocabulary research conferences for the past five years
The problem could be language interference
• L1-L2 differences in affixation
• Japanese affixation is like compounding?
  • Both parts remain identifiable
• English can twist the base word quite severely to add an affix
  • Able-ability
  • Particularly affecting the pronunciation
  • Pronounce – pronunciation
  • Such that the base word becomes less identifiable
• French may be less problematic here
  • “L’accent tonique” means the base word is not lost?
  • In orthography or pronunciation
  • Fr: Science-scientifique
  • Eng: ScIence-scIenTIFic
Fam v. Lem could have been just one more interminable debate…
Of the type we know so well
Except that one of the Japan researchers, Dale Brown, pointed to a way forward
• Brown asked whether/how much derived forms are in fact used in texts
  • Hoping to show they are used a great deal
  • To explain his learners’ weak reading ability
• Specifically, are derived forms needed to reach the 95% and 98% coverage points?
  • If yes, then the family-based coverage research does not apply to learners who know only lemma forms
  • For the words they know at all
• A brilliant idea to measure word forms’ contributions to coverage
  • Except that Brown didn’t go all the way
What Brown did
Method ~
For the first 5,000 head words in Nation’s BNC-COCA family lists ~
• He took a random 100-head-word sample from each 1,000
  • (= 500 head words total)
• Then looked up all the forms for each family on the online British National Corpus
  • Several look-ups for each family, since BNC is lemmatized
  • … adding up the frequency figure for all inflected and derived forms in each family
• (This must resemble Paul Nation’s original fleshing out of these lists as families, but for 500, not 25,000, words)
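The sampling step can be sketched as follows. This is only an illustration under stated assumptions: `head_words` is a placeholder for the frequency-ordered head words (the real BNC-COCA lists are not reproduced here), and the function name and fixed seed are hypothetical, not Brown's actual procedure.

```python
import random


def sample_head_words(head_words, band_size=1000, per_band=100, seed=0):
    """Draw a random sample of per_band head words from each band_size-word band."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    sample = []
    for start in range(0, len(head_words), band_size):
        band = head_words[start:start + band_size]
        sample.extend(rng.sample(band, min(per_band, len(band))))
    return sample


# Placeholder data standing in for the first 5,000 head words.
head_words = [f"word{i}" for i in range(5000)]
picked = sample_head_words(head_words)

print(len(picked))  # 500
```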
Example
Form           BNC frequency
Able           29,657
Ability         9,054
Abilities       1,324
Abler               0
Ably               96
Unable          6,134
Inability       1,087
Inabilities         5
TOTAL          47,357
So
• Total word-forms: 47,357
• Derived forms: 17,700 (all but ‘able’ and ‘abler’)
• Percent derived forms: 37.38%
And so on for 500 random families
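The calculation for this one family can be reproduced in a few lines. The frequencies are those shown for ‘able’; the variable names are assumptions made for the sketch, and this is not Brown's own script.

```python
# BNC frequencies for the 'able' family, as given in the example.
freqs = {
    "able": 29657,
    "ability": 9054,
    "abilities": 1324,
    "abler": 0,
    "ably": 96,
    "unable": 6134,
    "inability": 1087,
    "inabilities": 5,
}

# Forms that belong to the lemma (base word + inflections); all others count as derived.
lemma_forms = {"able", "abler"}

total = sum(freqs.values())
derived = sum(n for form, n in freqs.items() if form not in lemma_forms)

print(total, derived, f"{derived / total:.2%}")  # 47357 17700 37.38%
```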
So if c. 35% of a given family in the British National Corpus consists of derivations…
• Then a learner who knew all the words in the corpus
  • But only as lemmas, not as derivations
  • Would be reading this corpus with c. 65% coverage
  • And comprehension would be low
• But: do learners read a corpus?
  • No
• Can what’s in a corpus be extrapolated to what’s in its constituent texts?
  • A good question
  • A clue to its answer is in another part of the BNC output →
BNC is c. 100 million word tokens
Comprising c. 4,000 texts
• in 100 text-types
• of c. 25,000 words each
• Output tracks each search-term back to individual texts
  • So derived form ‘ability’ is in 2090/4048 = 52% of BNC’s texts
• So far, so good for Brown’s argument
But BNC also gives the distribution of these texts in different parts of the corpus
Ex, ‘Ability’
• Barely present in speech, or in fiction
  • Fiction: 438 hits in > 16 million words
  • News: 718 hits in > 9 million words
• This raises the question
  • Can whole-corpus coverages be described as general?
  • And specifically: do they represent the type of texts that ESL learners typically read?
How can we track derivations in texts that learners typically read?
• Finding typical texts is no problem
  • Beginners: graded readers
  • Intermediates: novels and newspapers
  • Advanced and non-native TESL trainees: academic research articles
But how to count up the proportion of derived forms in these texts?
• Specifically, how to determine if there are enough to undermine 95% and 98% coverage for learners who know only lemmas
  • (For the words they know)
Enter Morpholex
Formerly a minor Lextutor routine
• A “list profiler”
  • For cracking a family (or lemma) list into its base words and various morphologies
• Example: here is k2 at Bauer and Nation Level 2 (with just inflected forms) →
Extended in 2019 from list profiler to text profiler
To note:
• Count (# tokens) and coverage (% of tokens) for each level are given
  • Ex, Level 3: 12 derivs, 1.7% of text
• Cumulative %’s are given, indicating the point where 95% and 98% are met/surpassed
• Particular affixes are identified that were needed to reach 95% and 98%
  • Ex, Base + Inflect + 1 deriv affix (~ion) were needed to reach 95%
Baseword check
• A few errors
  • 6/695 words (1%)
• Many derivations where it is not the case that a baseword could have been known without the affix
  • formidable - able = formid ?
Definitions & program design (= method)
Levels 1-7?
• These are Bauer & Nation’s (1993) framework for identifying morphology levels
  • By frequency, transparency, regularity, degree of change imposed on the base word - ‘difficulty’
LEVELS
• 1 = base words
• 2 = inflected forms
• 3-7 = derived forms
Total number in the B&N scheme: 100 (not exhaustive)
The Morpholex organigram
Level by level, 1-7 ~
• Each word is matched against a list of B&N prefixes and suffixes
• If an affix is found, the program asks: is the remainder without this affix present in a list of all possible words?
  • With adaptations like ‘stunn’ → ‘stun’ so that ‘stunning’ minus ‘~ing’ is a real word
• If yes, it’s an inflected or derived form
• If no, it’s a base word
• I.e., it’s considered a derived form only if learners could have recognized the base word if not extended into a derived form ***
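The decision procedure just described might look roughly like this in code. It is a minimal sketch under loud assumptions, not Morpholex itself: the affix lists and the lexicon are tiny invented subsets, the spelling adaptations shown (undoing consonant doubling and y → i) are only two plausible ones, and the level-by-level bookkeeping (inflection v. derivation, Levels 2-7) is left out.

```python
# Tiny illustrative subsets; the real B&N scheme has about 100 affixes.
SUFFIXES = ["able", "ing", "ness", "ly", "s"]
PREFIXES = ["un", "non"]
LEXICON = {"stun", "happy", "kind"}  # stands in for "a list of all possible words"


def find_base(word, lexicon=LEXICON):
    """Return (base, affix) if removing one affix leaves a real word, else None."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            candidates = [stem]
            # Undo consonant doubling: 'stunn' -> 'stun'
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.append(stem[:-1])
            # Undo y -> i before a suffix: 'happi' -> 'happy'
            if stem.endswith("i"):
                candidates.append(stem[:-1] + "y")
            for candidate in candidates:
                if candidate in lexicon:
                    return candidate, "~" + suffix
    for prefix in PREFIXES:
        remainder = word[len(prefix):]
        if word.startswith(prefix) and remainder in lexicon:
            return remainder, prefix + "~"
    return None  # no recognizable base: treat the word itself as a base word


for w in ["stunning", "happiness", "unkind", "formidable"]:
    print(w, find_base(w))
```

‘formidable’ comes back as a base word because ‘formid’ is not in the lexicon, which is the behaviour the last bullet asks for.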
The B&N framework (FYI – not really needed in this presentation)
Level 1 Base words
Level 2 Base words + inflections ('lemmas')
-s (on noun or verb), -ed/-ing (on verb), -er (er2)/-est (on adjective), -th (on number), and -en (en2) on irregular verb
-er2 and -en2 are named to separate them from verb+er in Level 3 and noun+en in Level 5
All the rest involve derivations (change in meaning and/or part of speech)
Level 3 Frequent and regular affixes with minimal change to the base word in speech or writing
-able/ible, -er/-or (on verb), -ish, -less, -ly, -ness, -th, -y,
non-, un-
Level 4 Frequent orthographically regular affixes which often impose pronunciation change (admIre => admirAtion)
-al (autumnal), -ation (admiration), -ess (fortress), -ful (plentiful), -ism (dogmatism), -ist (semanticist), -ity (solemnity) …
in-
Level 5 Less frequent but regular affixes
-age (leakage), -al (arrival), -ally (idiotically), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary),
anti- (anti-inflation), ante- (anteroom), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter
Level 6 Frequent but irregular affixes (often with significant change to base word)
-able (inscrutable), -ee (lessee), -ic (spastic), -ify (mollify), -ion (superstition), -ist (solipsist), -ition (transition), -ive
pre-, re-
Level 7 Classical affixes
-ar (circular), -ate (electorate), -et (packet, casket), -some (troublesome), -ure (departure, exposure)
ab- (abnormal), ad- (admixture), com- (commiserate), de- (demist), dis- (disintegrate), ex- (out - external), in-(in -
Method (2)
Mini-corpus of 250,000 words
5 instances each of 5 text types, run individually through Morpholex
• Applied linguistics articles (5)
• ‘Quality’ news stories (5)
• Classic novels (5)
• Simplified novels (7)
Results
How many derived forms by text type?
Summary 1
So it seems extensive knowledge of derivations is needed only for academic and quality press
  • While classic literature needs less
• Derivations are a little over 5% of words in texts
  • So almost 95% is just base words and inflections
• And graded (simplified) stories virtually none
  • Base and inflections alone get us to 95%, often to 98%
A rather optimistic picture
• And it gets better when we count up the number of different individual affixes (types) →
Results (2)
How many different affix types?
• Academic texts: even here, knowledge of just 3 affixes gets the learner to 95%
  • Another 9 to 98%
• News texts are similar: just 3 affixes get the learner to 95%
  • And another 6 affixes to 98%
• Novels are even better: zero affixes get the learner to 95%
  • And another 8 affixes to 98%
• Graded novels are better still: zero affixes get the learner to 95%
  • And another 2 affixes to 98%
• With averages for all text types of 2 and 6 →
Summary 2
• A very small number of derived forms (affixes) is needed across the types to reach 95%
  • (= “comprehension with resources” - surely the typical situation of a learner)
• And a manageable number to reach 98%
• And it gets even better if we look at the repetition of particular affixes →
Results (3) – Distribution of affixes across text types
1. Commonalities
2. Specificities
Overall picture is of a small handful of affixations in common use
(First 18 of 36) →
• Just 8 are > 80% of total
• Just 12 are > 90%
• Just 17 are > 95%
(All 36) →
• A further 18 affixes make up the remaining 5% of all affixations
Summary 3
A small handful of affixations form the vast majority
• Are easily within what an ESL learner can cope with
• Once the affixes are identified
A slightly larger but still manageable number of affixes form the rest
• But are little used
• Forming fewer than 5% of affixations
Summary 3b
Brown’s picture of impossible texts with 35% of words unknowable to learners with only lemmas…
is vastly exaggerated
Many texts are available with almost no derived forms
• And any text is accessible with minor preparation
  • = learning or being taught just 5-10 affixes, even in academic texts
  • Which will be re-encountered in whatever texts are presented thereafter
  • And would not need to be re-learned
Summary 3c
Tests yet again
So family-test scores are reliable predictors of text coverage for test-takers who know only lemmas
• So a score of 80% at 1k means that 80% × 70% = 56% of words are known at that level?
  • (i.e., with adjustment of 30% derivatives subtracted)
No!
• It means that 80% × 95% = 76% of words are known at that level
  • (with 5% derivatives subtracted)
  • And fewer derivatives in many text types
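The two adjustments being contrasted can be put side by side in a short sketch. The function name `lemma_only_share` is hypothetical; the 30% figure stands for the corpus-based estimate and the 5% figure for the text-based finding reported here.

```python
def lemma_only_share(family_test_score, derived_share):
    """Share of words at a level known to a learner who lacks all derived forms."""
    return family_test_score * (1.0 - derived_share)


score_1k = 0.80  # family-based test score at the 1k level

corpus_based = lemma_only_share(score_1k, 0.30)  # pessimistic, corpus-wide estimate
text_based = lemma_only_share(score_1k, 0.05)    # estimate from typical texts

print(round(corpus_based, 2), round(text_based, 2))  # 0.56 0.76
```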
Summary 3d
And the existing family-based coverage research is safe, valid, and worth carrying forward
This research is family based, but ~
• The vast majority of derived forms in a given family
  • While present in a 100 million word corpus of 4,000 texts in 100 text types
  • + needed so that every word in a VP gets categorized
• Will not be a significant component of particular texts that we might put in front of learners
  • Derivations will be well under 5% in most cases and under 2% in many
  • And any effort invested in learning these will be well repaid
  • Since they are few in number and much repeated
• So it is a tale of many family members that we don’t see very often
  • Except when amassed in a corpus
Summary 3e
It is worth mentioning that Paul Nation long ago proposed that lemma and family are not an opposition,
just different stages in word knowledge
It is kind of an accident that the coverage work was done exclusively with families
A parallel strand of coverage work could easily use the lemma
It’s just that it wasn’t done
→ A kind of oversight
How surprising is any of this?
English has been losing its morphology for centuries
Nominative, accusative, genitive, etc.
Inflections →
Derivations →
A topic for another day
So, is there a reduction in derivations over time?
Four non-literary texts over the centuries
Could be a trend
Further work
So with all that, where does this work go next?
• As well as supporting the future of the coverage research, it has some future of its own
  • To do with unused family members →
  • And developing word lists that accommodate these
• The purpose of exhaustive lists is clear
  • But surely more targeted lists would also have their uses
  • Or call them ‘nuclear’ lists
We saw that
• While many affixations exist across text types…
  • ~ly, un~
• …others form ‘a profile’ specific to particular text types
  • ~ful, ~y & ~ness in FICTION: wonderful, beautiful; dreamy, rainy; happiness, gladness
  • ~al & ~age in ACADEMIC/NEWS: societal, governmental, cultural; usage, coverage, percentage
Toward non-exhaustive or ‘nuclear’ lists
These too would have their uses
• The word forms most used
• Productive not just receptive
How to get such lists?
Method: ‘multiply’ complete generic family lists
• Against particular small corpora or text collections
  • Science corpus; general corpus; course materials
• And reduce the generic list to just the inflections and derivations actually found in the corpus
  • And even whole families (when none of their forms occur)
• This should be a useful list
  • Small
  • Usable for both production & reception
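One way to picture the ‘multiplication’ is as a filter: keep only the family members attested in the target corpus, with their counts. The sketch below is an assumption about how this could be done, not the Lextutor nuclear list builder; the function name, the three toy families and the toy corpus sentence are all invented for illustration.

```python
from collections import Counter


def nuclear_list(family_list, corpus_tokens):
    """Reduce each family to the forms attested in the corpus, with their counts.

    Families with no attested form are dropped entirely.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    reduced = {}
    for head, forms in family_list.items():
        attested = {form: counts[form] for form in forms if counts[form] > 0}
        if attested:
            reduced[head] = attested
    return reduced


# Toy generic family list and toy corpus.
families = {
    "able": ["able", "ability", "abler", "ablest", "ably", "unable", "inability"],
    "accept": ["accept", "acceptable", "acceptance", "accepted", "accepts"],
    "absolute": ["absolute", "absolutely", "absolutist"],
}
corpus = "she was unable to accept that his ability was acceptable and accepted it".split()

nuclear = nuclear_list(families, corpus)
print(nuclear)
```

In this toy run the ‘absolute’ family disappears and the other two shrink to the handful of forms actually used, which is the kind of reduction the method aims at.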
Size reductions at 1k
List                          Families/lemmas    Word types    Share of family-list types
BNC-COCA 1k as FAMILIES       1,000 families     6,866         -
BNC-COCA 1k as LEMMAS         998 lemmas         3,316         3316/6866 = 48%
Family x Brown corpus         984 families       4,723         4723/6866 = 69%
Family x BNC-Medical          814                3,459         3459/6866 = 50%
Family x BNC-Law              505                2,569         2569/6866 = 37%
Family x BAWE-Engineering     712                3,283         3283/6866 = 48%
Graded story                  996                2,442         2442/6866 = 36%
To note
Many of these reduced lists have fewer items (types) than the 1k lemma list
• But without loss of valuable items (all these items are IN the target corpus, × number of occurrences)
And the reductions are even greater as we move away from the high-frequency 1k zone →
Size reductions at 4k
List                          Families/lemmas    Word types    Share of family-list types
BNC-COCA 4k as FAMILIES       1,000 families     4,868         -
BNC-COCA 4k as LEMMAS         998 lemmas         2,911         2911/4868 = 60%
Family x Brown corpus         817 families       2,928         2928/4868 = 60%
Family x BNC-Medical          395                1,742         1742/4868 = 36%
Family x BNC-Law              505                648           648/4868 = 13%
Family x BAWE-Engineering     334                528           528/4868 = 11%
Graded story                  N/A                N/A           N/A
Conclusion: [Fam v. Lem] or [Present v. Absent]?
• Rather than starting from a lemma list
  • With loss of valuable items
  • And guessing at what learners know and don’t know
• It is better to start from a complete family list
  • And reduce it to what is present in a corpus
  • (properly chosen or designed)
• And teach learners the few derivations likely to be involved
  • While maintaining all the useful research that has already been done in the family framework
  • And the tests
  • And the computer tools
Uses of nuclear lists?
• Students love lists
  • These are targeted
• Great concordance exercises
• Small size lends itself to flashcards
• With Text Lex Compare, a nuclear list can be compared to an exam script
  • Exam should be no more than 2% novel lexis
Applications to French
Some of these issues do not arise, because…
Family lists have never been developed for French
• Even lemma lists have attracted only minor interest within didactiques
• Lextutor however does have a
  • lemma-based French profiler and
  • lemma-level vocabulary size test
The corpus used by Lextutor
Corpus   List
Lextutor has 25 French 1,000-lemma lists & calculates coverage
The lexicon divided into groups of 1,000 lemmas
1k 2k 3k 4k 5k etc. … → 25k
Tests of learners’ vocabulary size or level
• Validated and published example (the “TTV”, RCLV, 2016)
But the lists themselves get little use
• “Didactique du français” sees little value in lists
  • Because of the background of personnel?
  • Literature-oriented people find lists mechanical
  • But they also find testing, CALL, etc. mechanical
  • Corpus-oriented people are lemma oriented – lemma is easy to compute
• But lemma lists make no sense pedagogically
  • Huge numbers of levels
  • Full of redundancy
  • École is K1, écolier is K8
  • So that once we get to about 5k there is less to learn in each ‘higher’ list →
  • In most of these the original lower-k base word is present and obvious
Applications to French
• Didactique du français has never seen much value in lists
  • Because reading is rarely a predominant goal?
  • No comparable stampede of overseas L2 learners to Francophone universities as in English
  • Where reading will be maker or breaker
• So the burning questions of coverage research
  • Make less sense
  • Similarly the categorization of words it depends on
This could be about to change
• Gothenburg University (Sweden) French Dept has invited me to help them develop a set of family lists
  • March 2020
  • Christina Lindquist & colleagues
• They want to develop a quantitative/computational approach to teaching French
• And find the lemma useless for ~
  • Teaching
  • Testing
  • Sequencing materials
  • i.e. practice
In which case…
All the same questions would arise that arose in English
• Do learners know words qua ~
  • Individual words
  • Lemma groups
  • Family groups?
Ripe with opportunities for empirical research!
My suspicion?
Derived forms are less of a problem in French than English
So French family lists are worth looking at
And particularly the nuclear lists to be derived therefrom
Particularly to get rid of those long lists of little-used verb forms
References →
When this is (pre-)published the link will be here :
lextutor.ca/research/ (Click ‘papers’)
Selected References
COVERAGE
• Laufer & Ravenhorst-Kalovski (2010). ‘Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension,’ Reading in a Foreign Language 22: 15–30.
• Schmitt, Jiang & Grabe (2011). ‘The percentage of words known in a text & reading comprehension,’ Modern Language Journal 95:26–43.
FAMILIES
• Bauer & Nation (1993). ‘Word families,’ International Journal of Lexicography 6: 253–79.
FAMILY CRITICS
• Brown (2018). ‘Examining the word family through word lists,’ Vocabulary Learning & Instruction 7: 51–65.
VOCABPROFILE
• https://www.lextutor.ca/vp/comp/
FAMI/LEMMATIZER
• https://www.lextutor.ca/familizer/
LEVELS TESTS
• https://www.lextutor.ca/tests/
COVERAGE CALCULATOR
• https://www.lextutor.ca/cover/
MORPHOLEX
• https://www.lextutor.ca/cgi-bin/morpho/lex/
NUCLEAR LIST BUILDER
• https://www.lextutor.ca/freq/nuclear/
RESEARCH
• https://www.lextutor.ca/research/