Title: PowerPoint Presentation. Author: Tom Cobb. Created: 12/1/2019 10:43:53 AM

https://lextutor.ca/nancy_present.pdf

Word families vs. lemmas as the counting unit in text coverage research – summary of the debate and resolution

• The acquisition of vocabulary is primary in all aspects of language learning, and vocabulary is only manageable through computational analysis of spoken or written texts or corpora. This presentation will look at some issues in analysing vocabulary and learning to read in English.

• Text analysis requires some sort of grouping of words, the two main grouping principles being the word family and the lemma. Families include both inflected and derived forms (analyse and analysis); lemmas include only inflected forms (analyse and analyses). The word family is a pedagogical extension of the lemma, and has been used extensively in testing and in coverage research into the amount of vocabulary that must be known for various kinds of reading.

• Lately, however, this research has been challenged by supporters of the lemma, on the grounds that learners cannot be assumed to know all the derived forms of a typical word family. But do they have to? Are there enough derived forms in use in typical texts to affect the coverage research results? Morpholex is a text analysis program that was developed to answer these questions.

• The presentation will be mainly in and about English, but connections to French and learning French will be elaborated.



Outline of presentation

Definitions
• Family
• Lemma
• Coverage

Family’s role in coverage research

The lemmatizers’ challenge
• Their coverage methodology
• Retooled

Methodology
• Development of Morpholex
• Making of a mini-corpus
• Typical reading materials at 4 levels

Results
• What proportion of these texts are derived forms?
• How many individual derived forms are involved?

Next chapter
• Nuclear lists

Background issues

Interpreting corpora in language pedagogy

Complementarity of corpus + empirical findings

Language teachability
• And raising it through text analysis

Coding as research

Two kinds of groupings ~ Family v. lemma

…expanded into lists of 1,000 (families or lemmas)

1k by family:
a, an, able, ability, abler, ablest, ably, abilities, unable, inability, about, absolute, absolutely, absolutist, absolutists, accept, acceptability, acceptable, acceptably, unacceptable, acceptance, accepted, accepting, accepts, unacceptably, account, accounted, accounting

1k by lemma:
a, an, able, abler, ablest, about, absolute, absolutes, absolutest, accept, accepted, accepting, accepts, account, accounts, accounted, accounting, achieve, achieved, achieving, achieves, across, act, acts, acted, acting, active, actives

Lemma is
• Base (or head) word
• + Inflections

Family is
• Base (or head) word
• + Inflections
• + Derivations

Derivation can involve
• Change of POS (able → ability)
• Change of meaning (able → unable)
• Big change to base word (able → ability)
• Few ‘rules’

Inflection: none of these (able → abler)
• Hence ‘easier’

Where do all these variant forms come from?
• A corpus
• Frequency > x
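The difference between the two groupings can be sketched in a few lines of Python. This is only an illustration: the `ABLE_ENTRY` dictionary and the two function names are hypothetical, and the split of forms into inflections and derivations is hand-coded from the 'able' entries shown on this slide rather than taken from any real list file.

```python
# Hypothetical, hand-coded entry for the head word 'able' (forms taken from the slide).
ABLE_ENTRY = {
    "head": "able",
    "inflections": ["abler", "ablest"],
    "derivations": ["ability", "abilities", "ably", "unable", "inability"],
}

def lemma_members(entry):
    """A lemma is the head word plus its inflections."""
    return [entry["head"]] + entry["inflections"]

def family_members(entry):
    """A family is the lemma plus the derived forms."""
    return lemma_members(entry) + entry["derivations"]

lemma = lemma_members(ABLE_ENTRY)
family = family_members(ABLE_ENTRY)
print(len(lemma), "forms in the lemma;", len(family), "forms in the family")
```

Under a lemma-based list, forms such as 'ability' or 'unable' would be counted as separate items, possibly at different frequency levels.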


And in French ~ French still uses lemmas (though maybe not for long)

âgée, though similar, is at a different k-level

école does not include écolier

économie and économique are both 1k but count as 2 items

Why do we need groupings?

Computationally
• A modern corpus is millions of individual words (‘tokens’)
• Impossible to manage individually

Pedagogically
• If a learner knows “cat” there is no reason to treat “cats” as a new word
• But: “catty”?
  • Syntax: different part of speech
  • Semantics: rarely applied to cats themselves
  • Students may know cat and not really know catty
• It is a question of how much to include in the groupings

Why do we need groupings?

• To profile the frequency of words in texts
  • in a clear and useful manner
• And discover the lexical challenge of different texts
  • Especially in conjunction with vocabulary tests employing the same measure
• What do you notice in this profile?

Coverage – the magic numbers 95 and 98

• Coverage = the extent to which a certain word list ‘covers’ (accounts for, contains) a given percentage of a text or corpus

• The coverage points of pedagogical interest have been determined by empirical (not computational) research:
  • Texts can be comprehended with resources when 95% of individual word tokens are known
    • 95% typically corresponds to 5,000 families known
  • Texts can be comprehended independently when 98% of the individual word tokens are known
    • 98% typically corresponds to 8,000 families known

• Those are the words learners need
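As a rough illustration of the coverage definition above, the sketch below computes the percentage of tokens in a text that appear in a set of known word forms. The function name `coverage`, the toy sentence and the toy word set are all assumptions made for this example; a real profiler would match tokens against family or lemma lists rather than a flat set of forms.

```python
def coverage(tokens, known_forms):
    """Percentage of running words (tokens) that appear in known_forms."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return 100.0 * known / len(tokens)

# Toy text of 10 tokens; the learner knows every form except 'catty'.
tokens = "the cat sat on the mat with a catty look".split()
known_forms = {"the", "cat", "sat", "on", "mat", "with", "a", "look"}

pct = coverage(tokens, known_forms)
print(f"{pct:.0f}% coverage")  # 9 of the 10 tokens are known
```

Note that coverage is counted over tokens, so a frequent word such as 'the' contributes every time it occurs.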

Average text: 95% = c. 5,000 word families; 98% = c. 8,000 word families

So is this a ‘difficult’ or ‘easy’ text?


How many words do learners know?
For this we use receptive, family-based testing at each k-level

Typically, % score at a level x 1,000 families = learners’ receptive knowledge at that level

Now we can match ‘knows’ with ‘needs to know’ for particular texts (or text types) →

For example

Suppose a learner’s score on the VST is this:
• 1k = 80%
• 2k = 70%
• 3k = 0

And the text he is reading profiles like this (a typical situation):
• 80% at 1k
• 10% at 2k
• 10% at 3k

Then his 1k knowledge gives him 80% x 80% = 64% of the words in the text

And his 2k knowledge gives him 70% x 10% = 7% of the words in the text

So this learner is reading a text with 64% + 7% = 71% of its words known
• While research suggests 95% is minimally needed
• What does it feel like to read a text at 71% coverage? →
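The arithmetic of this example can be written as a short sketch. The function name `predicted_coverage` and the dictionary layout are assumptions for illustration; the numbers are the ones on the slide.

```python
def predicted_coverage(test_scores, text_profile):
    """Sum over k-levels of (share of the level known) x (share of the text at that level)."""
    return sum(test_scores.get(level, 0.0) * share
               for level, share in text_profile.items())

# Learner's test scores per k-level (proportion of that level known).
test_scores = {"1k": 0.80, "2k": 0.70, "3k": 0.0}
# Proportion of the text's tokens found at each k-level.
text_profile = {"1k": 0.80, "2k": 0.10, "3k": 0.10}

known_share = predicted_coverage(test_scores, text_profile)
print(f"{known_share:.0%} of the words in the text are known")
```

The same function can be reused with any number of k-levels, as long as the test and the text profile use the same counting unit.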


71% coverage is far from 95%


This research, however, is based on the family as the unit of word-counting
• It would not be the same if we used the lemma as the counting unit
• Look at 2 profiles → for the same text (Rex M.)
• The difference being (presumably) ± the derived word forms

As in text, so in a corpus


Families & lemmas: pedagogical pros & cons

FAMILY
• Whole family is together in one place
• Lemma will put very similar forms in widely separated lists
  • ADAPT k=3
  • ADAPTATION k=6
  • ADAPTABLE k=12
• Seems inefficient
• Lemma underestimates the learner

FAMILY
• A small number of family lists cover any text or even corpus
• Exhaustive - every word will get classified by the profiler
• While lemma needs many lists to profile a text
  • 50% more
  • With lots of redundancy
  • Probably unusable in practice
• But family overestimates the learner?

The issue: 100 of the best L2 reading studies in recent years…

• Are based on the word family
  • E.g., Nation, Laufer, Schmitt, Grabe, …
• But is it a convenience to use this unit or a principled decision?
  • Family = smaller set of lists; tidier computer output; easier for practitioners to understand; matches common sense (1k = speech, 3k+ = text, etc.)

A group of researchers, primarily in Japan…

• Working with learners who have little contact with English outside the classroom
  • And a very examination-driven approach to language learning
• These researchers strongly dispute the use of the word family
• These researchers argue that no knowledge beyond the lemma can be assumed in their learners
  • run, runs, running
  • runner ×, a run ×
• And therefore the coverage research, based on assumed word-family knowledge, does not describe most Japanese learners
  • Or many other learners worldwide

→ A serious problem

Which has occupied vocabulary research conferences for the past five years

The problem could be language interference

• L1-L2 differences in affixation
• Japanese affixation is like compounding?
  • Both parts remain identifiable
• English can twist the base word quite severely to add an affix
  • Able - ability
• Particularly affecting the pronunciation
  • Pronounce - pronunciation
  • Such that base word becomes less identifiable
• French may be less problematic here
  • “L’accent tonique” means base word is not lost?
    • In orthography or pronunciation
  • Fr: Science - scientifique
  • Eng: ScIence - scIenTIFic

Fam v. Lem could have been just one more interminable debate…

Of the type we know so well
Except that one of the Japan researchers, Dale Brown, pointed to a way forward

• Brown asked whether/how much derived forms are in fact used in texts
  • Hoping to show they are used a great deal
  • To explain his learners’ weak reading ability
• Specifically, are derived forms needed to reach the 95% and 98% coverage points?
  • If Yes, then the family-based coverage research does not apply to learners who know only lemma forms
    • For the words they know at all
• A brilliant idea to measure word forms’ contributions to coverage
  • Except that Brown didn’t go all the way

What Brown did
Method ~ For the first 5,000 head words in Nation’s BNC-COCA family lists ~
• He took a random 100-head-word sample from each 1,000
  • (= 500 head words total)
• Then looked up all the forms for each family on the online British National Corpus
  • Several look-ups for each family
    • since BNC is lemmatized
  • … adding up the frequency figure for all inflected and derived forms in each family
• (This must resemble Paul Nation’s original fleshing out of these lists as families, but for 500, not 25,000, words)

Example

Word form      BNC frequency
Able                  29,657
Ability                9,054
Abilities              1,324
Abler                      0
Ably                      96
Unable                 6,134
Inability              1,087
Inabilities                5
TOTAL                 47,357

So
• Total word-forms: 47,357
• Derived forms: 17,700 (all but ‘able’ and ‘abler’)
• Percent derived forms: 37.37%

And so on for 500 random families
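For one family, the tally above could be reproduced as follows. The counts are those shown for 'able'; the names `BNC_COUNTS`, `LEMMA_FORMS` and `derived_share` are hypothetical, and treating 'able' and 'abler' as the only non-derived forms simply follows the slide.

```python
# Frequencies for the 'able' family as shown in the example.
BNC_COUNTS = {
    "able": 29657, "ability": 9054, "abilities": 1324, "abler": 0,
    "ably": 96, "unable": 6134, "inability": 1087, "inabilities": 5,
}
# Forms treated as belonging to the lemma (base word + inflections), per the slide.
LEMMA_FORMS = {"able", "abler"}

def derived_share(counts, lemma_forms):
    """Return (total tokens, derived tokens, percent derived) for one family."""
    total = sum(counts.values())
    derived = sum(n for form, n in counts.items() if form not in lemma_forms)
    return total, derived, 100.0 * derived / total

total, derived, pct = derived_share(BNC_COUNTS, LEMMA_FORMS)
print(total, derived, f"{pct:.1f}% derived")
```

Repeating this over a sample of families and averaging the percentages would give a corpus-level estimate of the kind discussed next.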

So if c. 35% of a given family in the British National Corpus consists of derivations…

• Then a learner who knew all the words in the corpus
  • But only as lemmas, not as derivations
  • Would be reading this corpus with c. 65% coverage
  • And comprehension would be low
• But: do learners read a corpus?
  • No
• Can what’s in a corpus be extrapolated to what’s in its constituent texts?
  • A good question
  • A clue to its answer is in another part of the BNC output →

BNC is c. 100 million word tokens

Comprising c. 4,000 texts
• in 100 text-types
• of c. 25,000 words each

• Output tracks each search-term back to individual texts
  • So derived form ‘ability’ is in 2090/4048 = 52% of BNC’s texts
• So far, so good for Brown’s argument

But BNC also gives the distribution of these texts in different parts of the corpus

Ex, ‘Ability’

• Barely present in speech, or in fiction
  • Fiction: 438 hits in > 16 million words
  • News: 718 hits in > 9 million words
• This raises the question
  • Can whole-corpus coverages be described as general?
  • And specifically: do they represent the type of texts that ESL learners typically read?

How can we track derivations in texts that learners typically read?
• Finding typical texts is no problem
  • Beginners - graded readers
  • Intermediates - novels and newspapers
  • Advanced and non-native TESL trainees - academic research articles

But how to count up the proportion of derived forms in these texts?
• Specifically, how to determine if there are enough to undermine 95% and 98% coverage for learners who know only lemmas
  • (For the words they know)

Enter Morpholex

Formerly a minor Lextutor routine

• A “list profiler”
  • For cracking a family (or lemma) list into its base words and various morphologies
• Example: here is k2 at Bauer and Nation Level 2 (with just inflected forms) →

Extended in 2019 from list profiler to text profiler

To note:
• Count (# tokens) and coverage (% of tokens) for each level are given
  • Ex, Level 3: 12 derivs, 1.7% of text
• Cumulative %’s are given
  • Indicating the point where 95 and 98% are met/surpassed
• Particular affixes are identified that were needed to reach 95 and 98%
  • Ex, Base + Inflect + 1 deriv affix (~ion) were needed to reach 95%

Baseword check
• A few errors
  • 6/695 words (1%)
• Many derivations where it is not the case that a baseword could have been known without the affix
  • formidable - able = formid ?

Definitions & program design (= method)

Levels 1-7?
• These are Bauer & Nation’s (1993) framework for identifying morphology levels
• By frequency, transparency, regularity, degree of change imposed on the base word - ‘difficulty’

LEVELS
• 1 = base words
• 2 = inflected forms
• 3-7 = derived forms

Total number of affixes in the B&N scheme: 100 (not exhaustive)

The Morpholex organigram

Level by level, 1-7 ~
• Each word is matched against a list of B&N prefixes and suffixes
• If an affix is found, the program asks: is the remainder without this affix present in a list of all possible words?
  • With adaptations like ‘stun’ → ‘stunn’ so that ‘stunning’ minus ‘~ing’ is a real word
  • If Yes, it’s an inflected or derived form
  • If No, it’s a base word
• I.e., it’s considered a derived form only if learners could have recognized the base word if not extended into a derived form ***
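The matching logic described above might look roughly like the sketch below. It is not Morpholex's code: the affix lists and the `KNOWN_WORDS` set are tiny illustrative stand-ins for the B&N affixes and the "list of all possible words", all names are hypothetical, and only one affix is removed per word.

```python
# Tiny illustrative stand-ins (assumptions), not the B&N or Morpholex data.
INFLECTIONS = ["ing", "ed", "est", "er", "s"]      # Level 2
DERIV_SUFFIXES = ["ation", "ness", "able", "ly"]   # a few Level 3+ suffixes
DERIV_PREFIXES = ["un", "non"]
KNOWN_WORDS = {"stun", "accept", "happy", "able", "form", "admire", "quick", "walk"}

def base_candidates(stem):
    """Possible spellings of the base once an affix has been removed."""
    candidates = [stem, stem + "e"]              # admir -> admire
    if len(stem) > 2 and stem[-1] == stem[-2]:
        candidates.append(stem[:-1])             # stunn -> stun
    if stem.endswith("i"):
        candidates.append(stem[:-1] + "y")       # happi -> happy
    return candidates

def classify(word):
    """Return ('inflected' | 'derived' | 'base', base word)."""
    word = word.lower()
    for kind, suffixes in (("inflected", INFLECTIONS), ("derived", DERIV_SUFFIXES)):
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                for candidate in base_candidates(word[: -len(suffix)]):
                    if candidate in KNOWN_WORDS:
                        return kind, candidate
    for prefix in DERIV_PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in KNOWN_WORDS:
            return "derived", word[len(prefix):]
    # No affix leaves a recognizable remainder, so the word counts as a base word.
    return "base", word

for w in ["stunning", "admiration", "unable", "formidable"]:
    print(w, classify(w))
```

In this sketch 'formidable' stays a base word because 'formid' is not in the word list, which is the behaviour the baseword check is looking for. The sketch does not separate the two uses of -er (inflectional on adjectives, derivational on verbs) or handle words with several affixes.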


The B&N framework (FYI – not really needed in this presentation)


Level 1 Base words

Level 2 Base words + inflections ('lemmas')

-s (on noun or verb), -ed/-ing (on verb), -er (er2)/-est (on adjective), -th (on number), and -en (en2) on irregular verb

-er2 and -en2 are named to separate them from verb+er in Level 3 and noun+en in Level 5

All the rest involve derivations (change in meaning and/or part of speech)

Level 3 Frequent and regular affixes with minimal change to the base word in speech or writing

-able/ible, -er/-or (on verb), -ish, -less, -ly, -ness, -th, -y,

non-, un-

Level 4 Frequent orthographically regular affixes which often impose pronunciation change (admIre => admirAtion)

-al (autumnal), -ation (admiration), -ess (fortress), -ful (plentiful), -ism (dogmatism), -ist (semanticist), -ity (solemnity

in-

Level 5 Less frequent but regular affixes

-age (leakage), -al (arrival), -ally (idiotically), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary),

anti- (anti-inflation), ante- (anteroom), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter

Level 6 Frequent but irregular affixes (often with significant change to base word)

-able (inscrutable), -ee (lessee), -ic (spastic), -ify (mollify), -ion (superstition), -ist (solipsist), -ition (transition), -ive

pre-, re-

Level 7 Classical affixes

-ar (circular), -ate (electorate), -et (packet, casket), -some (troublesome), -ure (departure, exposure)

ab- (abnormal), ad- (admixture), com- (commiserate), de- (demist), dis- (disintegrate), ex- (out - external), in-(in -

Method (2)

Mini-corpus of 250,000 words

5 to 7 instances each of 4 text types, run individually through Morpholex

• Applied linguistics articles (5)
• ‘Quality’ news stories (5)
• Classic novels (5)
• Simplified novels (7)

Results

How many derived forms by text type?

Summary 1

So it seems extensive knowledge of derivations is needed only for academic and quality press
• While classic literature needs less
• Derivations are a little over 5% of words in texts
• So almost 95% is just base words and inflections
• And graded (simplified) stories virtually none
• Base and inflections alone get us to 95, often to 98

A rather optimistic picture
• And it gets better when we count up the number of different individual affixes (types) →

Results (2)

How many different affix types?

Even in academic texts, knowledge of just 3 affixes gets learner to 95%

Another 9 to 98%

Results (2)

News texts similar

Just 3 affixes gets learner to 95%

And another 6 affixes to 98%

Results (2)

Novels even better

Zero affixes gets learner to 95%



Results (2)

Graded novels better still

Zero affixes gets learner to 95%

With the average for all text types → 2 and 6

Summary 2

• A very small number of derived forms (affixes) is needed across the types to reach 95%
  • (= “comprehension with resources” - surely the typical situation of a learner)

• And a manageable number to reach 98%

• And it gets even better if we look at the repetition of particular affixes →

1. Commonalities
Results (3) – Distribution of affixes across text types

2. Specificities
Results (3) – Distribution of affixes across text types

Overall picture is of a small handful of affixations in common use

(First 18 of 36) →
• Just 8 are > 80% of total
• Just 12 are > 90%
• Just 17 are > 95%

(All 36) →
• A further 18 affixes make up the remaining 5% of all affixations

Summary 3

A small handful of affixations form the vast majority
• Are easily within what an ESL learner can cope with
  • Once the affixes are identified

• A slightly larger but still manageable number of affixes form the rest
  • But are little used
  • Forming fewer than 5% of affixations

Summary 3b

Brown’s picture of impossible texts with 35% of words unknowable to learners with only lemmas… is vastly exaggerated

Many texts are available with almost no derived forms
• And any text is accessible with minor preparation
  • = learning or being taught just 5-10 affixes, even in academic texts
  • Which will be re-encountered in whatever texts are presented thereafter
  • And would not need to be re-learned

Summary 3c

Tests yet again
So family-test scores are reliable predictions of text coverage for test-takers who know only lemmas

• So a score of 80% at 1k means that 80% x 70% = 56% of words are known at that level????
  • (i.e., with adjustment of 30% derivatives subtracted)

No!
• It means that 80% x 95% = 76% of words are known at that level
  • (with 5% derivatives subtracted)
  • And fewer derivatives in many text types
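The two adjustments contrasted above differ only in the share of derived forms assumed. A minimal sketch, with a hypothetical function name and the slide's figures as inputs:

```python
def lemma_only_known(test_score, derived_share):
    """Share of words at a level known to a learner who knows no derived forms."""
    return test_score * (1.0 - derived_share)

# Corpus-based assumption: about 30% of tokens are derived forms.
pessimistic = lemma_only_known(0.80, 0.30)
# Text-based finding: about 5% of tokens are derived forms.
realistic = lemma_only_known(0.80, 0.05)

print(f"{pessimistic:.0%} versus {realistic:.0%}")
```

With a derived share under 2%, as found for many texts, the adjusted figure would be closer still to the raw test score.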


Summary 3d

And the existing family-based coverage research is safe, valid, and worth carrying forward

This research is family based, but ~

• The vast majority of derived forms in a given family
  • While present in a 100 million word corpus of 4,000 texts in 100 text types
  • + needed so that every word in a VP (VocabProfile) gets categorized
• Will not be a significant component of particular texts that we might put in front of learners
  • Derivations will be well under 5% in most cases and under 2% in many
  • And any effort invested in learning these will be well repaid
    • Since they are few in number and much repeated
• So it is a tale of many family members that we don’t see very often
  • Except when amassed in a corpus

Summary 3e

It is worth mentioning that Paul Nation long ago proposed that lemma and family are not an opposition, just different stages in word knowledge

It is kind of an accident that the coverage work was done exclusively with families

A parallel strand of coverage work could easily use the lemma
It’s just that it wasn’t done
→ A kind of oversight

How surprising is any of this?

English has been losing its morphology for centuries

Nominative, accusative, genitive, etc.

Inflections →

Derivations →

A topic for another day

So - is there a reduction in derivations over time?
Four non-literary texts over the centuries

Could be a trend

Further work

So with all that, where does this work go next?

• As well as supporting the future of the coverage research, it has some future of its own
  • To do with → unused family members
  • And developing word lists that accommodate these
• The purpose of exhaustive lists is clear
  • But surely more targeted lists would also have their uses
  • Or call them ‘nuclear’ lists

We saw that

• While many affixations exist across text types…
  • ~ly, un~
• …others form ‘a profile’ specific to particular text types
  • ~ful, ~y & ~ness in FICTION
    • wonderful, beautiful; dreamy, rainy; happiness, gladness
  • ~al & ~age in ACADEMIC/NEWS
    • societal, governmental, cultural; usage, coverage, percentage

Toward non-exhaustive or ‘nuclear’ lists

These too would have their uses
• The word forms most used
• Productive not just receptive

How to get such lists?

Method: ‘multiply’ complete generic family lists
• Against particular small corpora or text collections
  • Science corpus; general corpus; course materials
• And reduce the generic list to just the inflections and derivations actually found in the corpus
  • And even whole families
• This should be a useful list
  • Small
  • Useable for both production & reception


Size reductions at 1k

List                        Groups           Word types   Types as share of 6,866
BNC-COCA 1k as FAMILIES     1,000 families   6,866
BNC-COCA 1k as LEMMAS       998 lemmas       3,316        3316/6866 = 48%
Family x Brown corpus       984 fams         4,723        4723/6866 = 69%
Family x BNC-Medical        814              3,459        3459/6866 = 50%
Family x BNC-Law            505              2,569        2569/6866 = 37%
Family x BAWE-Engineering   712              3,283        3283/6866 = 48%
Graded story                996              2,442        2442/6866 = 36%

To note
Many of these reduced lists have fewer items (types) than the 1k lemma list
• But without loss of valuable items
  (All these items are IN the target corpus x number of occurrences)

And the reductions are even greater as we move away from the high-frequency 1k zone →

Size reductions at 4k

List                        Groups           Word types   Types as share of 4,868
BNC-COCA 4k as FAMILIES     1,000 families   4,868
BNC-COCA 4k as LEMMAS       998 lemmas       2,911        2911/4868 = 60%
Family x Brown corpus       817 fams         2,928        2928/4868 = 60%
Family x BNC-Medical        395              1,742        1742/4868 = 36%
Family x BNC-Law            505              648          648/4868 = 13%
Family x BAWE-Engineering   334              528          528/4868 = 11%
Graded story                N/A              N/A

Conclusion: [Fam v. Lem] or [Present v. Absent]?

• Rather than starting from a lemma list
  • With loss of valuable items
  • And guessing at what learners know and don’t know
• It is better to start from a complete family list
  • And reduce it to what is present in a corpus
    • (properly chosen or designed)
  • And teach learners the few derivations likely to be involved
• While maintaining all the useful research that has already been done in the family framework
  • And the tests
  • And the computer tools

Uses of nuclear lists?

• Students love lists
  • These are targeted
• Great concordance exercises
• Small size lends itself to flashcards
• With Text Lex Compare, a nuclear list can be compared to an exam script
  • Exam should be no more than 2% novel lexis

Applications to French
Some of these issues do not arise, because…

Family lists have never been developed for French
• Even lemma lists have attracted only minor interest within didactiques
• Lextutor however does have a
  • lemma-based French profiler and
  • lemma-level vocabulary size test

The corpus used by Lextutor

Corpus, List

Lextutor has 25 French 1,000-lemma lists & calculates coverage

The lexicon divided into groups of 1,000 lemmas
1k 2k 3k 4k 5k etc. … → 25k

Tests of learners’ vocabulary size or level
• Validated and published example (the “TTV”, RCLV, 2016)

But the lists themselves get little use
• “Didactique du français” sees little value in lists
• Because background of personnel?
  • Literature-oriented people find lists mechanical
    • But they also find testing, CALL, etc. mechanical
  • Corpus-oriented people are lemma oriented - lemma is easy to compute
• But lemma lists make no sense pedagogically
  • Huge numbers of levels
  • Full of redundancy
    • École is K1, écolier is K8
  • So that once we get to about 5k there is less to learn in each ‘higher’ list →
  • In most of these the original lower-k base word is present and obvious

Applications to French
• Didactique du français has never seen much value in lists
  • Because reading is rarely a predominant goal?
  • No comparable stampede of overseas L2 learners to Francophone universities as in English
    • Where reading will be maker or breaker
• So the burning questions of coverage research
  • Make less sense
  • Similarly the categorization of words it depends on

This could be about to change

• Gothenburg University (Sweden) French Dept has invited me to help them develop a set of family lists
  • March 2020
  • Christina Lindquist & colleagues
• They want to develop a quantitative/computational approach to teaching French
• And find the lemma useless for ~
  • Teaching
  • Testing
  • Sequencing materials
  • i.e. Practice

In which case…
All the same questions would arise that arose in English

• Do learners know words qua ~
  • Individual words
  • Lemma groups
  • Family groups?

Ripe with opportunities for empirical research!

My suspicion?
Derived forms are less of a problem in French than English

So French family lists are worth looking at
And particularly the nuclear lists to be derived therefrom

Particularly to get rid of those long lists of little-used verb forms

References →

When this is (pre-)published the link will be here :

lextutor.ca/research/ (Click ‘papers’)

Selected References

COVERAGE

• Laufer & Ravenhorst-Kalovski (2010). ‘Lexical threshold revisited: Lexical text coverage, learners’ vocabulary size and reading comprehension,’ Reading in a Foreign Language 22: 15–30.

• Schmitt, Jiang & Grabe (2011). ‘The percentage of words known in a text & reading comprehension,’ Modern Language Journal 95:26–43.

FAMILIES

• Bauer & Nation (1993). ‘Word families,’ International Journal of Lexicography 6: 253–79.

FAMILY CRITICS

• Brown (2018). ‘Examining the word family through word lists,’ Vocabulary Learning & Instruction 7: 51–65.

VOCABPROFILE

• https://www.lextutor.ca/vp/comp/

FAMI/LEMMATIZER

• https://www.lextutor.ca/familizer/

LEVELS TESTS

• https://www.lextutor.ca/tests/

COVERAGE CALCULATOR

• https://www.lextutor.ca/cover/

MORPHOLEX

• https://www.lextutor.ca/cgi-bin/morpho/lex/

NUCLEAR LIST BUILDER

• https://www.lextutor.ca/freq/nuclear/

RESEARCH

• https://www.lextutor.ca/research/
