The stumbling blocks in corpus-based research of interlanguage phraseology. Przemysław Kaszubski, School of English, Adam Mickiewicz University, Poznań, Poland. PLM 2001, Bukowy Dworek, 27 April 2001


The stumbling blocks in corpus-based research of interlanguage phraseology

Przemysław Kaszubski

School of English

Adam Mickiewicz University

Poznań, Poland

PLM 2001, Bukowy Dworek, 27 April 2001


Corpus linguistics: central problems

Representativeness (corpus design, compilation criteria, etc.)

Annotation (& disambiguation) of data

Some basic questions:

• How much to annotate? (whole corpus? 1 part of speech, 1 lemma, etc.)
• How deep an analysis?
• How large is the corpus?
• What and whom are the results for?
• Corpus-based or corpus-driven procedures?


Methodological premises of my research (1)

EFL learners’ overuse of high-frequency words: what does it mean?

• Intensive collocability of core lexical items
• Multi-word extensions (compounds, coinages, idioms, expressions, phrasals)

Confrontation

• Available corpus-driven extraction methods vs. pedagogical usefulness: L1-perspective (the role of transfer)


Methodological premises (2)

multi-corpus scheme with Polish advanced EFL learner data as hub data

variables: a) genre / text-type; b) L1; c) proficiency level; d) age / maturity level

Lemma-based approach (as opposed to wordform- or family-oriented approaches)


The corpus base: full specification

ENGLISH CORPORA

Corpus        Content                                       Speakers    Type          Level                  Tokens
PLLC          Polish intermediate EFL                       non-native  ‘apprentice’  1. Intermediate        92,712
SPAN          Spanish (upper-)intermediate EFL              non-native  ‘apprentice’  2. Upper-intermediate  94,965
FREN          Belgian-French advanced EFL                   non-native  ‘apprentice’  3. Advanced            101,442
IFA-P (ICLE)  Polish advanced EFL                           non-native  ‘apprentice’  3. Advanced            107,990
LOCN (ARG)    British and American college learner English  native      ‘apprentice’  4. College             106,255
MCONC         British academic writing                      native      ‘expert’      5. Professional        97,914
LOB&BROWN     British and American quality press            native      ‘expert’      5. Professional        94,421

POLISH CORPORA

Corpus        Content                                          Type          Level                  Tokens
POL-STUD      Polish college compositions                      ‘apprentice’  4. College             103,382
POL-EXP       Polish academic papers + quality-press articles  ‘expert’      5. Professional        101,348


The ‘extended’ tripartite idiomaticity model: the criteria

• lexical fixedness
• syntactic fixedness and / or anomaly
• semantic opacity
• lexicalisation / institutionalisation / specialisation / conventionality = frequency + distribution

implementation of the fourth criterion via external sources: BBI2 & LDOCE3


The ‘extended’ tripartite idiomaticity model: the levels (1)

frozen expressions:

• phrasals: ‘TAKE after sb’; ‘TAKE to (doing) sth’; ‘be taken aback’; ‘GIVE (sth) up’; ‘GIVE sb/o.s. away’
• MWUs: ‘GIVE rise to’; ‘GIVE way to sb/sth’; ‘GIVE sb a hand’; ‘TAKE care’; ‘TAKE place’; ‘TAKE for granted’; ‘TAKE advantage’; ‘TAKE root’; ‘TAKE effect’
• lexicalised compounds: ‘God-given’; ‘risk-taking’; ‘leave-taking’

restricted uses (1):

• restricted collocations & delexical uses: ‘TAKE drugs/ steps/ the form of/ advice/ decision/ initiative/ a bath/ a breath/ sleep’; ‘GIVE an account/ a lesson/ explanation/ sb/sth a name/ a concert/ permission/ a speech/ sb a warm welcome’


The ‘extended’ tripartite idiomaticity model: the levels (2)

restricted uses (2):

• special senses or uses: ‘GIVE results/ details/ data’; ‘TAKE <X minutes, year(s), months, hours, generations, life, etc.>’; ‘TAKE <sth> to mean <sth>’
• discourse formulae: ‘let's take X/an example of X/ X as an example etc.’

free combinations:

• regular (incl. transparent phrasals): ‘TAKE <sb> away/ to <a place> etc.’; ‘GIVE <sth> back’; ‘GIVE money’
• curious interlanguage usage: ‘?GIVE generalisation/ stabilisation to <sth>’; ‘?TAKE help/ behaviour’


The research hypotheses

negative correlation between proficiency level and frequencies of non-idiomatic uses

positive correlation between proficiency level and frequencies of idiomatic expressions except EFL learners’ ‘favourite expressions’

traceability of (at least) some ‘favourite expressions’ to L1


Automatic extraction precision & recall problems

• POS (part-of-speech) taggers’ error margin
• Word-sense disambiguation and / or syntactic parsing
• Collocation statistics
• Nature of learner language
• Inter-corpus comparability


Problem 1: error margin of POS taggers

Standard error margin: 5%

Affected: extraction of lemmas meeting POS criteria

Precision (noise in data): non-verbs tagged as verbs

• Not-telling VB(lex,montr,ingp) ?not-tel? ...(7)

• agressive VB(lex,intr,infin) ?agressive? ...(3)

• well-behaved VB(lex,montr,edp) ?well-behave? ...(2)

Recall (data ignored): verbs tagged as non-verbs / lexical verbs as auxiliaries:

• ... who in sharing their lives with a retarded sibiling [sic!] and taking <ADJ(ge,pos,ingp)> {taking} part in every-day care problems, may decide never to have ...
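
The precision and recall notions used on this slide can be made concrete with a small sketch. It assumes a hand-checked gold set of verb tokens and the set of tokens the tagger labelled as verbs; the token identifiers and counts below are invented for illustration and are not taken from the corpora.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of an extracted set against a gold set."""
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical token IDs: what the tagger called a verb vs. what a human confirmed.
tagged_as_verb = {"t1", "t2", "t3", "t4", "t5"}    # t4, t5: noise such as 'agressive'
gold_verbs = {"t1", "t2", "t3", "t6", "t7", "t8"}  # t6-t8: verbs missed, e.g. 'taking' tagged ADJ

p, r = precision_recall(tagged_as_verb, gold_verbs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Noise lowers precision and missed verbs lower recall, so the two figures have to be checked separately.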


Tracking & rectifying the POS errors

tagger built-in tag editor (TOSCA-ICLE): on-line targeting of precision & recall errors (UNTAGged and doubtful cases)

• Problem: insufficient query language: word OR lemma OR tag pattern

no tagger built-in editor: concordancer or editor needed to test for precision and recall

• Problem: either comprehensive or intuitive check

remaining difficulty: tagsets vs. research assumptions (gerunds & participles tagged as non-verbs)


Problem 2: semantic disambiguation and associations

sometimes only grouping data uncovers a meaningful type of association (Stubbs 1998:4)

automatic word-sense disambiguation (WSD) and machine-readable lexicons (e.g. WordNet 1.7, EuroWordNet): the Senseval Project

University of Lancaster disambiguation tool

Tools unavailable or not at implementable stage


Problem 3: corpus-driven collocation extraction (1)

lemmas or wordforms?

collocation vs co-occurrence (vs adjacency)

word clusters

• precision: many identified clusters have little linguistic significance (‘is the’; ‘of the’; ‘it BE a’)
• recall: many genuine collocations and MWUs are not contiguous (Kennedy 1998: 114) and may spill outside the typical 4:4 window (e.g. ‘TAKE care of...’ vs ‘TAKE good care of’; ‘the chance which were not eager to take’)
• stop-listing not quite possible with high-frequency items (BUT: Ted Pedersen’s ‘Bigram Statistics Package’: http://www.d.umn.edu/~tpederse/code.html)
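
The recall problem with a fixed 4:4 window can be illustrated with a minimal sketch. The function name and parameters are hypothetical, and matching is done on lower-cased wordforms rather than lemmas, which is a simplification of the lemma-based approach described earlier.

```python
from collections import Counter

def window_collocates(tokens, node, left=4, right=4):
    """Count words occurring within `left`/`right` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() != node:
            continue
        start = max(0, i - left)
        span = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(w.lower() for w in span)
    return counts

# Example quoted on the slide: 'chance' sits six tokens to the left of 'take'.
example = "the chance which were not eager to take".split()
narrow = window_collocates(example, "take")         # typical 4:4 window
wide = window_collocates(example, "take", left=6)   # widened left span
print("chance" in narrow, "chance" in wide)
```

Widening the span recovers the missed collocate but also admits more unrelated words, which is the precision cost noted above.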


Problem 3: corpus-driven collocation extraction (2)

co-occurrence statistics (WordSmith)

• precision: not all co-occurrence patterns testify to meaningful collocations
• recall: collocations may extend beyond typical 4:4 word spans
• MI: mostly identifies ‘idiosyncratic collocations’ (Oakes 1998: 90):

GIVE 172 2458

collocate     MI
birth         4.65
vote          4.24
opening       4.24
antibiotic    4.01
vaccination   3.91
ingenuity     3.91
isolate       3.43
habit         3.43
happiness     3.24
away          2.91

• WordSmith: output limited to 10 collocates
• Oliver Mason’s QWICK: MI with weighting factors for frequent words and unlimited display of collocates
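
Why MI favours rare, ‘idiosyncratic’ collocates can be seen from a small sketch. It uses a common textbook form of the measure, MI = log2(O / E) with E = f(node) x f(collocate) / N; this is an assumption, and WordSmith or QWICK may compute the score differently (for example by adjusting for span size). All counts are invented for illustration.

```python
import math

def mutual_information(joint, f_node, f_coll, n_tokens):
    """MI = log2(observed / expected), with expected = f_node * f_coll / N."""
    expected = f_node * f_coll / n_tokens
    return math.log2(joint / expected)

N = 100_000      # hypothetical corpus size
F_NODE = 200     # hypothetical frequency of the node lemma

# A rare collocate that nearly always occurs with the node ...
mi_rare = mutual_information(joint=4, f_node=F_NODE, f_coll=5, n_tokens=N)
# ... versus a very frequent word that co-occurs far more often in absolute terms.
mi_frequent = mutual_information(joint=60, f_node=F_NODE, f_coll=6000, n_tokens=N)

print(f"rare: {mi_rare:.2f}  frequent: {mi_frequent:.2f}")
```

The rare item scores higher even though it co-occurs only four times, which is why an MI-ordered list tends to be dominated by low-frequency collocates.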


Problem 3: corpus-driven collocation extraction (3)

co-occurrence statistics (z-score: TACT)

• z-score & t-score: better suited for frequent collocates but also mutual and imprecise on their own
• z-score ordered collocate list for BE: there; it; that; able; not; which; should; considered; by; likely; to; said; very; enough; why; important; concerned; what; always; worth; if; proved; afraid; used
• Mason’s QWICK: multi-test package, incl. also log-likelihood; modified log-likelihood; expected/observed ratio

Remaining problems

• stop-listing not quite possible with high-frequency item tests
• collocations outside a heuristic window
• lexical associations between collocates (synsets)
• semi-manual grouping of data essential (limitations)
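
The contrast between the z-score and the t-score can be sketched as follows. The formulas are simplified textbook approximations, t = (O - E) / sqrt(O) and z = (O - E) / sqrt(E), with E = f(node) x f(collocate) / N; they are assumptions and not necessarily what TACT or QWICK implement. All counts are invented.

```python
import math

def expected(f_node, f_coll, n_tokens):
    """Expected co-occurrence frequency under independence (span ignored)."""
    return f_node * f_coll / n_tokens

def t_score(observed, f_node, f_coll, n_tokens):
    return (observed - expected(f_node, f_coll, n_tokens)) / math.sqrt(observed)

def z_score(observed, f_node, f_coll, n_tokens):
    e = expected(f_node, f_coll, n_tokens)
    return (observed - e) / math.sqrt(e)

N, F_NODE = 100_000, 200                       # hypothetical corpus and node sizes
rare = dict(observed=4, f_coll=5)              # rare but strongly attached collocate
frequent = dict(observed=60, f_coll=6000)      # high-frequency grammatical word

for name, c in (("rare", rare), ("frequent", frequent)):
    t = t_score(c["observed"], F_NODE, c["f_coll"], N)
    z = z_score(c["observed"], F_NODE, c["f_coll"], N)
    print(f"{name:9s} t={t:.2f} z={z:.2f}")
```

Under these assumptions the t-score ranks the frequent word first, which is consistent with the function-word-heavy list for BE above and with the need for semi-manual grouping afterwards.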


Problem 4: the nature of learner data

Difference in proficiency levels essential in cross-corpus comparisons

Recall: misspelled words may get mistagged by taggers and overlooked by concordancers, unless edited beforehand

Wrong or inconsistent hyphenation may mislead taggers, e.g. ‘money making’ vs. ‘moneymaking’ vs. ‘money-making’

Unrecognised words vs. tagger default option tag
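
One way to surface inconsistent hyphenation before tagging is to group candidate forms under a key that ignores spaces and hyphens. This is only a sketch of such a pre-processing check: the function names are hypothetical, and the list of candidate forms is assumed to have been collected beforehand.

```python
import re
from collections import defaultdict

def variant_key(form):
    """Collapse spacing and hyphenation differences into a single key."""
    return re.sub(r"[\s\-]+", "", form.lower())

def group_variants(forms):
    """Return only those keys that are realised by more than one spelling."""
    groups = defaultdict(set)
    for form in forms:
        groups[variant_key(form)].add(form)
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}

found = group_variants(["money making", "moneymaking", "money-making", "risk-taking"])
print(found)
```

Each reported group still needs a manual decision on which spelling to standardise, since some open and hyphenated forms are legitimately different.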


Problem 5: cross-corpus comparability

genre homogeneity

topic-skewed distribution: heuristic method of isolation: sort by standard deviation

TAKE <sb/sth>   LOB&BR  MCONC  LOCN  IFA-P  FREN  SPAN  PLLC     SD
drugs*               0      0     2     43     6     0     0  15.90
steps                4      4     1     13     2     7     0   4.43
overdose             0      6     0      0     0     0     0   2.27
exercise             0      6     0      1     0     0     1   2.19
life**               0      0     6      1     1     0     0   2.19

* incl. marijuana, opium, chemical substances
** = kill
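
The sort-by-standard-deviation heuristic can be sketched with the counts from the table above. The sample standard deviation (n - 1 denominator) appears to reproduce the SD column; the variable names are illustrative only.

```python
from statistics import stdev

corpora = ["LOB&BR", "MCONC", "LOCN", "IFA-P", "FREN", "SPAN", "PLLC"]
counts = {                       # objects of TAKE, one count per corpus
    "drugs":    [0, 0, 2, 43, 6, 0, 0],
    "steps":    [4, 4, 1, 13, 2, 7, 0],
    "overdose": [0, 6, 0, 0, 0, 0, 0],
    "exercise": [0, 6, 0, 1, 0, 0, 1],
    "life":     [0, 0, 6, 1, 1, 0, 0],
}

# Items whose frequency is concentrated in one corpus get the highest SD.
ranked = sorted(counts, key=lambda obj: stdev(counts[obj]), reverse=True)
for obj in ranked:
    peak = corpora[counts[obj].index(max(counts[obj]))]
    print(f"{obj:10s} SD={stdev(counts[obj]):5.2f}  peak in {peak}")
```

Items at the top of the list are candidates for topic skew and can then be inspected in their source corpus.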


Summary

Difficult to find/compile truly homogeneous AND comparable sets of corpora = small corpus analysis often a necessity

With small corpora, mere automated methods of processing and analysis display insufficient precision and recall

Loss of data may prove too costly when pedagogical conclusions are sought

Instead of automatisation: increase the pace of assisted pre-processing and semi-manual analysis (disambiguation)

Dedicated new type of hybrid concordancer-editor needed


SOLUTION: dedicated concordancer-annotator

Feature 1: allow editing of concordance lines - text and/or tags and/or lemmas - like built-in tagger editors

Feature 2: allow adding custom information to concordance lines (specialised annotation / grouping of data)

Feature 3: allow saving concordances as text BACK into the corpus (pasting)

Feature 4: collocation annotation / statistics enhanced by links with phraseological dictionary

Feature 5: ???
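
How Features 1 to 3 might fit together can be sketched in a few lines. This is a toy illustration under stated assumptions, not a design of the proposed tool: the class, its methods, and the simplified tags (ADJ, VB, N, CONJ) are all hypothetical and do not follow the TOSCA-ICLE tagset.

```python
from dataclasses import dataclass, field

@dataclass
class Corpus:
    tokens: list                                 # dicts with 'word', 'tag', 'lemma'
    notes: dict = field(default_factory=dict)    # position -> custom annotation

    def find(self, **criteria):
        """Positions of tokens matching all given fields, e.g. word='taking'."""
        return [i for i, t in enumerate(self.tokens)
                if all(t[k] == v for k, v in criteria.items())]

    def kwic(self, pos, span=2):
        """Render one concordance line around position `pos`."""
        words = [t["word"] for t in self.tokens]
        left = " ".join(words[max(0, pos - span):pos])
        right = " ".join(words[pos + 1:pos + 1 + span])
        return f"{left} [{words[pos]}] {right}"

    def edit(self, pos, **changes):
        """Features 1 and 3: corrections are written straight back to the corpus."""
        self.tokens[pos].update(changes)

    def annotate(self, pos, note):
        """Feature 2: attach custom information to a concordance line."""
        self.notes[pos] = note

corpus = Corpus([
    {"word": "and", "tag": "CONJ", "lemma": "and"},
    {"word": "taking", "tag": "ADJ", "lemma": "taking"},   # mis-tagged participle
    {"word": "part", "tag": "N", "lemma": "part"},
])
for pos in corpus.find(word="taking", tag="ADJ"):
    print(corpus.kwic(pos))
    corpus.edit(pos, tag="VB", lemma="take")
    corpus.annotate(pos, "MWU: TAKE part")
print(corpus.find(lemma="take"))
```

Because the concordance lines point back to corpus positions, a correction made while reading a line is immediately visible to later lemma-based queries.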


This show will shortly be available from:

http://main.amu.edu.pl/~przemka/rsearch.html