Comparing Corpora using Frequency Profiling Paul Rayson and Roger Garside UCREL research group Computing Department Lancaster University, UK. www.comp.lancs.ac.uk/ucrel/


Page 1

Comparing Corpora using Frequency Profiling

Paul Rayson and Roger Garside

UCREL research group

Computing Department

Lancaster University, UK.

www.comp.lancs.ac.uk/ucrel/

Page 2

Comparing Corpora

Brown versus LOB (Hofland & Johansson, 1982)

Comparison at word form or annotation level

Information retrieval and extraction applications

Page 3

Two main types

Type 1:

– sample corpus v. larger ‘standard’ normative corpus

Type 2:

– two (roughly) equal sized corpora

Page 4

Main issues of concern

representativeness (balance)

homogeneity within the corpora

comparability of the corpora

reliability of statistical tests

Page 5

Statistics

Chi-squared unreliable

Mann-Whitney (Kilgarriff 1996)

Log-likelihood (Dunning 1993)

Page 6

Method

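O1 = a, O2 = b (observed frequency of the item in corpus 1 and in corpus 2)

N1 = c, N2 = d (total number of words in corpus 1 and in corpus 2)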

E1 = c*(a+b) / (c+d)

E2 = d*(a+b) / (c+d)

LL = 2*((a*log (a/E1)) + (b*log (b/E2)))

$$E_i = \frac{N_i \sum_i O_i}{\sum_i N_i}$$

$$-2\ln\lambda = 2\sum_i O_i \ln\left(\frac{O_i}{E_i}\right)$$
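The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `log_likelihood` is hypothetical, `log` is taken to be the natural logarithm, and a term with an observed frequency of zero is assumed to contribute nothing to the sum (the usual convention, since x ln x tends to 0).

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item.

    a, b: observed frequency of the item in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    """
    # Expected frequencies if the item were spread evenly over both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        # Assumption: a zero observed count adds nothing to the sum.
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total
```

With identical relative frequencies, for example `log_likelihood(10, 10, 1000, 1000)`, both observed values equal their expected values and the result is 0.0; the larger the difference in relative frequency, the larger the value.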

Page 7

Application (REVERE)

Systems engineering application

User interview transcripts, standards documents, user manuals

POS tagged with CLAWS

Semantic analysis

Wmatrix retrieval tool

– Frequency profiling and KWIC

Page 8

Air traffic control

Ethnographic studies at ATC centre

– Verbatim transcripts of observations and interviews with controllers

– Unstructured reports

– 103 pages

Page 9

Key semantic categories

Log-likelihood  Semantic tag  Word sense (and examples from the text)
3366            S7.1          power, organising (‘controller’, ‘chief’)
2578            M5            flying (‘plane’, ‘flight’, ‘airport’)
988             O2            general objects (‘strip’, ‘holder’, ‘rack’)
643             O3            electrical equipment (‘radar’, ‘blip’)
535             Y1            science and technology (‘PH’)
449             W3            geographical terms (‘Pole Hill’, ‘Dish Sea’)
432             Q1.2          paper documents and writing (‘writing’, ‘written’, ‘notes’)
372             N3.7          measurement (‘length’, ‘height’, ‘distance’, ‘levels’, ‘1000ft’)
318             L1            life and living things (‘live’)
310             A10           indicating actions (‘pointing’, ‘indicating’, ‘display’)
306             X4.2          mental objects (‘systems’, ‘approach’, ‘mode’, ‘tactical’, ‘procedure’)
290             A4.1          kinds, groups (‘sector’, ‘sectors’)
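A ranking like the one on this slide comes from computing the log-likelihood for every item in two frequency lists and sorting by the result. The sketch below illustrates that step under stated assumptions: the names `log_likelihood` and `frequency_profile` are hypothetical, the labels `tag_a`, `tag_b`, `tag_c` and their counts are invented toy data (not figures from the ATC study), and zero counts are assumed to contribute nothing to the sum.

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for frequencies a, b in corpora of size c, d."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # assumption: zero counts add nothing
            total += observed * math.log(observed / expected)
    return 2 * total


def frequency_profile(freq1, freq2):
    """Return (LL, item, freq in corpus 1, freq in corpus 2), highest LL first."""
    c = sum(freq1.values())
    d = sum(freq2.values())
    rows = []
    for item in set(freq1) | set(freq2):
        a = freq1.get(item, 0)
        b = freq2.get(item, 0)
        rows.append((log_likelihood(a, b, c, d), item, a, b))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


# Invented toy counts for illustration only.
sample = {"tag_a": 50, "tag_b": 30, "tag_c": 120}
normative = {"tag_a": 10, "tag_b": 30, "tag_c": 160}

profile = frequency_profile(sample, normative)
for ll, item, a, b in profile:
    print(f"{ll:8.2f}  {item}  {a}  {b}")
```

The log-likelihood value alone does not say which corpus uses the item more; comparing the relative frequencies a/c and b/d for each row gives the direction.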

Page 10

Conclusions

Method of comparing corpora using frequency profiling

Discovery of key items

Human verification of hypotheses

Applications in study of social differentiation in the use of English vocabulary, profiling of learner English and IE in SE domain