Comparing Corpora using Frequency Profiling
Paul Rayson and Roger Garside
UCREL research group
Computing Department
Lancaster University, UK.
www.comp.lancs.ac.uk/ucrel/
Comparing Corpora
Brown versus LOB (Hofland & Johansson, 1982)
Comparison at word form or annotation level
Information retrieval and extraction applications
Two main types
Type 1:
– sample corpus v. larger ‘standard’ normative corpus
Type 2:
– two (roughly) equal-sized corpora
Main issues of concern
representativeness (balance)
homogeneity within the corpora
comparability of the corpora
reliability of statistical tests
Statistics
Chi-squared unreliable
Mann-Whitney (Kilgarriff 1996)
Log-likelihood (Dunning 1993)
Method
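O1 = a, O2 = b (observed frequencies of the item in corpus 1 and corpus 2)
N1 = c, N2 = d (total sizes of corpus 1 and corpus 2)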
E1 = c*(a+b) / (c+d)
E2 = d*(a+b) / (c+d)
LL = 2*((a*log (a/E1)) + (b*log (b/E2)))
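In general form, for corpora indexed by i:

$$E_i = \frac{N_i \sum_i O_i}{\sum_i N_i}$$

$$-2 \ln \lambda = 2 \sum_i O_i \ln\left(\frac{O_i}{E_i}\right)$$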
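The calculation on the Method slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `log_likelihood` is hypothetical, `log` is taken to be the natural logarithm, and a term with zero observed frequency is assumed to contribute nothing to the sum (a common convention that the slide does not state).

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item.

    a, b: observed frequencies of the item in corpus 1 and corpus 2 (O1, O2)
    c, d: total sizes of corpus 1 and corpus 2 (N1, N2)
    """
    # Expected frequencies if the item were spread evenly across both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        # Assumption: a zero count contributes 0 (avoids log(0)).
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total
```

An item with the same relative frequency in both corpora scores 0, and the score grows as the relative frequencies diverge. The value does not show the direction of the difference, so the relative frequencies a/c and b/d would need to be compared separately.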
Application (REVERE)
Systems engineering application
User interview transcripts, standards documents, user manuals
POS tagged with CLAWS
Semantic analysis
Wmatrix retrieval tool
– Frequency profiling and KWIC
Air traffic control
Ethnographic studies at ATC centre
– Verbatim transcripts of observations and interviews with controllers
– Unstructured reports
– 103 pages
Key semantic categories
Log-likelihood | Semantic tag | Word sense (and examples from the text)
3366 S7.1 power, organising (‘controller’, ‘chief’)
2578 M5 flying (‘plane’, ‘flight’, ‘airport’)
988 O2 general objects (‘strip’, ‘holder’, ‘rack’)
643 O3 electrical equipment (‘radar’, ‘blip’)
535 Y1 science and technology (‘PH’)
449 W3 geographical terms (‘Pole Hill’, ‘Dish Sea’)
432 Q1.2 paper documents and writing (‘writing’, ‘written’, ‘notes’)
372 N3.7 measurement (‘length’, ‘height’, ‘distance’, ‘levels’, ‘1000ft’)
318 L1 life and living things (‘live’)
310 A10 indicating actions (‘pointing’, ‘indicating’, ‘display’)
306 X4.2 mental objects (‘systems’, ‘approach’, ‘mode’, ‘tactical’, ‘procedure’)
290 A4.1 kinds, groups (‘sector’, ‘sectors’)
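A ranked list like the one above could be produced by scoring every item in two frequency lists and sorting by log-likelihood. The sketch below is only an illustration under stated assumptions: `frequency_profile` and `log_likelihood` are hypothetical names, the counts in `sample` and `reference` are invented toy values (not the ATC data), and the items could equally be word forms, POS tags or semantic tags.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """LL for an item seen a times in corpus 1 (size c) and b times in corpus 2 (size d)."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # assumption: zero counts contribute 0
            total += observed * math.log(observed / expected)
    return 2 * total


def frequency_profile(freq1, freq2):
    """Return (LL, item, count1, count2, overused_in_1) rows, highest LL first."""
    c = sum(freq1.values())
    d = sum(freq2.values())
    rows = []
    for item in set(freq1) | set(freq2):
        a = freq1.get(item, 0)
        b = freq2.get(item, 0)
        # LL has no sign, so record the direction of the difference separately.
        overused = a / c > b / d
        rows.append((log_likelihood(a, b, c, d), item, a, b, overused))
    # Sort by descending LL; ties are broken alphabetically for stable output.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


# Invented toy counts, for illustration only.
sample = Counter({"controller": 30, "strip": 20, "the": 50})
reference = Counter({"controller": 1, "strip": 2, "the": 497})

for ll, item, a, b, overused in frequency_profile(sample, reference):
    print(f"{ll:8.2f}  {item:<12} {a:>4} {b:>4}  {'+' if overused else '-'}")
```

Items at the top of the list are the candidates for closer inspection, for example through concordance lines, before any conclusion is drawn from them.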
Conclusions
Method of comparing corpora using frequency profiling
Discovery of key items
Human verification of hypotheses
Applications in study of social differentiation in the use of English vocabulary, profiling of learner English and IE in SE domain