Dataset Profiling Anne De Roeck Udo Kruschwitz, Nick Web, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite Centre for Research in Computing The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.


Dataset Profiling

Anne De Roeck

Udo Kruschwitz, Nick Web, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite

Centre for Research in Computing, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK.


Fact or Factoid: Hyperlinks

• Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Picard 1999, Hawking et al 1999).

• Hyperlinks do significantly improve recall and precision in narrow domains and Intranets (Chen et al 1999, Kruschwitz 2001).


Fact or Factoid: Stemming

• Stemming does not improve effectiveness of retrieval (Harman 1991)

• Stemming improves performance for morphologically complex languages (Popovič and Willett 1992)

• Stemming improves performance on short documents (Krovetz 1993)


Fact or Factoid: Long or Short.

• Stemming improves performance on short documents (Krovetz 1993)

• Short keyword based queries behave differently from long structured queries (Fujii and Croft 1999)

• Keyword based retrieval works better on long texts (Jurafsky and Martin 2000)


Assumption

• Successful (statistical?) techniques can be successfully ported to other languages.
  – Western European languages
  – Japanese, Chinese, Malay, …

• WordSmith: Effective use requires 5M word corpus (Garside 2000)


Type to Token ratio

Text length   Bengali (CIIL)   English (Brown)   Arabic (Al-Hayat)
100                    1.204             1.449               1.190
1600                   2.288             2.576               1.774
6400                   3.309             4.702               2.357
16000                  4.663             5.928               2.771
20000                  5.209             6.341               2.875
1000000               10.811            20.408               8.252


Fact

• Performance of IR and NLP techniques depends on the characteristics of the dataset.

• Performance will vary with task, technique and language


Cargo Cult Science?

• Richard Feynman (1974)

“It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”


“Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.”

“In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”


Cargo Cult Science?

• The role of data in the outcome of experiments must be clarified

• Why?

• How?


Why Profile Datasets?

• Methodological: Replicability
  – Barbu and Mitkov (2001): anaphora resolution
  – Donaway et al (2000): automatic summarisation
• Epistemological: Theory induction
  – What is the relationship between dataset properties and application performance?
• Practical: Application
  – What is the relationship between two datasets?
  – What is this dataset (language?) like?


Why Profile Datasets?

• And by the way, the others think it is vital (Machine Learning, Data Mining, Pattern Matching, etc.).

• And so did we! (or do we?)


Profiling: An Abandoned Agenda?

• Sparck-Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval. Vol 9

• Sparck-Jones (1975) “A performance yardstick for test collections.” Journal of Documentation. 31:4


Profiling: An Abandoned Agenda

• Term weighting formula tailored to query
  – Salton 1972
• Stop word identification relative to collection/query
  – Wilbur & Sirotkin 1992; Yang & Wilbur 1996
• Effect of collection homogeneity on language model quality
  – Rose & Haddock 1997


What has changed?

• Proliferation of (test) collections
• More data per collection
• Increased application need

• Better (ways of computing) measures?


What has changed?

• Sparck-Jones (1973)
  – Is a collection useably classifiable?
      – Number of query terms which can be used for matching.
  – Is a collection usefully classifiable?
      – Number of useful, linked terms in document or collection
  – Is a collection classifiable?
      – Size of vocabulary and rate of incidence


Profiling Measures

• Requirements: measures should be
  – relevant to NLP techniques
  – fine grained
  – cheap to implement(!)
• Simple starting point:
  – Vital Statistics


Description

Data set   Contents of the documents
AP         Copyrighted AP Newswire stories from 1989.
DOE        Short abstracts from the Department of Energy.
FR         Issues of the Federal Register (1989), reporting source actions by government agencies.
PAT        U.S. Patent Documents for the years 1983-1991.
SJM        Copyrighted stories from the San Jose Mercury News (1991).
WSJ        Copyrighted stories from the Wall Street Journal (1987-1989).
ZF         Information from the Computer Select disks for 1989/1990, copyrighted by Ziff-Davis Publishing Co.
OU         The Open University intranet and extranet web-pages.


Vital Stats

Data Set   No of Docs   Corpus Length   Av. Doc Length   No of Terms   Av. Terms per Doc   Shortest Doc   Longest Doc
AP            242,918     114,438,101            471.1       347,966              238.25              9         2,944
DOE           226,086      26,882,774            119.0       179,310               72.90              1           373
FR             45,820      62,805,175          1,370.7       157,313              292.65              2       387,476
PAT             6,711      32,151,785         4,790.91       146,943              653.05             73        74,964
SJM            90,257      39,546,073           438.15       178,571              223.60             21        10,393
WSJ            98,732      41,560,108           420.94       159,726              204.26              7         7,992
ZF            293,121     115,956,732           395.59       295,326              168.42             19        75,030
OU             53,681      39,807,404           744.36       304,468              219.87              1        15,430
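The figures above are cheap to compute. A minimal sketch, assuming each document is already a list of tokens and that "terms per doc" means distinct terms in a document (the slides do not say); `vital_stats` is an illustrative name:

```python
def vital_stats(docs):
    """Summary counts for a corpus given as a list of token lists."""
    lengths = [len(doc) for doc in docs]
    vocabulary = {token for doc in docs for token in doc}
    return {
        "docs": len(docs),
        "corpus_length": sum(lengths),
        "avg_doc_length": sum(lengths) / len(docs),
        "terms": len(vocabulary),
        # Assumption: average number of distinct terms per document.
        "avg_terms_per_doc": sum(len(set(doc)) for doc in docs) / len(docs),
        "shortest_doc": min(lengths),
        "longest_doc": max(lengths),
    }


example = vital_stats([["a", "b", "a"], ["c"]])
```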


Profiling Measures

• Requirements: measures should be
  – relevant to NLP techniques
  – fine grained
  – cheap to implement(!)
• Simple starting point:
  – Vital Statistics
  – Zipf (sparseness; idiosyncrasy)


Zipf Curve - Bengali CIIL corpus
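A Zipf curve plots term frequency against frequency rank on log-log axes. A minimal sketch of the numbers behind such a curve, assuming a pre-tokenised corpus; `zipf_points` and `hapax_share` are illustrative names, and using the share of once-only terms as a sparseness indicator is an assumption rather than something the slides specify:

```python
import math
from collections import Counter


def zipf_points(tokens):
    """Return (log10 rank, log10 frequency) pairs, most frequent term first."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return [(math.log10(rank), math.log10(freq))
            for rank, freq in enumerate(frequencies, start=1)]


def hapax_share(tokens):
    """Fraction of distinct terms that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


sample = "a a a a b b c".split()
points = zipf_points(sample)
```

Plotting `points` for a real corpus and comparing the slope across datasets is one way to read the curve.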


Profiling Measures

• Requirements: measures should be
  – relevant to NLP techniques
  – fine grained
  – cheap to implement(!)
• Simple starting point:
  – Vital Statistics
  – Zipf (sparseness; idiosyncrasy)
  – Type to token ratio (sparseness, specialisation)


Type to Token Ratios

Length of text        AP       DOE        FR       PAT       SJM       WSJ        ZF   TIPSTER overall
100                1.333     1.515     1.492     1.315     1.428     1.282      1.47             1.405
200                1.626     1.562     1.666     1.538     1.612      1.55      1.68             1.605
400                1.877     1.762     2.051     2.259     1.869     1.886     1.941             1.949
800                2.144     2.067     2.572     3.065     2.035     2.072     2.305             2.323
1600               2.797     2.315     3.047     4.266     2.476     2.584     2.758             2.892
3200               3.062     2.824     3.841     5.169     3.013     3.225     3.285             3.488
6400               3.561     3.575     5.437     6.009     3.557      3.83     4.238             4.315
16000              4.563     4.737     8.583     9.744     4.153     4.566     5.289             5.948
20000              4.972     5.196     9.199    11.031     4.463     4.988     5.383             6.462
100000              9.14    10.451    15.453    20.764     8.463     9.413    12.017            12.243
1000000           30.573    30.157    50.571    62.637    26.377    30.909    38.105            38.476
10000000         106.845    94.778   144.866   134.017   102.149   116.183   121.798           117.234


Type to Token Ratios

Length of text   TIPSTER overall        OU     Bengali    Arabic   Brown Corpus
100                        1.405      1.47    1.204819      1.19          1.449
200                        1.605     1.694    1.388889     1.342          1.613
400                        1.949     2.247     1.67364     1.423          2.424
800                        2.323     2.622    1.864802     1.578          2.439
1600                       2.892     3.053    2.288984     1.774          2.576
3200                       3.488     3.673    2.775369     2.082          3.674
6400                       4.315     4.312    3.309204     2.357          4.702
16000                      5.948      6.24    4.663363     2.771          5.928
20000                      6.462     6.944     5.20969     2.875          6.341
100000                    12.243     12.41    6.074628         -              -
1000000                   38.476    36.127    10.81093     8.252         20.408
10000000                 117.234    82.064           -         -              -

(- = no value given)
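The values in these tables are all above 1 and grow with text length, which suggests they are tokens divided by types, measured on a text prefix of the stated length. A minimal sketch under that reading, assuming whitespace tokenisation; `tokens_per_type` is an illustrative name:

```python
def tokens_per_type(tokens, length):
    """Tokens divided by distinct types in the first `length` tokens."""
    sample = tokens[:length]
    if len(sample) < length:
        raise ValueError("text is shorter than the requested length")
    return len(sample) / len(set(sample))


text = "the cat sat on the mat and the dog sat on the log".split()
short_ratio = tokens_per_type(text, 5)    # 5 tokens, 4 types
full_ratio = tokens_per_type(text, 13)    # 13 tokens, 8 types
```

Because the ratio depends on sample length, comparisons between corpora are only meaningful at the same length, which is why the tables report it row by row.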


Profiling Measures

• Requirements: measures should be
  – relevant to NLP techniques
  – fine grained
  – cheap to implement(!)
• Simple starting point:
  – Vital Statistics
  – Zipf (sparseness; idiosyncrasy)
  – Type to token ratio (sparseness, specialisation)
  – Manual sampling (quality; content)


Profiling by Measuring Heterogeneity

• Homogeneity Assumption
  – Bag of Words
  – Function word distribution
  – Content word distribution
• Measure of Heterogeneity as dataset profile
  – Measure distance between corpora
  – Identify genre


Heterogeneity Measures

• χ² (Kilgarriff 1997; Rose & Haddock 1997)
• G² (Rose & Haddock 1997; Rayson & Garside 2000)
• Correlation, Mann-Whitney (Kilgarriff 1996)
• Log-likelihood (Rayson & Garside 2000)
• Spearman’s S (Rose & Haddock 1997)
• Kullback-Leibler divergence (Cavaglia 2002)
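As an illustration of one measure in this list, the log-likelihood statistic for a single term compares its observed count in each corpus with the count expected if both corpora shared one relative frequency. A minimal sketch of the standard two-corpus form; the function name is illustrative:

```python
import math


def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 = 2 * sum(O * ln(O / E)) for one term in two corpora."""
    total = count_a + count_b
    g2 = 0.0
    for observed, size in ((count_a, size_a), (count_b, size_b)):
        expected = size * total / (size_a + size_b)
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


same_rate = log_likelihood(10, 1000, 20, 2000)   # identical relative frequency
skewed = log_likelihood(30, 1000, 10, 1000)      # term favours corpus A
```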


Kilgarriff’s Methodology

• Divide corpus using 5000 word chunks in random halves
• Frequency list for each half
• Calculate χ² for term frequency distribution differences between halves
• Normalise for corpus length
• Iterate over successive random halves
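The steps above can be sketched as follows. This is an illustration, not the published implementation: the slides only say "normalise for corpus length", so dividing χ² by the number of terms compared is an assumption, chosen because it gives values near 1 for homogeneous data. The function names and defaults are hypothetical.

```python
import random
from collections import Counter


def chi2_between_halves(half_a, half_b, n_terms):
    """Chi-square over the n_terms most frequent terms, divided by n_terms."""
    count_a, count_b = Counter(half_a), Counter(half_b)
    size_a, size_b = len(half_a), len(half_b)
    top = [term for term, _ in (count_a + count_b).most_common(n_terms)]
    chi2 = 0.0
    for term in top:
        total = count_a[term] + count_b[term]
        for observed, size in ((count_a[term], size_a), (count_b[term], size_b)):
            expected = total * size / (size_a + size_b)
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / len(top)  # assumed normalisation, see lead-in


def heterogeneity(tokens, chunk_size=5000, n_terms=500, iterations=10, seed=0):
    """Average the statistic over several random splits of chunks into halves."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to form two halves")
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        rng.shuffle(chunks)
        mid = len(chunks) // 2
        half_a = [t for chunk in chunks[:mid] for t in chunk]
        half_b = [t for chunk in chunks[mid:] for t in chunk]
        scores.append(chi2_between_halves(half_a, half_b, n_terms))
    return sum(scores) / len(scores)
```

A perfectly regular text scores 0, while two halves with disjoint vocabularies score high.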


Kilgarriff’s Findings

• Registers values of χ² statistic
• High value indicates high heterogeneity
• Finds high heterogeneity in all texts


Defeating the Homogeneity Assumption

• Assume word distribution is homogeneous (random)
• Kilgarriff methodology
• Explore chunk sizes
  – Chunk size 1 -> homogeneous (random)
  – Chunk size 5000 -> heterogeneous (Kilgarriff 1997)
• χ² test (statistic + p-value)
  – Defeat assumption with statistical significance
• Focus on frequent terms (!)


Homogeneity detection at a level of statistical significance

• p-value: evidence for/against the hypothesis
  – < 0.1: weak evidence against
  – < 0.05: significant (moderate evidence against the hypothesis)
  – < 0.01: strong evidence against
  – < 0.001: very strong evidence against
• Indication of statistically significant non-homogeneity
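The thresholds above map directly onto a small lookup. A minimal sketch; the function name and the "none" label for p ≥ 0.1 are illustrative additions:

```python
def evidence_against_homogeneity(p_value):
    """Translate a p-value into the evidence labels used on the slide."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.001:
        return "very strong"
    if p_value < 0.01:
        return "strong"
    if p_value < 0.05:
        return "moderate (significant)"
    if p_value < 0.1:
        return "weak"
    return "none"  # label for p >= 0.1 is an assumption


labels = [evidence_against_homogeneity(p) for p in (0.0003, 0.0259, 0.087, 0.1216)]
```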


Frequent Term Distribution

• Lots of them
• Reputedly “noise-like” (random?)
• Present in most datasets (comparison)
• Cheap to model


Dividing a Corpus

• docDiv: place documents in random halves
  – term distribution across documents
• halfdocDiv: place half documents in random halves
  – term distribution within the same document
• chunkDiv: place chunks (between 1 and 5000 words) in random halves
  – term distribution between text chunks (genre?)
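The three splitting schemes can be sketched as below. The reading of halfdocDiv (cut each document in the middle and send one half to each side at random) is an assumption, as are the function names; each document is taken to be a list of tokens.

```python
import random


def _deal(units, rng):
    """Shuffle units and split them into two groups of near-equal count."""
    units = list(units)
    rng.shuffle(units)
    mid = len(units) // 2
    return units[:mid], units[mid:]


def _flatten(units):
    return [token for unit in units for token in unit]


def doc_div(docs, rng):
    """Whole documents go to one random half or the other."""
    first, second = _deal(docs, rng)
    return _flatten(first), _flatten(second)


def half_doc_div(docs, rng):
    """Each document is cut in two; its halves land on opposite sides."""
    half_a, half_b = [], []
    for doc in docs:
        mid = len(doc) // 2
        parts = [doc[:mid], doc[mid:]]
        rng.shuffle(parts)
        half_a.extend(parts[0])
        half_b.extend(parts[1])
    return half_a, half_b


def chunk_div(docs, chunk_size, rng):
    """Fixed-size chunks, ignoring document boundaries, go to random halves."""
    tokens = _flatten(docs)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    first, second = _deal(chunks, rng)
    return _flatten(first), _flatten(second)


corpus = [list("abcd"), list("efgh"), list("ij")]
rng = random.Random(0)
splits = [doc_div(corpus, rng), half_doc_div(corpus, rng), chunk_div(corpus, 3, rng)]
```

Every scheme keeps all tokens; only the unit that is shuffled changes, which is what lets the test separate between-document from within-document variation.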


Results DocDiv

Number of Terms (N) in columns; each dataset has two rows: the statistic (stat) and its p-value (p).

Dataset         10       20       50      100      500     1000     7000    20000
AP   stat    2.107    1.576    2.583    2.290    2.732    2.601    2.441    2.435
     p      0.1216   0.2139   0.0003        0        0        0        0        0
DOE  stat    1.172    1.450    1.755    1.983    1.838    1.786    1.795    1.872
     p       0.463    0.160   0.0259        0        0        0        0        0
FR   stat   54.524   41.715   72.093   66.787   51.387   61.266   39.043   23.534
     p           0        0        0        0        0        0        0        0
PAT  stat   21.074   29.315   62.494   55.353   50.265   44.824   32.056   22.468
     p           0        0        0        0        0        0        0        0
SJM  stat    3.595    2.768    3.231    2.976    3.012    2.959    2.560    2.511
     p      0.1193   0.0077        0        0        0        0        0        0
WSJ  stat    2.358    2.663    2.364    2.335    2.623    2.749    2.831    2.917
     p       0.178   0.0019        0        0        0        0        0        0
ZF   stat   11.947    8.133    6.907    6.576    6.122    5.634    4.595    4.576
     p           0        0        0        0        0        0        0        0
OU   stat  232.913  158.520   94.749   67.293   32.663   25.181   14.224    8.297
     p           0        0        0        0        0        0        0        0


Results HalfDocDiv

Number of Terms (N) in columns; each dataset has two rows: the statistic (stat) and its p-value (p).

Dataset         10       20       50      100      500     1000     7000    20000
AP   stat    1.774    1.473    1.369    1.271    1.171    1.187    1.147    1.136
     p       0.087    0.117    0.057    0.066    0.021   0.0001        0        0
DOE  stat    0.728    0.931    1.054    1.043    1.061    1.027    1.014     1.01
     p       0.655    0.533    0.438    0.372    0.195    0.285    0.271    0.182
FR   stat    7.905    9.549   11.627   11.642    8.847    8.166    6.543    5.336
     p       0.001        0        0        0        0        0        0        0
PAT  stat   20.360   15.568   16.017   11.886    7.694    6.243    5.102    4.611
     p           0        0        0        0        0        0        0        0
SJM  stat    1.323    1.569    1.320    1.469    1.332    1.297    1.240    1.242
     p      0.3860   0.3919   0.4436   0.1069        0        0        0        0
WSJ  stat    1.563    1.618    1.342    1.298    1.236    1.210    1.178    1.150
     p       0.279    0.248    0.203    0.260    0.017   0.0007        0        0
ZF   stat    1.948    1.858    1.709    1.609    1.559    1.598    1.536    1.556
     p      0.1288    0.116   0.0283   0.0240        0        0        0        0
OU   stat    7.721    6.103    8.091    8.216    6.366    5.502    4.223    3.087
     p       0.033   0.0025        0        0        0        0        0        0


Results ChunkDiv (5)

Number of Terms (N) in columns; each dataset has two rows: the statistic (stat) and its p-value (p).

Dataset         10       20       50      100      500     1000     7000    20000
AP   stat    0.628    0.836    0.871    0.984    0.990    1.007    1.018    1.012
     p      0.7516   0.6375    0.677    0.484    0.535    0.523   0.1595    0.179
DOE  stat    1.141    1.225    1.151    1.050    1.038    1.002    1.008    1.008
     p      0.3946   0.3461   0.2505   0.3540   0.4229    0.462    0.431   0.3667
FR   stat    0.754    0.961    0.967    1.033    1.016    1.025    1.022    1.013
     p       0.650    0.504     0.54    0.405   0.4174    0.335   0.2281    0.211
PAT  stat    1.284    1.457    1.255    1.153    1.051    1.007    1.008    1.020
     p      0.2451    0.091   0.2273   0.1862    0.226    0.429    0.330    0.077
SJM  stat    1.204    1.175    1.226    1.127    0.979    1.004    1.012    1.010
     p       0.429    0.375    0.293    0.268    0.608    0.454    0.262    0.181
WSJ  stat    0.834    1.008    0.778    0.924    0.957    0.984    1.000     1.01
     p       0.573    0.492    0.822    0.679    0.682   0.6202    0.498    0.252
ZF   stat    0.861    0.791    0.939    0.913    0.994    1.012    1.007    1.016
     p      0.5781    0.704    0.636    0.703    0.525    0.394    0.393   0.1258
OU   stat    1.242    1.257    1.165    1.023    1.081    1.054    1.042    1.033
     p      0.3395    0.271    0.234    0.424    0.118    0.142    0.034    0.005


Results: ChunkDiv (100)

Number of Terms (N) in columns; each dataset has two rows: the statistic (stat) and its p-value (p).

Dataset         10       20       50      100      500     1000     7000    20000
AP   stat    0.824    1.105    1.412    1.607    1.471    1.372   1.3004   1.3026
     p      0.6023   0.3560   0.0735   0.0019        0        0        0        0
DOE  stat    1.102    1.864    1.646    1.511    1.354    1.414   1.4013    1.424
     p      0.3937   0.0280   0.0231   0.0317   0.0299        0        0        0
FR   stat    1.006    1.441    1.608    1.803    1.924    1.834    1.782    1.746
     p      0.5071    0.229    0.076    0.025        0        0        0        0
PAT  stat    4.181    3.051    2.682    2.420    2.252    2.104    1.977    1.876
     p      0.0232   0.0025   0.0007        0        0        0        0        0
SJM  stat    0.995    1.117    1.146    1.180    1.410    1.402    1.317    1.291
     p      0.4720   0.3851   0.3203   0.2463        0        0        0        0
WSJ  stat    1.112    1.213    1.198    1.230    1.196    1.283   1.2902    1.319
     p      0.3741    0.324   0.2426   0.0937   0.0383        0        0        0
ZF   stat    1.576    1.283    1.709    2.190     1.41    1.673    1.315    1.884
     p      0.4152    0.366    0.011        0        0        0        0        0
OU   stat    6.231    5.657    4.870    4.278    3.310    2.733    2.261    1.865
     p      0.0004        0        0        0        0        0        0        0


Results

• docDiv:
  – heterogeneity across most documents, except:
      • AP and DOE (20 terms or fewer)
• halfdocDiv:
  – tests sensitive to certain document types
      • DOE very homogeneous
      • PAT and OU very heterogeneous
• chunkDiv:
  – chunk length vs. document boundary?
  – similar behaviour of WSJ and SJM


Pro

• Heterogeneity test is a reasonable profiling measure
  – sensitive to document types
      • e.g. different behaviour for halfdocDiv
  – cheap to implement
  – relation between measure and p-value
• Intranet data gives extreme results
  – How transferable is corpus training?


Drawbacks

• Frequency based
• Coarse grained
• Not homogeneous = bursty
• Bursty in what way?
• Useful for applications?


Profiling by Measuring Burstiness

• Pioneer’s agenda: Clumps!
  – Sparck-Jones & Needham 1964
• Models
  – Poisson
  – Two-Poisson (Church 2000)
  – K-mixtures (Katz 1996)
• Count words


Burstiness Model

• Model gaps (not term occurrence)

• Mixture of exponential distributions

• Between-burst (1/λ1, or λ1’)

• Within-burst (1/λ2, or λ2’)
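A minimal sketch of the two ingredients named above: the gaps between successive occurrences of a term, and a two-component exponential mixture over gap length. It assumes the two parameters are the mean gap lengths of the between-burst and within-burst components and that `p` weights the between-burst component; the function names are illustrative.

```python
import math


def gaps(tokens, term):
    """Distances, in tokens, between successive occurrences of `term`."""
    positions = [i for i, token in enumerate(tokens) if token == term]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def mixture_density(x, p, mean_between, mean_within):
    """Density at gap x of p * Exp(mean_between) + (1 - p) * Exp(mean_within)."""
    between = math.exp(-x / mean_between) / mean_between
    within = math.exp(-x / mean_within) / mean_within
    return p * between + (1 - p) * within


term_gaps = gaps("a b a a c a".split(), "a")
density_at_zero = mixture_density(0, 0.5, 2.0, 1.0)
```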


Burstiness Model

• First occurrence
• No occurrence: censoring


Burstiness Model

• Bayesian estimation
  – posterior ∝ prior × likelihood
  – choose uninformative prior
  – estimate posterior using Gibbs Sampling (MCMC)
  – WinBUGS software
  – 1000 iteration burn-in
  – further 5000 iterations for estimate
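To make the estimation step concrete, here is a hand-written Gibbs sampler for a two-exponential mixture over gap lengths. It is a simplified stand-in for the WinBUGS model, not a reproduction of it: it ignores first occurrences and censoring, uses a uniform prior on the mixing weight and weak Gamma priors on the rates (both assumptions), and keeps the components apart by swapping labels whenever their order flips. All names are illustrative.

```python
import math
import random


def gibbs_two_exponentials(gaps, burn_in=1000, iterations=5000, seed=0):
    """Posterior means of (p, mean between-burst gap, mean within-burst gap)."""
    if not gaps:
        raise ValueError("need at least one gap")
    rng = random.Random(seed)
    shape, rate = 1.0, 0.1            # Gamma prior on both rates (assumption)
    total = float(sum(gaps))
    mean = total / len(gaps)
    p, rate_between, rate_within = 0.5, 0.5 / mean, 2.0 / mean
    kept = []
    for step in range(burn_in + iterations):
        # 1. Assign every gap to a component given the current parameters.
        n_between, sum_between = 0, 0.0
        for x in gaps:
            w_between = p * rate_between * math.exp(-rate_between * x)
            w_within = (1 - p) * rate_within * math.exp(-rate_within * x)
            if rng.random() * (w_between + w_within) < w_between:
                n_between += 1
                sum_between += x
        n_within = len(gaps) - n_between
        sum_within = total - sum_between
        # 2. Mixing weight from its Beta posterior (uniform prior).
        p = rng.betavariate(1 + n_between, 1 + n_within)
        # 3. Rates from their Gamma posteriors; gammavariate takes a scale.
        rate_between = rng.gammavariate(shape + n_between, 1 / (rate + sum_between))
        rate_within = rng.gammavariate(shape + n_within, 1 / (rate + sum_within))
        # Keep "between" as the component with the longer mean gap.
        if rate_between > rate_within:
            rate_between, rate_within = rate_within, rate_between
            p = 1 - p
        if step >= burn_in:
            kept.append((p, 1 / rate_between, 1 / rate_within))
    return tuple(sum(sample[i] for sample in kept) / len(kept) for i in range(3))
```

The defaults mirror the 1000-iteration burn-in and 5000 further iterations listed above; smaller values are enough for a quick check on synthetic gaps.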


Burstiness Model

• Word behaviours
  – Small λ1’, small λ2’: frequently occurring function word
  – Large λ1’, small λ2’: bursty content word
  – Small λ1’, large λ2’: frequent but well spaced function word
  – Large λ1’, large λ2’: infrequent scattered function word
  – p’: proportion of times term does not occur in a burst
  – 1-p’: proportion of times term appears in a burst


Very frequent function words


Less frequent function words


Style indicative terms


Content terms


Other aspects

• Coverage (narrow or broad)
• Lay-out and meta data
• Language
• Links and mark-up
• ….


Conclusions

• We need to talk more about the “elephant in the room” and know more about the datasets we use

• Dataset profiling can be a useful way of augmenting known results


Conclusions

• Profiles have to be relative to task
• Measures have to be sophisticated enough to be informative (starting only now)
• Finding effective profiling measures is a substantial, difficult, essential research agenda