

Using the web as a source of linguistic data: experiences, problems and perspectives

Marco Baroni

SSLMIT, University of Bologna

ICST/CNR Roma, April 2005


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


The Web as Corpus

- Computational/corpus linguists, lexicographers, ontologists, and language technologists are constantly hungry for data.

- The web is a huge database of documents, mostly text.

- Kilgarriff: the web is the most exciting thing that happened to human beings in the last 20 years or so, and it’s all about linguistic communication – we linguists are in a good position to lead the study of it!!!


The Web as Corpus (cont.)

- Kilgarriff and Grefenstette, Introduction to the Special Issue on the Web as Corpus, Computational Linguistics 2003.

  Language    Estimated web size (words)
  English     76,598,718,000
  German      7,035,850,000
  Italian     1,845,026,000
  Finnish     326,379,000
  Esperanto   57,154,000
  Latin       55,943,000
  Basque      55,340,000
  Albanian    10,332,000

- (Obsolete, conservative estimates.)


Some General Problems

- The web is not a balanced corpus.
- More worryingly: if you use a search engine, you have no control over the data.
- Constantly changing.
- Many languages, and a lot of non-native English.
- Python.
- Google frequency of “colorless green ideas sleep furiously”: 13,000.
- Desperately seeking Blaberus Giganteus.
- ...


But still... more data is better data! (Mercer, quoted by Church)

- Banko and Brill 2001 HLT paper.
- Confusion set disambiguation task.
- Choose the correct word in context from a set of words it is typically confused with: affect/effect, principal/principle (a naive learner for this task is sketched below).
- Even the most naive learning algorithm trained on a 10M-word training set outperforms any learner trained on a 1M-word training set.
- With a 1-billion-word training set, learners have not yet reached their performance asymptote.
- (Learn the language function by a simple algorithm that has access to the full extension of the function.)
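To make the setup concrete, here is a minimal Python sketch of one such naive learner (not Banko and Brill's actual systems; all names are illustrative): count how often each confusion-set member was seen next to each neighboring word in training, then pick the member best supported by the neighbors observed at test time.

```python
from collections import Counter

def train(sentences, confusion_set):
    """Count (neighbor word, confusable word) pairs over tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w in confusion_set:
                for j in (i - 1, i + 1):          # immediate neighbors only
                    if 0 <= j < len(sent):
                        counts[(sent[j], w)] += 1
    return counts

def disambiguate(left, right, confusion_set, counts):
    """Pick the confusion-set member most often seen with these two neighbors."""
    return max(confusion_set,
               key=lambda w: counts[(left, w)] + counts[(right, w)])

# e.g., after training on a large corpus:
# disambiguate("the", "of", {"affect", "effect"}, counts)  -> "effect", one hopes
```

The point of the slide is that even a learner this crude keeps improving as the training corpus grows.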


More web-data is better data!

- Keller and Lapata 2003, Computational Linguistics.
- Google- and AltaVista-based frequencies of A N, N N and V N bigrams:
  - correlate with BNC and NANTC frequencies (see the sketch after this list);
  - correlate with WordNet-class-based smoothed frequencies;
  - correlate with human plausibility judgments more strongly than corpus-based frequencies do (smoothed or not smoothed).
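As a toy version of the kind of check reported there, one can correlate log-transformed web hit counts with log-transformed corpus counts for the same bigrams. This is only a sketch of one plausible instantiation, not Keller and Lapata's actual design:

```python
import math
from scipy.stats import pearsonr

def log_freq_correlation(web_counts, corpus_counts):
    """Pearson correlation of log counts for the same bigrams
    (add-one to avoid log(0) for bigrams unseen in the corpus)."""
    lw = [math.log(c + 1) for c in web_counts]
    lc = [math.log(c + 1) for c in corpus_counts]
    r, _p = pearsonr(lw, lc)
    return r

# e.g. log_freq_correlation([13000, 240, 7], [310, 12, 0])
```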


Approaches to Web as Corpus

- Collect (frequency) data directly from commercial search engines (e.g. Turney 2001, many many others).
- Linguists’ friendly interfaces to commercial search engines: WebCorp, KwicFinder, LSE (Kehoe and Renouf 2002, Fletcher 2002, Resnik and Elkiss 2003).
- Small(-ish), focused crawls of the web to find and retrieve relevant pages (e.g. Ghani et al. 2001, Baroni and Bernardini 2004, Sharoff submitted).
- WaCky!


Web-based Mutual Information



Collecting frequency data from search engines

- Probably the most popular method (Keller and Lapata 2003, Turney 2001, many others).
- A rough approximation to frequency, but:
  - empirically successful;
  - easy: the engine does most of the hard work.
- Web-based mutual information: a typical example of research using search-engine-based frequency data.


Web-based Mutual Information (WMI) (Turney 2001)

- (Pointwise) mutual information:

  MI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = \log_2 \frac{N \cdot C(w_1, w_2)}{C(w_1)\,C(w_2)}

- WMI: compute the mutual information of word pairs using frequency/co-occurrence frequency data extracted from the web via the AltaVista search engine:

  WMI(w_1, w_2) = \log_2 \frac{N \cdot \mathrm{hits}(w_1\ \mathrm{NEAR}\ w_2)}{\mathrm{hits}(w_1)\,\mathrm{hits}(w_2)}
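In code, the WMI computation is just a couple of hit-count lookups. The sketch below assumes a hypothetical hits() helper standing in for whatever count source is available (AltaVista's NEAR operator is long gone, and no current engine exposes a comparable counts API), and treats the collection size N as a rough constant:

```python
import math

def hits(query):
    """Hypothetical helper: return the engine's estimated match count
    for `query`. Plug in whatever hit-count source you actually have."""
    raise NotImplementedError

def wmi(w1, w2, n=1e9, near=True):
    """Web-based pointwise MI in the spirit of Turney (2001).
    `n` is a rough guess at the size of the indexed collection;
    with near=False, fall back to a plain co-occurrence query."""
    joint = hits(f"{w1} NEAR {w2}" if near else f"{w1} {w2}")
    h1, h2 = hits(w1), hits(w2)
    if 0 in (joint, h1, h2):
        return float("-inf")       # MI undefined for zero counts
    return math.log2(n * joint / (h1 * h2))
```

The near flag anticipates the NEAR-less fallback discussed later in the talk.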


Web-based Mutual Information

- Semantic similarity as direct co-occurrence (vs. occurrence in similar contexts).
- The simplicity of the method is counterbalanced by the size of the database (the WWW).
- Very effective: Turney 2001, Lin et al. 2003, Turney and Littman 2003.
- Most researchers report that WMI outperforms more sophisticated methods based on smaller corpora.
- My own experience with WMI: Baroni and Bisi 2004, Baroni and Vegnaduzzo 2004.


WMI takes the TOEFL (Turney 2001)

- TOEFL synonym match task.
- Target: levied; candidates: imposed, believed, requested, correlated. (A sketch of the selection step follows below.)
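Given the wmi() sketch above, the TOEFL step is a one-liner: pick the candidate that maximizes WMI with the target. Again, this is only a sketch of Turney's setup:

```python
def best_synonym(target, candidates):
    """Choose the candidate with the highest web-based MI with the target."""
    return max(candidates, key=lambda c: wmi(target, c))

# best_synonym("levied", ["imposed", "believed", "requested", "correlated"])
# should return "imposed" if the web counts behave as reported.
```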


WMI takes the TOEFL (cont.)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%


WMI and synonym detection in terminology

- Baroni and Bisi 2004 applied the WMI method to a synonym mining task in a technical domain.
- A harder task:
  - technical terms are less frequent than general-language terms (potential data sparseness issues);
  - all terms in a domain tend to be semantically related, to some extent.


Materials

- Nautical terminology.
- Terms and relational information from the structured termbase of Bisi (2003).


Task

- Given a list of pairs in any order, rank them so that synonym pairs end up at the top of the list (a one-line ranking sketch follows below).
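With a pairwise score in hand, the ranking task reduces to a sort. A minimal sketch, reusing the hypothetical wmi() from above:

```python
def rank_pairs(pairs):
    """Sort candidate term pairs so that high-WMI (hopefully synonym) pairs come first."""
    return sorted(pairs, key=lambda p: wmi(p[0], p[1]), reverse=True)
```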


Task: example

- Input, in arbitrary order:
  - decks/cockpit
  - frames/ribs
  - bottom/hull
  - ...
  - frames/hull
- Target ranking (synonym pairs on top):
  - frames/ribs
  - bottom/hull
  - decks/cockpit
  - ...
  - frames/hull


Task: settings

- Synonym term pairs vs. random term pairs (Exp 1).
- Synonym term pairs vs. other “nymic” pairs (Exp 2).


Cosine Similarity

- Baseline for comparison.
- Intuition: words with similar patterns of co-occurrence are likely to be similar.
- Correlation of the vectors of co-occurrence frequencies of the targets with (almost) all words in the corpus:

  \cos(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i

  (the cosine reduces to the plain dot product when the vectors are length-normalized)
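A minimal NumPy sketch of the cosine measure over co-occurrence count vectors; the explicit normalization makes the dot-product formulation above hold for raw counts too:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two co-occurrence count vectors."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        return 0.0                 # no co-occurrence evidence for one word
    return float(np.dot(x, y) / (nx * ny))

# e.g. cosine(np.array([3, 0, 1]), np.array([2, 1, 0]))
```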


Cosine Similarity (cont.)

- Corpora:
  - 1.2M-word specialized corpus manually assembled by a terminologist;
  - 4.27M-word corpus constructed via random nautical-term queries to Google.
- Context windows:
  - 2 words to either side of the target;
  - 5 words to either side of the target.


Experiment 1: Data

- 24 synonym pairs (e.g., bottom/hull, frames/ribs, displacement/weight).
- 124 non-synonym pairs:
  - 100 random pairs of nautical terms;
  - 24 recombinations of terms in the synonym set.
- 29% of the random pairs were rated “strongly semantically related” (e.g., awning/stern board, install/hatch, keel/coated).


Experiment 1: Results
Percentage precision at various percentage recall levels (man corp = manually assembled corpus, web corp = web-derived corpus; 2w/5w = 2-/5-word context window):

  recall   WMI     Cosine        Cosine        Cosine        Cosine
                   man corp, 2w  man corp, 5w  web corp, 2w  web corp, 5w
   12.5    100.0   100.0          60.0          60.0          42.9
   25.0    100.0    75.0          60.0          46.2          46.2
   37.5     90.0    42.9          39.1          40.9          45.0
   50.0     92.3    17.9          19.4          26.7          25.5
   62.5     88.2    10.8          15.0          19.0          17.6
   75.0     36.7    12.7          12.7          12.7          13.4
   87.5     30.4    14.5          14.5          14.5          14.5
  100.0     16.2    16.2          16.2          16.2          16.2
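The evaluation in these tables can be reproduced from any ranked list with binary gold labels. Here is a sketch of one way to compute precision at fixed recall levels, assuming the labels are ordered best-score-first:

```python
def precision_at_recall(ranked_labels,
                        levels=(12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100)):
    """ranked_labels[i] is True iff the i-th ranked pair is a true synonym pair.
    Returns {recall level: precision (%) when that recall is first reached}."""
    total_pos = sum(ranked_labels)
    results, positives = {}, 0
    pending = list(levels)
    for rank, is_pos in enumerate(ranked_labels, start=1):
        positives += is_pos
        # record precision the first time each recall level is reached
        while pending and positives >= pending[0] / 100 * total_pos:
            results[pending[0]] = 100 * positives / rank
            pending.pop(0)
    return results

# e.g., for Experiment 1: gold labels for the 148 pairs (24 positives),
# ordered by descending WMI or cosine score.
```

With 24 synonym pairs, each 12.5% recall step corresponds to 3 more synonym pairs retrieved, which is why the tables use those levels.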


Experiment 2: Data

- Same 24 synonym pairs as above.
- 31 “nymic” pairs from the Bisi termbase added to the test set:
  - 19 cohyponym pairs (e.g., Bruce anchor/mushroom anchor);
  - 10 hypo/hypernym pairs (e.g., stern platform/sun deck);
  - 2 antonym pairs (e.g., ahead/astern).
- 31 randomly selected non-synonym pairs removed from the test set (same synonym-to-non-synonym pair ratio as above).


Experiment 2: Results
Percentage precision at various percentage recall levels (same column layout as Experiment 1):

  recall   WMI     Cosine        Cosine        Cosine        Cosine
                   man corp, 2w  man corp, 5w  web corp, 2w  web corp, 5w
   12.5     60.0    42.9          37.5          27.3          20.0
   25.0     33.3    46.2          46.2          28.6          27.3
   37.5     36.0    39.1          39.1          29.0          31.0
   50.0     40.0    19.7          21.1          23.1          22.6
   62.5     37.5    10.8          17.4          19.2          18.1
   75.0     26.5    12.7          12.7          12.7          14.1
   87.5     25.6    14.5          14.5          14.5          14.5
  100.0     16.2    16.2          16.2          16.2          16.2


Houston, we have a problem

- On 31 March 2004, AltaVista's parent company Yahoo! replaced AltaVista's engine with Yahoo!'s own engine.

- End of the NEAR operator.

- Change of underlying database.

- WMI without NEAR:

  $\mathrm{WMI}(w_1, w_2) = \log_2 \frac{N \cdot \mathit{hits}(w_1\ w_2)}{\mathit{hits}(w_1) \cdot \mathit{hits}(w_2)}$
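
(A minimal sketch, not from the talk, of computing this score; get_hits() is a hypothetical wrapper around a search engine API, and N is an assumed estimate of index size.)

    import math

    N = 4 * 10**9  # assumed rough number of indexed pages

    def get_hits(query):
        """Hypothetical wrapper returning the engine's hit count for a query."""
        raise NotImplementedError

    def wmi(w1, w2):
        """Web-based MI without NEAR: the pair is queried as an exact phrase."""
        joint = get_hits('"%s %s"' % (w1, w2))
        if joint == 0:
            return float("-inf")
        return math.log2(N * joint / (get_hits(w1) * get_hits(w2)))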


Experiment 1: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista    AltaVista    Google
          w/ NEAR      w/o NEAR
 12.5     100.0        100.0        100.0
 25.0     100.0        100.0         85.7
 37.5      90.0        100.0         81.8
 50.0      92.3         75.0         85.7
 62.5      88.2         62.5         60.0
 75.0      36.7         45.0         64.3
 87.5      30.4         34.4         45.6
100.0      16.2         19.3         17.3


Experiment 2: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista    AltaVista    Google
          w/ NEAR      w/o NEAR
 12.5      60.0         42.8         50.0
 25.0      33.3         50.0         37.5
 37.5      36.0         52.9         45.0
 50.0      40.0         38.7         40.0
 62.5      37.5         32.6         31.9
 75.0      26.5         28.6         34.0
 87.5      25.6         25.6         30.0
100.0      16.2         18.5         17.0


Pros and cons of search engine frequencies

- The main advantage: it's easy.

- The main problem: we depend on commercial search engines.

- Linguists' satisfaction is obviously not their priority.


A telling anecdote

(Talking to a new acquaintance who works at Google)

Me: So, do you guys have plans to introduce the NEAR operator?

The Google Acquaintance: You are a linguist, right? Only linguists ask about that sort of stuff...


Consequences

- Limited query options (not even diacritics and accents), limited research options.

- You must know the words you are looking for.

- No annotation; few, unreliable metadata.

- Automated querying constraints; over-querying strongly discouraged.

- We know very little about the data we get.

- No control over how search engines evolve.

- Brittleness!


Fletcher 2004 saying the same things

Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites’ quest for revenue can diminish the objectivity of their search results, and various “page ranking” algorithms may lead to results that are not representative of the Web as a whole. Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines’ services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.


Worrying data from the Google APIs
Pattern discovered by Luca Onnis

Query               APIs        Website      Ratio
pleasantly           369000       870000      0.42
awkwardly            124000       292000      0.42
silent              4610000     11000000      0.42
pleasantly silent       107          135      0.79
awkwardly silent        396          566      0.70
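
(What the ratios show: the API count is a near-constant ~0.42 fraction of the website count for single words, but a much larger fraction for the two bigrams, so at least one of the two count sources behaves inconsistently. A quick check with the table's own figures:)

    counts = {  # query: (API count, website count), as in the table above
        "pleasantly": (369000, 870000),
        "awkwardly": (124000, 292000),
        "silent": (4610000, 11000000),
        "pleasantly silent": (107, 135),
        "awkwardly silent": (396, 566),
    }
    for query, (api, web) in counts.items():
        print(f"{query}: {api / web:.2f}")  # single words ~0.42, bigrams 0.70-0.79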


A few more things to worry about

- Google inflating its counts (Véronis’s blog, 2005).

- Is the * operator still supported?


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


The “linguist’s friendly” interfaces

- WebCorp, KwicFinder, Linguist’s Search Engine.

- “Wrappers” around Google, AltaVista, etc.

- Nice interfaces, but they ultimately inherit all the problems of search engines, and perhaps add some more with their filters...

- E.g., the “spongi*” query in WebCorp (Stefan Evert).



Small corpora via search engine queries

Building special corpora with search engine queries

- By downloading text, more control over data.

- But less work, more targeted data than spidering your own corpus.

- Good for “special purposes” corpora:
  - “minority” languages (CorpusBuilder; Ghani, Jones, Mladenic, CIKM-2001);
  - specialized sub-languages (BootCaT).


The BootCaT tools

- Bootstrapping Corpora and Terms from the web.

- Perl scripts freely available from:
  http://sslmit.unibo.it/~baroni/bootcat.html

- Original motivation: fast construction of ad-hoc corpora and term lists for translation/interpreting tasks, terminography.


The BootCaT procedure

(Flow diagram, rendered as a list:)

1. Select initial terms.
2. Query Google for random term combinations.
3. Retrieve pages and format as text (corpus).
4. Extract new terms via corpus comparison; feed them back into step 2.
5. Extract multi-word terms using the corpus, uni-terms, distributional patterns and POS templates.


Terms and Term Combinations

- 5–20 terms typical of the domain.

- Selection: human or automated (e.g. via text/corpus comparison).

- Seed terms randomly combined into tuples to perform Google queries (see the sketch below):
  - Longer tuples: better precision;
  - Shorter tuples: better recall.
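
(A minimal sketch of the tuple-building step, not taken from the BootCaT scripts themselves; the nautical seed list is invented for illustration.)

    import random

    def build_tuples(seeds, tuple_size=3, n_tuples=10):
        """Randomly combine seed terms into query tuples: larger tuples
        favour precision, smaller ones favour recall."""
        return [tuple(random.sample(seeds, tuple_size)) for _ in range(n_tuples)]

    seeds = ["anchor", "mooring", "stern", "bilge", "halyard", "winch"]
    for t in build_tuples(seeds):
        print(" ".join(t))  # each line becomes one search engine query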


Corpus/Term Bootstrapping

- The bootstrap (see the sketch below):
  1. Retrieve corpus from web via Google tuple queries;
  2. Extract typical terms through statistical comparison with a reference corpus (using Mutual Information, Log-Likelihood Ratio, etc.);
  3. Use found terms as new seeds and build new random tuples;
  4. Go back to 1.

- Retrieved pages formatted as text (character set issues, non-text format issues; in Japanese: tokenization issues).

- Reference corpus: better if balanced, but any corpus on a different topic will usually do (though in Japanese the register of the corpora turns out to be crucial!).
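
(A sketch of the loop under simplifying assumptions: retrieve_corpus() is a hypothetical stub standing in for the query/download/cleanup steps, and term extraction uses a simple log odds ratio over word frequencies, one of the comparison statistics in the family mentioned above. None of this is the literal BootCaT Perl code.)

    import math
    from collections import Counter

    def retrieve_corpus(tuples):
        """Hypothetical stub: query the engine with each tuple, download
        the hit pages, strip HTML, and return the text as one string."""
        raise NotImplementedError

    def extract_terms(spec_text, ref_freqs, top_k=20):
        """Rank the words of the specialized corpus against a reference
        frequency list by log odds ratio; return the top_k candidates."""
        spec_freqs = Counter(spec_text.lower().split())
        n_spec = sum(spec_freqs.values())
        n_ref = sum(ref_freqs.values())
        def score(word, f_spec):
            f_ref = ref_freqs.get(word, 0)
            odds_spec = (f_spec + 0.5) / (n_spec - f_spec + 0.5)
            odds_ref = (f_ref + 0.5) / (n_ref - f_ref + 0.5)
            return math.log(odds_spec / odds_ref)
        scored = {w: score(w, f) for w, f in spec_freqs.items()}
        return sorted(scored, key=scored.get, reverse=True)[:top_k]

    def bootstrap(seeds, ref_freqs, build_tuples, iterations=2):
        """Steps 1-4 of the slide: retrieve, compare, reseed, repeat."""
        corpus = ""
        for _ in range(iterations):
            corpus += retrieve_corpus(build_tuples(seeds))
            seeds = extract_terms(corpus, ref_freqs)
        return corpus, seeds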


Example 1: Pseudo-seizures in English
Baroni and Bernardini 2004

- Seed terms: dissociative, epilepsy, interventions, posttraumatic, pseudoseizures, ptsd.

- Reference: Brown (1.1M words).

- Corpus comparison: via Log Odds Ratio.

- Two iterations.

- 1.4M word corpus constructed, 1800 unigram terms extracted.

- 20/30 randomly selected documents from the corpus rated as relevant and informative.


Example 2: Hotel terminology in Japanese
Baroni and Ueyama 2004

- 20 manually selected initial terms.

- 3.5M word reference corpus built with BootCaT using random elementary Japanese words as seeds.

- Corpus comparison: via MI and Log Likelihood Ratio.

- Three iterations.

- 1.3M word corpus constructed, 424 terms extracted.

- 76/90 randomly selected documents assigned the highest relevance/informativeness rating.

- 58.4% of terms rated very relevant, 81.7% rated at least somewhat relevant.


Applications

- Languages: English, Italian, Japanese, Spanish, German, French, Russian, Chinese, Danish.

- Domains: medical, legal, meteorology, food, nautical terminology, (e-)commerce...

- Uses: technical translation, interpreting tasks, resources for LSP teaching, populating ontologies, expanding a lexicon in systematic ways, general corpus construction (Sharoff submitted).


Ongoing and planned work

- Special queries.

- Better character set handling.

- Better pdf/doc conversion.

- Better integration with UCS and other tools.

- Multi-term extraction.

- Yahoo API?


Pros

- We still rely on commercial search engines, but less so.

- We only use the most basic query function, which is less likely to change.

- Language filtering and good relevance-ranking are crucial characteristics of successful search engines.

- We are less likely to bother the engine by over-querying, since with one query we can obtain MBs of data.

- We have full control over the data (e.g. frequency counts, parsing, manual URL filtering) because we download them.


Cons

- We still rely on commercial search engines:
  - What happens if Google discontinues the API service?
  - What happens if Google does something too smart or too commercial with the page ranks?

- Good for content-driven corpus building; problems with syntax/style/genre-based filtering.

- Good for building small, targeted corpora (but see Sharoff’s – and Ciaramita’s? – work).

- Not for exploiting the vastness of the web-as-corpus directly.

Thinking Big: The “real” Linguist’s Search Engine

Biting the bullet...

- Crawling, cleaning, annotating, managing and maintaining your own indexed version of the web.
- Obviously, the “ideal” solution.
- But obviously a lot of work!


Build your own search engine

- Crawling.
- Post-processing (HTML/boilerplate stripping, language recognition, duplicate detection, “connected prose” recognition...).
- Linguistic processing.
- Categorization, meta-data.
- Indexing.
- Interfaces (the full pipeline is sketched below).
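As a skeleton, the whole pipeline might look like the following Python sketch. Every stage function is a toy placeholder standing in for a real tool (e.g. Heritrix for crawling, a POS tagger for linguistic processing); only the order of the stages is the point.

    def crawl(seeds):
        """Fetch raw pages starting from seed URLs (stubbed here)."""
        return ["<html>page crawled from %s</html>" % u for u in seeds]

    def postprocess(pages):
        """Strip HTML/boilerplate, filter by language, drop duplicates
        (here reduced to naive tag removal)."""
        return [p.replace("<html>", "").replace("</html>", "").strip()
                for p in pages]

    def annotate(docs):
        """Linguistic processing; here just whitespace tokenization."""
        return [doc.split() for doc in docs]

    def index(tokenized):
        """Toy inverted index: token -> set of document ids."""
        inv = {}
        for doc_id, tokens in enumerate(tokenized):
            for tok in tokens:
                inv.setdefault(tok, set()).add(doc_id)
        return inv

    seeds = ["http://example.org/a", "http://example.org/b"]
    inv = index(annotate(postprocess(crawl(seeds))))
    print(sorted(inv)[:5])  # an interface layer would query this index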


The huge web-corpus of Clarke and collaborators

- Terabyte crawl of the web in 2001.
- From an initial seed set of 2,392 (English?) educational URLs.
- No duplicates, not too many pages from the same site.
- No language filtering (a simple filter of the omitted kind is sketched below).
- 53 billion words, 77 million documents.
- (For comparison: the BNC has 100 million words, so this is roughly 530 BNCs; Google indexes 8 billion documents.)
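For contrast with the “no language filtering” choice, here is a minimal sketch of the kind of filter a linguist might want: score each page by the share of tokens that are high-frequency English function words. The stopword list and threshold are illustrative assumptions, not anyone’s published recipe.

    # Crude language recognition: English pages should be full of
    # English function words; pages in other languages should not be.
    ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "a", "is",
                         "that", "it", "for", "was", "on", "with"}

    def looks_english(text, threshold=0.25):
        words = text.lower().split()
        if not words:
            return False
        hits = sum(w in ENGLISH_STOPWORDS for w in words)
        return hits / len(words) >= threshold

    print(looks_english("the cat is on the mat and it is happy"))  # True
    print(looks_english("il gatto è sul tappeto ed è felice"))     # False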


The TOEFL synonym match test, again

- Target: levied; candidates: imposed, believed, requested, correlated.


WMI takes the TOEFL again (Terra and Clarke 2003)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%
  - Terra & Clarke’s WMI: 81.25%
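To make the mutual-information approach concrete: score each candidate by its pointwise MI with the target and take the argmax. In this sketch the counts are invented toy numbers; in the web setting they would come from page-hit estimates or, as in Terra and Clarke’s work, from co-occurrence counts over a terabyte crawl.

    import math

    N = 1_000_000  # hypothetical total number of documents/windows

    def pmi(f_xy, f_x, f_y, n=N):
        """Pointwise MI: log P(x,y) / (P(x) * P(y)) from raw counts."""
        return math.log((f_xy * n) / (f_x * f_y))

    target_freq = 900  # f("levied") -- invented
    # (candidate, f(candidate), f(target, candidate)) -- invented
    candidates = [("imposed", 5000, 120), ("believed", 40000, 60),
                  ("requested", 30000, 45), ("correlated", 8000, 10)]

    best = max(candidates, key=lambda c: pmi(c[2], target_freq, c[1]))
    print(best[0])  # -> "imposed", the correct synonym for these counts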


Pros

- Independence from commercial search engines.
- Precious, multi-purpose resource.
- In principle, you can do what you want with it.


Cons

- A lot of work.
- Resource-intensive.
- In principle, you can do what you want with it...
- In practice, almost anything you want to do with a terabyte corpus will be extremely complicated.
- Forget about the “do it yourself with a Perl script” approach.


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


Enter WaCky!

- The Web-as-Corpus kool ynitiative.
- http://wacky.sslmit.unibo.it/
- The WaCky crowd: Marco, Massi, Silvia Bernardini, Stefan Evert, Bill Fletcher, Adam Kilgarriff...
- Yet Another Linguist’s Search Engine proposal (see also Kilgarriff 2003, Fletcher 2004).
- The WaCky philosophy: try to get something concrete out there very soon, so that others will feel motivated to contribute.
- Three 1-billion-word corpora (English, German, Italian) by spring 2006.
- Web interface(s) and an open-source toolkit.


Enter WaCky! (cont.)

- We must learn from IR and massive-dataset studies (e.g., near-duplicate detection, fast retrieval)...
- ...but there are important differences, for example:
  - We probably want all data, or perhaps random data, or even linguistically interesting data, not necessarily the most relevant data.
  - We care about (linguistic) form at least as much as about content.
- A new challenge in computational linguistics: the data are not given.


Enter WaCky! (cont.)

- Emphasis on:
  - Transparency;
  - Stability;
  - Pre-processing;
  - Categorization and annotation;
  - (Also) automated access;
  - Sophisticated query options.
- Not so important:
  - Access speed;
  - Updating;
  - Size;
  - Content-driven relevance.


The WaCkodules: Where We Are At

- Seeding the crawls: BNC/Google seeding experiments and Massi’s measures of randomness.
- Crawling: with Heritrix, the Internet Archive crawler.
- Post-processing: current focus on duplicate detection (sketched below).
- Linguistic annotation, meta-data: nothing yet.
- Indexing: Lucene vs. the newly open-source (!) IMS Corpus WorkBench.
- Interfaces: work by Stefan Evert.
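Since duplicate detection is the current post-processing focus, here is a minimal sketch of the textbook shingling approach from the IR literature: represent each document as its set of word n-grams and compare sets by Jaccard overlap. This illustrates the technique; it is not necessarily the exact method adopted for WaCky.

    def shingles(text, n=5):
        """Set of overlapping n-word sequences in a text."""
        words = text.lower().split()
        return {" ".join(words[i:i + n])
                for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        """Overlap between two shingle sets, in [0, 1]."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def near_duplicates(docs, threshold=0.8):
        """Index pairs of documents whose shingle overlap is high."""
        sets = [shingles(d) for d in docs]
        return [(i, j)
                for i in range(len(docs))
                for j in range(i + 1, len(docs))
                if jaccard(sets[i], sets[j]) >= threshold]

    docs = ["the cat sat on the mat and purred all day long",
            "the cat sat on the mat and purred all night long",
            "a completely unrelated text about crawling the web"]
    print(near_duplicates(docs, threshold=0.5))  # -> [(0, 1)]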


A few references

M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.
M. Baroni and S. Bisi. 2004. Using cooccurrence statistics and the web to discover synonyms in a specialized language. LREC 2004.
M. Banko and E. Brill. 2001. Scaling to very very large corpora for natural language disambiguation. ACL 2001.
W. Fletcher. 2004. Facilitating the compilation and dissemination of ad-hoc web corpora. Papers from TALC 2002.
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001.
F. Keller and M. Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29.
A. Kilgarriff. 2003. Linguistic search engine. Corpus Linguistics 2003.
E. Terra and C. L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. HLT-NAACL 2003.
P. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. ECML 2001.