

Using the web as a source of linguistic data: experiences, problems and perspectives

Marco Baroni

SSLMIT, University of Bologna

ICST/CNR Roma, April 2005


Outline

Introduction

Frequency estimates from search engines
- Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
- Small corpora via search engine queries
- Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


The Web as Corpus

- Computational/corpus linguists, lexicographers, ontologists and language technologists are constantly hungry for data.

- The web is a huge database of documents, mostly text.

- Kilgarriff: The web is the most exciting thing that happened to human beings in the last 20 years or so, and it’s all about linguistic communication – we linguists are in a good position to lead the study of it!!!


The Web as Corpus (cont.)

- Kilgarriff and Grefenstette, Introduction to the Special Issue on the Web as Corpus, Computational Linguistics 2003.

- Estimated web word counts by language:

  English     76,598,718,000
  German       7,035,850,000
  Italian      1,845,026,000
  Finnish        326,379,000
  Esperanto       57,154,000
  Latin           55,943,000
  Basque          55,340,000
  Albanian        10,332,000

- (Obsolete, conservative estimates.)


Some General Problems

- The web is not a balanced corpus.

- More worryingly: if you use a search engine, you have no control over the data.

- Constantly changing.

- Many languages, a lot of non-native English.

- Python (the snake or the programming language? Web counts conflate word senses).

- Google frequency of “colorless green ideas sleep furiously”: 13,000.

- Desperately seeking Blaberus giganteus.

- ...


But still... more data is better data! (Mercer, quoted by Church)

- Banko and Brill 2001 HLT paper.

- Confusion set disambiguation task.

- Choose the correct word in context from a set of words it is typically confused with: affect/effect, principal/principle.

- Even the most naive learning algorithm trained on a 10M-word training set outperforms any learner trained on a 1M-word training set.

- With a 1-billion-word training set, learners have not reached their performance asymptote.

- (Learn the language function by a simple algorithm that has access to the full extension of the function.)
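The task can be made concrete with a deliberately naive sketch (nothing like Banko and Brill’s actual learners): score each confusion-set member by how often it was seen near the test context’s words during training. All sentences and counts below are invented toy data.

```python
from collections import Counter

def train(sentences, confusion_set):
    """Count (context word, member) pairs in a +/-2 word window."""
    counts = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w in confusion_set:
                context = words[max(0, i - 2):i] + words[i + 1:i + 3]
                for c in context:
                    counts[(c, w)] += 1
    return counts

def disambiguate(context_words, confusion_set, counts):
    """Pick the member whose training contexts best match the test context."""
    return max(confusion_set,
               key=lambda cand: sum(counts[(c, cand)] for c in context_words))

# Toy training data (invented):
sentences = [
    "the new law will affect small businesses",
    "the drug had no effect on the patients",
    "side effect of the treatment",
    "changes that affect the economy",
]
counts = train(sentences, {"affect", "effect"})
# "... had no ____ on the ..." -> context words around the blank:
print(disambiguate(["no", "on"], ["affect", "effect"], counts))  # -> effect
```

Even this crude counting scheme illustrates the point of the slide: its accuracy depends almost entirely on how much training text feeds the counts.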


More web-data is better data!

- Keller and Lapata 2003, Computational Linguistics.

- Google- and AltaVista-based frequencies of A N, N N and V N bigrams:

  - correlate with BNC and NANTC frequencies;

  - correlate with WordNet-class-based smoothed frequencies;

  - correlate with human plausibility judgments more than corpus-based frequencies do (smoothed or not smoothed).


Approaches to Web as Corpus

- Collect (frequency) data directly from commercial search engines (e.g. Turney 2001, many, many others).

- Linguist-friendly interfaces to commercial search engines: WebCorp, KwicFinder, LSE (Kehoe and Renouf 2002, Fletcher 2002, Resnik and Elkiss 2003).

- Small(-ish), focused crawls of the web to find and retrieve relevant pages (e.g. Ghani et al. 2001, Baroni and Bernardini 2004, Sharoff submitted).

- WaCky!


Web-based Mutual Information

Collecting frequency data from search engines

- Probably the most popular method (Keller and Lapata 2003, Turney 2001, many others).

- A rough approximation to frequency, but:

  - empirically successful;

  - easy: the engine does most of the hard work.

- Web-based mutual information: a typical example of research using search-engine-based frequency data.


Web-based Mutual Information (WMI) (Turney 2001)

- (Pointwise) mutual information:

  MI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ] = log2 [ N · C(w1, w2) / (C(w1) C(w2)) ]

- WMI: compute the mutual information of word pairs using frequency/cooccurrence-frequency data extracted from the web via the AltaVista search engine:

  WMI(w1, w2) = log2 [ N · hits(w1 NEAR w2) / (hits(w1) hits(w2)) ]
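The MI formula is easy to compute once the counts are in hand; a minimal sketch (the counts below are invented, not real corpus or search-engine figures):

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    """Pointwise MI from counts: log2( N * C(w1,w2) / (C(w1) * C(w2)) )."""
    return log2((n * c_xy) / (c_x * c_y))

# Toy counts: two words cooccur 30 times in a corpus of 1M tokens.
print(round(pmi(30, 2000, 500, 1_000_000), 2))  # -> 4.91
```

For WMI one would plug hits(w1 NEAR w2), hits(w1), hits(w2) and a page-count N into the same function.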


Web-based Mutual Information

- Semantic similarity as direct cooccurrence (vs. occurrence in similar contexts).

- The simplicity of the method is counterbalanced by the size of the database (the WWW).

- Very effective: Turney 2001, Lin et al. 2003, Turney and Littman 2003.

- Most researchers report that WMI outperforms more sophisticated methods based on smaller corpora.

- My own experience with WMI: Baroni and Bisi 2004, Baroni and Vegnaduzzo 2004.


WMI takes the TOEFL (Turney 2001)

- TOEFL synonym match task.

- Target: levied; candidates: imposed, believed, requested, correlated.
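With a WMI-style score, a TOEFL item reduces to an argmax over the candidates. The hit counts below are invented stand-ins (Turney’s real figures came from AltaVista NEAR queries):

```python
from math import log2

def wmi(hits_near, hits_a, hits_b, n):
    """WMI-style association from hit counts, as in the slide's formula."""
    return log2((n * hits_near) / (hits_a * hits_b))

def best_candidate(target_hits, candidates, n):
    """candidates maps word -> (hits(word), hits(target NEAR word))."""
    return max(candidates,
               key=lambda w: wmi(candidates[w][1], target_hits,
                                 candidates[w][0], n))

# Invented counts for the "levied" item:
n = 350_000_000
cands = {
    "imposed":    (900_000, 5_200),
    "believed":   (4_000_000, 1_100),
    "requested":  (2_500_000, 700),
    "correlated": (300_000, 90),
}
print(best_candidate(40_000, cands, n))  # -> imposed
```

Note that dividing by the candidates’ own hit counts is what keeps very frequent words like believed from winning on raw cooccurrence alone.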


WMI takes the TOEFL (cont.)

- Performance on the TOEFL synonym match task:

  - Average foreign test taker: 64.5%

  - Latent Semantic Analysis: 65.4%

  - WMI: 72.5%


WMI and synonym detection in terminology

- Baroni and Bisi 2004 applied the WMI method to a synonym mining task in a technical domain.

- A harder task:

  - Technical terms are less frequent than general language terms (potential data sparseness issues);

  - All terms in a domain tend to be semantically related, to some extent.


Materials

- Nautical terminology.

- Terms and relational information from the structured termbase of Bisi (2003).


Task

- Given a list of pairs in any order, rank them so that synonym pairs end up at the top of the list.


Task: example

- decks/cockpit

- frames/ribs

- bottom/hull

- ...

- frames/hull
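Given any association score (e.g. WMI), the ranking itself is just a sort in descending score order. The scores below are invented for illustration:

```python
def rank_pairs(pairs, score):
    """Sort term pairs by descending association, so synonyms rise to the top."""
    return sorted(pairs, key=lambda p: score(p), reverse=True)

# Invented association scores standing in for WMI values:
toy = {
    ("decks", "cockpit"): 2.1,
    ("frames", "ribs"): 6.2,
    ("bottom", "hull"): 5.8,
    ("frames", "hull"): 1.4,
}
ranked = rank_pairs(list(toy), toy.get)
print(ranked)  # synonym pairs (frames/ribs, bottom/hull) come out first
```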


Task: settings

- Synonym term pairs vs. random term pairs (Exp 1).

- Synonym term pairs vs. other “nymic” pairs (Exp 2).


Cosine Similarity

- Term of comparison.

- Intuition: words with similar patterns of cooccurrence are likely to be similar.

- Correlation of the vectors of cooccurrence frequencies of the targets with (almost) all words in the corpus (for length-normalized vectors, the cosine reduces to the dot product):

  cos(x, y) = x · y = Σ_{i=1..n} x_i y_i
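A minimal sketch of the comparison measure; since toy count vectors are not length-normalized, this version divides by the norms explicitly. The vectors below are invented:

```python
from math import sqrt

def cosine(x, y):
    """Cosine of two cooccurrence-frequency vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm

# Toy cooccurrence counts over a shared list of context words (invented):
frames = [4, 0, 7, 1]
ribs   = [3, 0, 6, 2]
print(round(cosine(frames, ribs), 3))  # -> 0.985
```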


Cosine Similarity (cont.)

- Corpora:

  - 1.2M-word specialized corpus manually assembled by a terminologist;

  - 4.27M-word corpus constructed via random nautical term queries to Google.

- Context windows:

  - 2 words to either side of the target;

  - 5 words to either side of the target.

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Web-based Mutual Information

Cosine Similarity (cont.)

I Corpora:I 1.2M word specialized corpus manually assembled by

terminologist;

I 4.27M word corpus constructed via random nauticalterm queries to Google.

I Context windows:

I 2 words to either side of target;I 5 words to either side of target.

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Web-based Mutual Information

Cosine Similarity (cont.)

I Corpora:I 1.2M word specialized corpus manually assembled by

terminologist;I 4.27M word corpus constructed via random nautical

term queries to Google.

I Context windows:

I 2 words to either side of target;I 5 words to either side of target.

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Web-based Mutual Information

Cosine Similarity (cont.)

I Corpora:I 1.2M word specialized corpus manually assembled by

terminologist;I 4.27M word corpus constructed via random nautical

term queries to Google.

I Context windows:

I 2 words to either side of target;I 5 words to either side of target.

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Web-based Mutual Information

Cosine Similarity (cont.)

I Corpora:I 1.2M word specialized corpus manually assembled by

terminologist;I 4.27M word corpus constructed via random nautical

term queries to Google.

I Context windows:I 2 words to either side of target;

I 5 words to either side of target.

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Web-based Mutual Information

Cosine Similarity (cont.)

I Corpora:I 1.2M word specialized corpus manually assembled by

terminologist;I 4.27M word corpus constructed via random nautical

term queries to Google.

I Context windows:I 2 words to either side of target;I 5 words to either side of target.
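The window-based cooccurrence counting and the cosine measure above can be sketched as follows. This is a minimal illustration on a toy token list, not the original experimental code; the function names and the tiny example corpus are invented, and the vectors are raw counts normalized inside `cosine`.

```python
from collections import Counter
from math import sqrt

def cooccurrence_vector(tokens, target, window=2):
    """Count the words occurring within `window` tokens of each
    occurrence of `target` (the slide uses windows of 2 and 5)."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(x, y):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(x[w] * y[w] for w in x if w in y)
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

corpus = "the hull of the boat and the bottom of the boat".split()
print(cosine(cooccurrence_vector(corpus, "hull"),
             cooccurrence_vector(corpus, "bottom")))
```

Words sharing many context words (here mostly function words, since the toy corpus is tiny) get a cosine close to 1; in practice one would filter stop words or weight the counts.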

Experiment 1: Data

I 24 synonym pairs (e.g., bottom/hull, frames/ribs, displacement/weight).

I 124 non-synonym pairs:
  I 100 random pairs of nautical terms;
  I 24 recombinations of terms in the synonym set.

I 29% of random pairs rated “strongly semantically related” (e.g., awning/stern board, install/hatch, keel/coated).

Experiment 1: Results
Percentage precision at various percentage recall levels

recall    WMI        Cosine
          man corp   man corp     man corp     web corp     web corp
                     2-word win   5-word win   2-word win   5-word win
 12.5     100.0      100.0         60.0         60.0         42.9
 25.0     100.0       75.0         60.0         46.2         46.2
 37.5      90.0       42.9         39.1         40.9         45.0
 50.0      92.3       17.9         19.4         26.7         25.5
 62.5      88.2       10.8         15.0         19.0         17.6
 75.0      36.7       12.7         12.7         12.7         13.4
 87.5      30.4       14.5         14.5         14.5         14.5
100.0      16.2       16.2         16.2         16.2         16.2

Experiment 2: Data

I Same 24 synonym pairs as above.

I 31 nymic pairs from the Bisi termbase added to the test set:
  I 19 cohyponym pairs (e.g., Bruce anchor/mushroom anchor);
  I 10 hypo-/hypernym pairs (e.g., stern platform/sun deck);
  I 2 antonyms (e.g., ahead/astern).

I 31 randomly selected non-synonym pairs removed from the test set (same synonym-to-non-synonym pair ratio as above).

Experiment 2: Results
Percentage precision at various percentage recall levels

recall    WMI        Cosine
          man corp   man corp     man corp     web corp     web corp
                     2-word win   5-word win   2-word win   5-word win
 12.5      60.0       42.9         37.5         27.3         20.0
 25.0      33.3       46.2         46.2         28.6         27.3
 37.5      36.0       39.1         39.1         29.0         31.0
 50.0      40.0       19.7         21.1         23.1         22.6
 62.5      37.5       10.8         17.4         19.2         18.1
 75.0      26.5       12.7         12.7         12.7         14.1
 87.5      25.6       14.5         14.5         14.5         14.5
100.0      16.2       16.2         16.2         16.2         16.2

Houston, we have a problem

I On 31 March 2004, AltaVista's parent company Yahoo! replaced AltaVista's engine with Yahoo!'s own engine.

I End of the NEAR operator.

I Change of the underlying database.

I WMI without NEAR:

    WMI(w1, w2) = log2 [ N · hits("w1 w2") / ( hits(w1) · hits(w2) ) ]
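The NEAR-less WMI formula above is a one-liner once the hit counts are in hand. A sketch, with invented illustrative counts; N, the total number of indexed pages, is unknown and must be estimated (the 8 billion below is an arbitrary placeholder, not a figure from the talk):

```python
from math import log2

def wmi(hits_pair, hits_w1, hits_w2, n_pages):
    """Web-based Mutual Information from raw search-engine hit counts:
    WMI(w1, w2) = log2( N * hits("w1 w2") / (hits(w1) * hits(w2)) ),
    where hits("w1 w2") is the count for the exact phrase query."""
    return log2(n_pages * hits_pair / (hits_w1 * hits_w2))

# Illustrative (invented) counts; N is a rough index-size guess.
print(wmi(hits_pair=107, hits_w1=369_000, hits_w2=4_610_000,
          n_pages=8_000_000_000))
```

Since WMI is used only to rank candidate pairs, a wrong but constant estimate of N shifts all scores by the same amount and leaves the ranking intact.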

Experiment 1: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista   AltaVista   Google
          w/ NEAR     w/o NEAR
 12.5     100.0       100.0       100.0
 25.0     100.0       100.0        85.7
 37.5      90.0       100.0        81.8
 50.0      92.3        75.0        85.7
 62.5      88.2        62.5        60.0
 75.0      36.7        45.0        64.3
 87.5      30.4        34.4        45.6
100.0      16.2        19.3        17.3

Experiment 2: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista   AltaVista   Google
          w/ NEAR     w/o NEAR
 12.5      60.0        42.8        50.0
 25.0      33.3        50.0        37.5
 37.5      36.0        52.9        45.0
 50.0      40.0        38.7        40.0
 62.5      37.5        32.6        31.9
 75.0      26.5        28.6        34.0
 87.5      25.6        25.6        30.0
100.0      16.2        18.5        17.0

Pros and cons of search engine frequencies

I The main advantage: it's easy.

I The main problem: we depend on commercial search engines.

I Linguists' satisfaction is obviously not their priority.

A telling anecdote

(Talking to a new acquaintance who works at Google)

Me: So, do you guys have plans to introduce the NEAR operator?

The Google Acquaintance: You are a linguist, right? Only linguists ask about that sort of stuff. . .

Consequences

I Limited query options (not even diacritics and accents), limited research options.

I You must know the words you are looking for.

I No annotation; few, unreliable metadata.

I Constraints on automated querying; over-querying strongly discouraged.

I We know very little about the data we get.

I No control over how search engines evolve.

I Brittleness!

Fletcher 2004 saying the same things

Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites' quest for revenue can diminish the objectivity of their search results, and various “page ranking” algorithms may lead to results that are not representative of the Web as a whole. Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines' services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.

Worrying data from the Google APIs
Pattern discovered by Luca Onnis

Query                APIs      Website    Ratio
pleasantly           369000     870000    0.42
awkwardly            124000     292000    0.42
silent              4610000   11000000    0.42
pleasantly silent       107        135    0.79
awkwardly silent        396        566    0.70

A few more things to worry about

I Google inflating its counts (Véronis's blog, 2005).

I Is the * operator still supported?

Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist's Search Engine

Enter WaCky!

The “linguist's friendly” interfaces

I WebCorp, KwicFinder, Linguist's Search Engine.

I “Wrappers” around Google, AltaVista, etc.

I Nice interfaces, but they ultimately inherit all the problems of search engines, and perhaps add some more with their filters. . .

I E.g., the “spongi*” query in WebCorp (Stefan Evert).


Building special corpora with search engine queries

I By downloading text, more control over the data.

I But less work and more targeted data than spidering your own corpus.

I Good for “special purposes” corpora:
  I “minority” languages (CorpusBuilder; Ghani, Jones, Mladenic, CIKM-2001);
  I specialized sub-languages (BootCaT).

The BootCaT tools

I Bootstrapping Corpora and Terms from the web.

I Perl scripts freely available from: http://sslmit.unibo.it/~baroni/bootcat.html

I Original motivation: fast construction of ad-hoc corpora and term lists for translation/interpreting tasks and terminography.

The BootCaT procedure

(Flowchart:) Select initial terms → query Google for random term combinations → retrieve pages and format as text (corpus) → extract new terms via corpus comparison → loop back to querying; finally, extract multi-word terms using the corpus, the uni-terms, POS templates and distributional patterns.

Terms and Term Combinations

I 5-20 terms typical of the domain.

I Selection: human or automated (e.g., via text/corpus comparison).

I Seed terms randomly combined into tuples to perform Google queries:
  I Longer tuples: better precision;
  I Shorter tuples: better recall.
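The random tuple construction can be sketched as follows. BootCaT itself is a set of Perl scripts; this Python fragment only illustrates the idea, and the function name and seed list are invented:

```python
import random

def random_tuples(seeds, tuple_len=3, n_queries=10, seed=0):
    """Sample distinct random combinations of seed terms; each
    combination becomes one conjunctive search-engine query.
    A longer tuple_len constrains the query more (better precision);
    a shorter one matches more pages (better recall)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    combos = set()
    while len(combos) < n_queries:
        combos.add(tuple(sorted(rng.sample(seeds, tuple_len))))
    return [" ".join(c) for c in combos]

seeds = ["hull", "keel", "anchor", "stern", "bow", "deck", "mast", "rudder"]
for q in random_tuples(seeds, tuple_len=3, n_queries=5):
    print(q)
```

Sorting each sampled tuple before adding it to the set treats order-permuted combinations as duplicates, so the same term triple is never queried twice.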

Corpus/Term Bootstrapping

I The bootstrap:
  1. Retrieve a corpus from the web via Google tuple queries;
  2. Extract typical terms through statistical comparison with a reference corpus (using Mutual Information, Log-Likelihood Ratio, etc.);
  3. Use the found terms as new seeds and build new random tuples;
  4. Go back to 1.

I Retrieved pages formatted as text (character-set issues, non-text format issues; in Japanese, tokenization issues).

I Reference corpus: better if balanced, but any corpus on a different topic will usually do (though in Japanese the register of the corpora turns out to be crucial!).

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Small corpora via search engine queriesThinking Big: The “real” Linguist’s Search Engine

Corpus/Term Bootstrapping

I The bootstrap:

1. Retrieve corpus from web via Google tuple queries;2. Extract typical terms through statistical comparison

with reference corpus (using Mutual Information,Log-Likelihood Ratio, etc.);

3. Use found terms as new seeds and build new randomtuples;

4. Go back to 1.I Retrieved pages formatted as text (character set issues,

non-text format issues; in Japanese: tokenization issues).I Reference corpus: better if balanced, but any corpus on

different topic will usually do (but in Japanese register ofcorpora turns out to be crucial!)

IntroductionFrequency estimates from search engines

The “linguists’ friendly” interfacesBuilding your own web corpus

Enter WaCky!

Small corpora via search engine queriesThinking Big: The “real” Linguist’s Search Engine

Example 1: Pseudo-seizures in English
Baroni and Bernardini 2004

I Seed terms: dissociative, epilepsy, interventions, posttraumatic, pseudoseizures, ptsd.

I Reference: Brown (1.1M words).

I Corpus comparison: via Log Odds Ratio.

I Two iterations.

I A 1.4M-word corpus constructed; 1,800 unigram terms extracted.

I 20/30 randomly selected documents from the corpus rated as relevant and informative.
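The log odds ratio used for the corpus comparison can be computed directly from a 2×2 contingency table (term vs. other tokens, target vs. reference corpus). A minimal sketch; the figures in the example are invented for illustration, not taken from the study:

```python
import math

def log_odds_ratio(freq_t, size_t, freq_r, size_r):
    """Log odds ratio of a term in a target vs. a reference corpus,
    with 0.5 added to each cell (Haldane-Anscombe correction) so that
    zero frequencies do not blow up. Positive values mean the term is
    more typical of the target corpus."""
    a = freq_t + 0.5              # term occurrences in target
    b = size_t - freq_t + 0.5     # other tokens in target
    c = freq_r + 0.5              # term occurrences in reference
    d = size_r - freq_r + 0.5     # other tokens in reference
    return math.log((a * d) / (b * c))

# e.g. a term with 120 hits in a 1.4M-word web corpus and 0 hits
# in a 1.1M-word reference (invented figures): clearly positive.
score = log_odds_ratio(120, 1_400_000, 0, 1_100_000)
```

Terms are then ranked by this score, and the highest-scoring ones are taken as typical of the domain.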


Example 2: Hotel terminology in Japanese
Baroni and Ueyama 2004

I 20 manually selected initial terms.

I A 3.5M-word reference corpus built with BootCaT using random elementary Japanese words as seeds.

I Corpus comparison: via MI and Log-Likelihood Ratio.

I Three iterations.

I A 1.3M-word corpus constructed; 424 terms extracted.

I 76/90 randomly selected documents assigned the highest relevance/informativeness rating.

I 58.4% of terms rated very relevant; 81.7% rated at least somewhat relevant.
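Dunning's log-likelihood ratio, the other comparison measure mentioned here, can be written compactly with an x·log x helper. A minimal sketch (not the actual BootCaT implementation):

```python
import math

def llr(freq_t, size_t, freq_r, size_r):
    """Dunning's log-likelihood ratio (G^2) comparing a term's frequency
    across two corpora; higher values = stronger evidence that the term
    is distributed differently in the two."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    k1, n1 = freq_t, size_t   # target corpus: term count, corpus size
    k2, n2 = freq_r, size_r   # reference corpus: term count, corpus size
    # Log-likelihood of the counts under separate binomial models
    # minus the log-likelihood under a single pooled model.
    ll = (xlogx(k1) + xlogx(n1 - k1) - xlogx(n1)
          + xlogx(k2) + xlogx(n2 - k2) - xlogx(n2)
          - xlogx(k1 + k2) - xlogx(n1 + n2 - k1 - k2) + xlogx(n1 + n2))
    return 2 * ll
```

When the term has the same relative frequency in both corpora the statistic is zero; any divergence makes it positive.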


Applications

I Languages: English, Italian, Japanese, Spanish, German, French, Russian, Chinese, Danish.

I Domains: medical, legal, meteorology, food, nautical terminology, (e-)commerce. . .

I Uses: technical translation, interpreting tasks, resources for LSP teaching, populating ontologies, expanding a lexicon in systematic ways, general corpus construction (Sharoff, submitted).


Ongoing and planned work

I Special queries.

I Better character set handling.

I Better pdf/doc conversion.

I Better integration with UCS and other tools.

I Multi-term extraction.

I Yahoo API?


Pros

I We still rely on a commercial search engine, but less so.

I We only use the most basic query function, which is less likely to change.

I Language filtering and good relevance ranking are crucial characteristics of successful search engines.

I We are less likely to bother the engine by over-querying, since with one query we can obtain MBs of data.

I We have full control over the data (e.g. frequency counts, parsing, manual URL filtering) because we download them.


Cons

I We still rely on a commercial search engine:

I What happens if Google discontinues the API service?
I What happens if Google does something too smart or too commercial with the page ranks?

I Good for content-driven corpus building; problems with syntax/style/genre-based filtering.

I Good for building small, targeted corpora (but see Sharoff’s – and Ciaramita’s? – work).

I Not for exploiting the vastness of the web-as-corpus directly.


Biting the bullet. . .

I Crawling, cleaning, annotating, managing and maintaining your own indexed version of the web.

I Obviously, the “ideal” solution.

I But obviously a lot of work!


Build your own search engine

I Crawling.

I Post-processing (HTML/boilerplate stripping, language recognition, duplicate detection, “connected prose” recognition. . . )

I Linguistic processing.

I Categorization, meta-data.

I Indexing.

I Interfaces.
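Of the post-processing steps above, duplicate detection is commonly done by comparing sets of word n-grams (“shingles”) with a Jaccard overlap threshold. A naive sketch under that assumption; production systems hash or sketch the shingle sets instead of this O(d²) all-pairs comparison:

```python
def shingles(text, n=5):
    """Set of word n-grams ("shingles") of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, n=5, threshold=0.8):
    """Return index pairs of documents whose shingle overlap exceeds
    the threshold (naive all-pairs version for illustration)."""
    sets = [shingles(d, n) for d in docs]
    return [(i, j)
            for i in range(len(docs)) for j in range(i + 1, len(docs))
            if jaccard(sets[i], sets[j]) >= threshold]
```

Documents flagged as near-duplicates are then dropped (or capped per site) before linguistic processing and indexing.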


The huge web-corpus of Clarke and collaborators

I Terabyte crawl of the web in 2001.

I From an initial seed set of 2,392 (English?) educational URLs.

I No duplicates, not too many pages from the same site.

I No language filtering.

I 53 billion words, 77 million documents.

I (The BNC has 100 million words; Google indexes 8 billion documents.)


The TOEFL synonym match test, again

I Target: levied; Candidates: imposed, believed, requested, correlated.
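The synonym-match decision itself is just an arg-max over an association score between the target and each candidate. A sketch using pointwise mutual information over co-occurrence counts; the counts below are invented for illustration, whereas in the actual experiments they come from search-engine hit counts or the terabyte corpus:

```python
import math

def pmi(w1, w2, unigrams, pairs, n_tokens, n_pairs):
    """Pointwise mutual information of a word pair from corpus counts."""
    p1 = unigrams[w1] / n_tokens
    p2 = unigrams[w2] / n_tokens
    p12 = pairs.get((w1, w2), 0) / n_pairs
    return math.log(p12 / (p1 * p2)) if p12 > 0 else float("-inf")

def best_synonym(target, candidates, unigrams, pairs, n_tokens, n_pairs):
    """Pick the candidate most strongly associated with the target."""
    return max(candidates,
               key=lambda c: pmi(target, c, unigrams, pairs,
                                 n_tokens, n_pairs))
```

Since all the candidates here have the same unigram frequency in the toy counts, the arg-max reduces to picking the candidate that co-occurs most often with the target.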


WMI takes the TOEFL again
Terra and Clarke 2003

I Performance on the TOEFL synonym match task:

I Average foreign test taker: 64.5%
I Latent Semantic Analysis: 65.4%
I WMI: 72.5%
I Terra & Clarke’s WMI: 81.25%


Pros

I Independence from commercial search engines.

I Precious, multi-purpose resource.

I In principle, you can do what you want with it.


Cons

I A lot of work.

I Resource-intensive.

I In principle, you can do what you want with it. . .

I In practice, almost anything you want to do with a terabyte corpus will be extremely complicated to do.

I Forget about the “do it yourself with a perl script” approach.



Outline

Introduction

Frequency estimates from search engines
Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
Small corpora via search engine queries
Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


Enter WaCky!

I The Web-as-Corpus kool ynitiative.

I http://wacky.sslmit.unibo.it/

I WaCky crowd: Marco, Massi, Silvia Bernardini, Stefan Evert, Bill Fletcher, Adam Kilgarriff. . .

I Yet Another Linguist’s Search Engine proposal (see also: Kilgarriff 2003, Fletcher 2004).

I The WaCky philosophy: try to get something concrete out there very soon, so that others will feel motivated to contribute.

I Three 1-billion-word corpora (English, German, Italian) by spring 2006.

I Web interface(s) and an open source toolkit.



Enter WaCky! (cont.)

I We must learn from IR and massive dataset studies (e.g., near-duplicate detection, fast retrieval). . .

I but there are important differences, for example:

I We probably want all data, or perhaps random data, or even linguistically interesting data, not necessarily the most relevant data.

I We care about (linguistic) form at least as much as about content.

I A new challenge in computational linguistics: data are not given.



Enter WaCky! (cont.)

I Emphasis on:
I Transparency;
I Stability;
I Pre-processing;
I Categorization and annotation;
I (Also) automated access;
I Sophisticated query options.

I Not so important:
I Access speed;
I Updating;
I Size;
I Content-driven relevance.



The WaCkodules: Where We Are At

I Seeding the Crawls: BNC/Google seeding experiments and Massi’s measures of randomness.

I Crawling: with Heritrix, the Internet Archive crawler.

I Post-processing: current focus on duplicate detection.

I Linguistic annotation, meta-data: nothing yet.

I Indexing: Lucene vs. the newly open (!) IMS Corpus WorkBench.

I Interfaces: work by Stefan Evert.
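For the post-processing step, the standard near-duplicate technique from the IR literature is w-shingling with Jaccard overlap. A minimal sketch of that idea (shingle size and threshold are illustrative choices, not WaCky's actual settings):

```python
def shingles(text, w=5):
    """Set of all w-word sequences (shingles) occurring in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(doc1, doc2, w=5, threshold=0.9):
    """Flag two documents as near-duplicates when shingle overlap is high."""
    return jaccard(shingles(doc1, w), shingles(doc2, w)) >= threshold
```

At web scale one would not compare full shingle sets pairwise; real systems hash the shingles and compare small fixed-size sketches (min-hashing), which is exactly the kind of massive-dataset machinery the previous slide says we must borrow from IR.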



A few references

M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.
M. Baroni and S. Bisi. 2004. Using cooccurrence statistics and the web to discover synonyms in a specialized language. LREC 2004.
M. Banko and E. Brill. 2001. Scaling to very very large corpora for natural language disambiguation. ACL 2001.
W. Fletcher. 2004. Facilitating the compilation and dissemination of ad-hoc web corpora. Papers from TALC 2002.
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001.
F. Keller and M. Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29.
A. Kilgarriff. 2003. Linguistic search engine. Corpus Linguistics 2003.
E. Terra and C. L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. HLT-NAACL 2003.
P. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. ECML 2001.