

Using the web as a source of linguistic data: experiences, problems and perspectives

Marco Baroni

SSLMIT, University of Bologna

ICST/CNR Roma, April 2005


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


The Web as Corpus

- Computational/corpus linguists, lexicographers, ontologists, and language technologists are constantly hungry for data.

- The web is a huge database of documents, mostly text.

- Kilgarriff: the web is the most exciting thing that happened to human beings in the last 20 years or so, and it’s all about linguistic communication – we linguists are in a good position to lead the study of it!!!


The Web as Corpus (cont.)

- Kilgarriff and Grefenstette, Introduction to the Special Issue on the Web as Corpus, Computational Linguistics 2003.

  Language    Estimated web size (words)
  English     76,598,718,000
  German      7,035,850,000
  Italian     1,845,026,000
  Finnish     326,379,000
  Esperanto   57,154,000
  Latin       55,943,000
  Basque      55,340,000
  Albanian    10,332,000

- (Obsolete, conservative estimates.)


Some General Problems

- The web is not a balanced corpus.
- More worryingly: if you use a search engine, you have no control over the data.
- Constantly changing.
- Many languages, and a lot of non-native English.
- Python.
- Google frequency of “colorless green ideas sleep furiously”: 13,000.
- Desperately seeking Blaberus Giganteus.
- ...


But still... more data is better data! (Mercer, quoted by Church)

- Banko and Brill 2001 HLT paper.
- Confusion set disambiguation task.
- Choose the correct word in context from a set of words it is typically confused with: affect/effect, principal/principle (a naive learner for this task is sketched below).
- Even the most naive learning algorithm trained on a 10M-word training set outperforms any learner trained on a 1M-word training set.
- With a 1-billion-word training set, learners have not yet reached their performance asymptote.
- (Learn the language function by a simple algorithm that has access to the full extension of the function.)
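To make the setup concrete, here is a minimal Python sketch of one such naive learner (not Banko and Brill's actual systems; all names are illustrative): count how often each confusion-set member was seen next to each neighboring word in training, then pick the member best supported by the neighbors observed at test time.

```python
from collections import Counter

def train(sentences, confusion_set):
    """Count (neighbor word, confusable word) pairs over tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w in confusion_set:
                for j in (i - 1, i + 1):          # immediate neighbors only
                    if 0 <= j < len(sent):
                        counts[(sent[j], w)] += 1
    return counts

def disambiguate(left, right, confusion_set, counts):
    """Pick the confusion-set member most often seen with these two neighbors."""
    return max(confusion_set,
               key=lambda w: counts[(left, w)] + counts[(right, w)])

# e.g., after training on a large corpus:
# disambiguate("the", "of", {"affect", "effect"}, counts)  -> "effect", one hopes
```

The point of the slide is that even a learner this crude keeps improving as the training corpus grows.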


More web-data is better data!

- Keller and Lapata 2003, Computational Linguistics.
- Google- and AltaVista-based frequencies of A N, N N and V N bigrams:
  - correlate with BNC and NANTC frequencies (see the sketch after this list);
  - correlate with WordNet-class-based smoothed frequencies;
  - correlate with human plausibility judgments more strongly than corpus-based frequencies do (smoothed or not smoothed).
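As a toy version of the kind of check reported there, one can correlate log-transformed web hit counts with log-transformed corpus counts for the same bigrams. This is only a sketch of one plausible instantiation, not Keller and Lapata's actual design:

```python
import math
from scipy.stats import pearsonr

def log_freq_correlation(web_counts, corpus_counts):
    """Pearson correlation of log counts for the same bigrams
    (add-one to avoid log(0) for bigrams unseen in the corpus)."""
    lw = [math.log(c + 1) for c in web_counts]
    lc = [math.log(c + 1) for c in corpus_counts]
    r, _p = pearsonr(lw, lc)
    return r

# e.g. log_freq_correlation([13000, 240, 7], [310, 12, 0])
```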


Approaches to Web as Corpus

- Collect (frequency) data directly from commercial search engines (e.g. Turney 2001, many many others).
- Linguists’ friendly interfaces to commercial search engines: WebCorp, KwicFinder, LSE (Kehoe and Renouf 2002, Fletcher 2002, Resnik and Elkiss 2003).
- Small(-ish), focused crawls of the web to find and retrieve relevant pages (e.g. Ghani et al. 2001, Baroni and Bernardini 2004, Sharoff submitted).
- WaCky!


Web-based Mutual Information



Collecting frequency data from search engines

- Probably the most popular method (Keller and Lapata 2003, Turney 2001, many others).
- A rough approximation to frequency, but:
  - empirically successful;
  - easy: the engine does most of the hard work.
- Web-based mutual information: a typical example of research using search-engine-based frequency data.


Web-based Mutual Information (WMI) (Turney 2001)

- (Pointwise) mutual information:

  MI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = \log_2 \frac{N \cdot C(w_1, w_2)}{C(w_1)\,C(w_2)}

- WMI: compute the mutual information of word pairs using frequency/co-occurrence frequency data extracted from the web via the AltaVista search engine:

  WMI(w_1, w_2) = \log_2 \frac{N \cdot \mathrm{hits}(w_1\ \mathrm{NEAR}\ w_2)}{\mathrm{hits}(w_1)\,\mathrm{hits}(w_2)}
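In code, the WMI computation is just a couple of hit-count lookups. The sketch below assumes a hypothetical hits() helper standing in for whatever count source is available (AltaVista's NEAR operator is long gone, and no current engine exposes a comparable counts API), and treats the collection size N as a rough constant:

```python
import math

def hits(query):
    """Hypothetical helper: return the engine's estimated match count
    for `query`. Plug in whatever hit-count source you actually have."""
    raise NotImplementedError

def wmi(w1, w2, n=1e9, near=True):
    """Web-based pointwise MI in the spirit of Turney (2001).
    `n` is a rough guess at the size of the indexed collection;
    with near=False, fall back to a plain co-occurrence query."""
    joint = hits(f"{w1} NEAR {w2}" if near else f"{w1} {w2}")
    h1, h2 = hits(w1), hits(w2)
    if 0 in (joint, h1, h2):
        return float("-inf")       # MI undefined for zero counts
    return math.log2(n * joint / (h1 * h2))
```

The near flag anticipates the NEAR-less fallback discussed later in the talk.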


Web-based Mutual Information

- Semantic similarity as direct co-occurrence (vs. occurrence in similar contexts).
- The simplicity of the method is counterbalanced by the size of the database (the WWW).
- Very effective: Turney 2001, Lin et al. 2003, Turney and Littman 2003.
- Most researchers report that WMI outperforms more sophisticated methods based on smaller corpora.
- My own experience with WMI: Baroni and Bisi 2004, Baroni and Vegnaduzzo 2004.


WMI takes the TOEFL (Turney 2001)

- TOEFL synonym match task.
- Target: levied; candidates: imposed, believed, requested, correlated. (A sketch of the selection step follows below.)
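Given the wmi() sketch above, the TOEFL step is a one-liner: pick the candidate that maximizes WMI with the target. Again, this is only a sketch of Turney's setup:

```python
def best_synonym(target, candidates):
    """Choose the candidate with the highest web-based MI with the target."""
    return max(candidates, key=lambda c: wmi(target, c))

# best_synonym("levied", ["imposed", "believed", "requested", "correlated"])
# should return "imposed" if the web counts behave as reported.
```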


WMI takes the TOEFL (cont.)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%


WMI and synonym detection in terminology

- Baroni and Bisi 2004 applied the WMI method to a synonym mining task in a technical domain.
- A harder task:
  - technical terms are less frequent than general-language terms (potential data sparseness issues);
  - all terms in a domain tend to be semantically related, to some extent.


Materials

- Nautical terminology.
- Terms and relational information from the structured termbase of Bisi (2003).


Task

- Given a list of pairs in any order, rank them so that synonym pairs end up at the top of the list (a one-line ranking sketch follows below).
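With a pairwise score in hand, the ranking task reduces to a sort. A minimal sketch, reusing the hypothetical wmi() from above:

```python
def rank_pairs(pairs):
    """Sort candidate term pairs so that high-WMI (hopefully synonym) pairs come first."""
    return sorted(pairs, key=lambda p: wmi(p[0], p[1]), reverse=True)
```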


Task: example

- Input, in arbitrary order:
  - decks/cockpit
  - frames/ribs
  - bottom/hull
  - ...
  - frames/hull
- Target ranking (synonym pairs on top):
  - frames/ribs
  - bottom/hull
  - decks/cockpit
  - ...
  - frames/hull


Task: settings

- Synonym term pairs vs. random term pairs (Exp 1).
- Synonym term pairs vs. other “nymic” pairs (Exp 2).


Cosine Similarity

- Baseline for comparison.
- Intuition: words with similar patterns of co-occurrence are likely to be similar.
- Correlation of the vectors of co-occurrence frequencies of the targets with (almost) all words in the corpus:

  \cos(\vec{x}, \vec{y}) = \vec{x} \cdot \vec{y} = \sum_{i=1}^{n} x_i y_i

  (the cosine reduces to the plain dot product when the vectors are length-normalized)
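A minimal NumPy sketch of the cosine measure over co-occurrence count vectors; the explicit normalization makes the dot-product formulation above hold for raw counts too:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity of two co-occurrence count vectors."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        return 0.0                 # no co-occurrence evidence for one word
    return float(np.dot(x, y) / (nx * ny))

# e.g. cosine(np.array([3, 0, 1]), np.array([2, 1, 0]))
```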


Cosine Similarity (cont.)

- Corpora:
  - 1.2M-word specialized corpus manually assembled by a terminologist;
  - 4.27M-word corpus constructed via random nautical-term queries to Google.
- Context windows:
  - 2 words to either side of the target;
  - 5 words to either side of the target.


Experiment 1: Data

- 24 synonym pairs (e.g., bottom/hull, frames/ribs, displacement/weight).
- 124 non-synonym pairs:
  - 100 random pairs of nautical terms;
  - 24 recombinations of terms in the synonym set.
- 29% of the random pairs were rated “strongly semantically related” (e.g., awning/stern board, install/hatch, keel/coated).


Experiment 1: Results
Percentage precision at various percentage recall levels (man corp = manually assembled corpus, web corp = web-derived corpus; 2w/5w = 2-/5-word context window):

  recall   WMI     Cosine        Cosine        Cosine        Cosine
                   man corp, 2w  man corp, 5w  web corp, 2w  web corp, 5w
   12.5    100.0   100.0          60.0          60.0          42.9
   25.0    100.0    75.0          60.0          46.2          46.2
   37.5     90.0    42.9          39.1          40.9          45.0
   50.0     92.3    17.9          19.4          26.7          25.5
   62.5     88.2    10.8          15.0          19.0          17.6
   75.0     36.7    12.7          12.7          12.7          13.4
   87.5     30.4    14.5          14.5          14.5          14.5
  100.0     16.2    16.2          16.2          16.2          16.2
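The evaluation in these tables can be reproduced from any ranked list with binary gold labels. Here is a sketch of one way to compute precision at fixed recall levels, assuming the labels are ordered best-score-first:

```python
def precision_at_recall(ranked_labels,
                        levels=(12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100)):
    """ranked_labels[i] is True iff the i-th ranked pair is a true synonym pair.
    Returns {recall level: precision (%) when that recall is first reached}."""
    total_pos = sum(ranked_labels)
    results, positives = {}, 0
    pending = list(levels)
    for rank, is_pos in enumerate(ranked_labels, start=1):
        positives += is_pos
        # record precision the first time each recall level is reached
        while pending and positives >= pending[0] / 100 * total_pos:
            results[pending[0]] = 100 * positives / rank
            pending.pop(0)
    return results

# e.g., for Experiment 1: gold labels for the 148 pairs (24 positives),
# ordered by descending WMI or cosine score.
```

With 24 synonym pairs, each 12.5% recall step corresponds to 3 more synonym pairs retrieved, which is why the tables use those levels.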


Experiment 2: Data

- Same 24 synonym pairs as above.
- 31 “nymic” pairs from the Bisi termbase added to the test set:
  - 19 cohyponym pairs (e.g., Bruce anchor/mushroom anchor);
  - 10 hypo/hypernym pairs (e.g., stern platform/sun deck);
  - 2 antonym pairs (e.g., ahead/astern).
- 31 randomly selected non-synonym pairs removed from the test set (same synonym-to-non-synonym pair ratio as above).


Experiment 2: Results
Percentage precision at various percentage recall levels (same column layout as Experiment 1):

  recall   WMI     Cosine        Cosine        Cosine        Cosine
                   man corp, 2w  man corp, 5w  web corp, 2w  web corp, 5w
   12.5     60.0    42.9          37.5          27.3          20.0
   25.0     33.3    46.2          46.2          28.6          27.3
   37.5     36.0    39.1          39.1          29.0          31.0
   50.0     40.0    19.7          21.1          23.1          22.6
   62.5     37.5    10.8          17.4          19.2          18.1
   75.0     26.5    12.7          12.7          12.7          14.1
   87.5     25.6    14.5          14.5          14.5          14.5
  100.0     16.2    16.2          16.2          16.2          16.2


Houston, we have a problem

- On 31 March 2004, AltaVista's parent company Yahoo! replaced AltaVista's engine with Yahoo!'s own engine.

- End of the NEAR operator.

- Change of underlying database.

- WMI without NEAR:

  $\mathrm{WMI}(w_1, w_2) = \log_2 \frac{N \cdot \mathit{hits}(w_1\ w_2)}{\mathit{hits}(w_1) \cdot \mathit{hits}(w_2)}$
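
(A minimal sketch, not from the talk, of computing this score; get_hits() is a hypothetical wrapper around a search engine API, and N is an assumed estimate of index size.)

    import math

    N = 4 * 10**9  # assumed rough number of indexed pages

    def get_hits(query):
        """Hypothetical wrapper returning the engine's hit count for a query."""
        raise NotImplementedError

    def wmi(w1, w2):
        """Web-based MI without NEAR: the pair is queried as an exact phrase."""
        joint = get_hits('"%s %s"' % (w1, w2))
        if joint == 0:
            return float("-inf")
        return math.log2(N * joint / (get_hits(w1) * get_hits(w2)))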


Experiment 1: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista    AltaVista    Google
          w/ NEAR      w/o NEAR
 12.5     100.0        100.0        100.0
 25.0     100.0        100.0         85.7
 37.5      90.0        100.0         81.8
 50.0      92.3         75.0         85.7
 62.5      88.2         62.5         60.0
 75.0      36.7         45.0         64.3
 87.5      30.4         34.4         45.6
100.0      16.2         19.3         17.3


Experiment 2: with and without NEAR
Percentage precision at various percentage recall levels

recall    AltaVista    AltaVista    Google
          w/ NEAR      w/o NEAR
 12.5      60.0         42.8         50.0
 25.0      33.3         50.0         37.5
 37.5      36.0         52.9         45.0
 50.0      40.0         38.7         40.0
 62.5      37.5         32.6         31.9
 75.0      26.5         28.6         34.0
 87.5      25.6         25.6         30.0
100.0      16.2         18.5         17.0


Pros and cons of search engine frequencies

- The main advantage: it's easy.

- The main problem: we depend on commercial search engines.

- Linguists' satisfaction is obviously not their priority.


A telling anecdote

(Talking to a new acquaintance who works at Google)

Me: So, do you guys have plans to introduce the NEAR operator?

The Google Acquaintance: You are a linguist, right? Only linguists ask about that sort of stuff...


Consequences

- Limited query options (not even diacritics and accents), limited research options.

- You must know the words you are looking for.

- No annotation; few, unreliable metadata.

- Automated querying constraints; over-querying strongly discouraged.

- We know very little about the data we get.

- No control over how search engines evolve.

- Brittleness!


Fletcher 2004 saying the same things

Search engines are not research libraries but commercial enterprises targeted at the needs of the general public. The availability and implementation of their services change constantly: features are added or dropped to mimic or outdo the competition; acquisitions and mergers threaten their independence; financial uncertainties and legal battles challenge their very survival. The search sites’ quest for revenue can diminish the objectivity of their search results, and various “page ranking” algorithms may lead to results that are not representative of the Web as a whole. Most frustrating is the minimal support for the requirements of serious researchers: current trends lead away from sites like AltaVista supporting sophisticated complex queries (which few users employ) to ones like Google offering only simple search criteria. In short, the search engines’ services are useful to investigators by coincidence, not design, and researchers are tolerated on mainstream search sites only as long as their use does not affect site performance adversely.


Worrying data from the Google APIs
Pattern discovered by Luca Onnis

Query               APIs        Website      Ratio
pleasantly           369000       870000      0.42
awkwardly            124000       292000      0.42
silent              4610000     11000000      0.42
pleasantly silent       107          135      0.79
awkwardly silent        396          566      0.70
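
(What the ratios show: the API count is a near-constant ~0.42 fraction of the website count for single words, but a much larger fraction for the two bigrams, so at least one of the two count sources behaves inconsistently. A quick check with the table's own figures:)

    counts = {  # query: (API count, website count), as in the table above
        "pleasantly": (369000, 870000),
        "awkwardly": (124000, 292000),
        "silent": (4610000, 11000000),
        "pleasantly silent": (107, 135),
        "awkwardly silent": (396, 566),
    }
    for query, (api, web) in counts.items():
        print(f"{query}: {api / web:.2f}")  # single words ~0.42, bigrams 0.70-0.79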


A few more things to worry about

- Google inflating its counts (Véronis’s blog, 2005).

- Is the * operator still supported?


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


The “linguist’s friendly” interfaces

- WebCorp, KwicFinder, Linguist’s Search Engine.

- “Wrappers” around Google, AltaVista, etc.

- Nice interfaces, but they ultimately inherit all the problems of search engines, and perhaps add some more with their filters...

- E.g., the “spongi*” query in WebCorp (Stefan Evert).



Small corpora via search engine queries

Building special corpora with search engine queries

- By downloading text, more control over data.

- But less work, more targeted data than spidering your own corpus.

- Good for “special purposes” corpora:
  - “minority” languages (CorpusBuilder; Ghani, Jones, Mladenic, CIKM-2001);
  - specialized sub-languages (BootCaT).


The BootCaT tools

- Bootstrapping Corpora and Terms from the web.

- Perl scripts freely available from:
  http://sslmit.unibo.it/~baroni/bootcat.html

- Original motivation: fast construction of ad-hoc corpora and term lists for translation/interpreting tasks, terminography.


The BootCaT procedure

(Flow diagram, rendered as a list:)

1. Select initial terms.
2. Query Google for random term combinations.
3. Retrieve pages and format as text (corpus).
4. Extract new terms via corpus comparison; feed them back into step 2.
5. Extract multi-word terms using the corpus, uni-terms, distributional patterns and POS templates.


Terms and Term Combinations

- 5–20 terms typical of the domain.

- Selection: human or automated (e.g. via text/corpus comparison).

- Seed terms randomly combined into tuples to perform Google queries (see the sketch below):
  - Longer tuples: better precision;
  - Shorter tuples: better recall.
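
(A minimal sketch of the tuple-building step, not taken from the BootCaT scripts themselves; the nautical seed list is invented for illustration.)

    import random

    def build_tuples(seeds, tuple_size=3, n_tuples=10):
        """Randomly combine seed terms into query tuples: larger tuples
        favour precision, smaller ones favour recall."""
        return [tuple(random.sample(seeds, tuple_size)) for _ in range(n_tuples)]

    seeds = ["anchor", "mooring", "stern", "bilge", "halyard", "winch"]
    for t in build_tuples(seeds):
        print(" ".join(t))  # each line becomes one search engine query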


Corpus/Term Bootstrapping

- The bootstrap (see the sketch below):
  1. Retrieve corpus from web via Google tuple queries;
  2. Extract typical terms through statistical comparison with a reference corpus (using Mutual Information, Log-Likelihood Ratio, etc.);
  3. Use found terms as new seeds and build new random tuples;
  4. Go back to 1.

- Retrieved pages formatted as text (character set issues, non-text format issues; in Japanese: tokenization issues).

- Reference corpus: better if balanced, but any corpus on a different topic will usually do (though in Japanese the register of the corpora turns out to be crucial!).
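
(A sketch of the loop under simplifying assumptions: retrieve_corpus() is a hypothetical stub standing in for the query/download/cleanup steps, and term extraction uses a simple log odds ratio over word frequencies, one of the comparison statistics in the family mentioned above. None of this is the literal BootCaT Perl code.)

    import math
    from collections import Counter

    def retrieve_corpus(tuples):
        """Hypothetical stub: query the engine with each tuple, download
        the hit pages, strip HTML, and return the text as one string."""
        raise NotImplementedError

    def extract_terms(spec_text, ref_freqs, top_k=20):
        """Rank the words of the specialized corpus against a reference
        frequency list by log odds ratio; return the top_k candidates."""
        spec_freqs = Counter(spec_text.lower().split())
        n_spec = sum(spec_freqs.values())
        n_ref = sum(ref_freqs.values())
        def score(word, f_spec):
            f_ref = ref_freqs.get(word, 0)
            odds_spec = (f_spec + 0.5) / (n_spec - f_spec + 0.5)
            odds_ref = (f_ref + 0.5) / (n_ref - f_ref + 0.5)
            return math.log(odds_spec / odds_ref)
        scored = {w: score(w, f) for w, f in spec_freqs.items()}
        return sorted(scored, key=scored.get, reverse=True)[:top_k]

    def bootstrap(seeds, ref_freqs, build_tuples, iterations=2):
        """Steps 1-4 of the slide: retrieve, compare, reseed, repeat."""
        corpus = ""
        for _ in range(iterations):
            corpus += retrieve_corpus(build_tuples(seeds))
            seeds = extract_terms(corpus, ref_freqs)
        return corpus, seeds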


Example 1: Pseudo-seizures in English
Baroni and Bernardini 2004

- Seed terms: dissociative, epilepsy, interventions, posttraumatic, pseudoseizures, ptsd.

- Reference: Brown (1.1M words).

- Corpus comparison: via Log Odds Ratio.

- Two iterations.

- 1.4M word corpus constructed, 1800 unigram terms extracted.

- 20/30 randomly selected documents from the corpus rated as relevant and informative.


Example 2: Hotel terminology in Japanese
Baroni and Ueyama 2004

- 20 manually selected initial terms.

- 3.5M word reference corpus built with BootCaT using random elementary Japanese words as seeds.

- Corpus comparison: via MI and Log Likelihood Ratio.

- Three iterations.

- 1.3M word corpus constructed, 424 terms extracted.

- 76/90 randomly selected documents assigned the highest relevance/informativeness rating.

- 58.4% of terms rated very relevant, 81.7% rated at least somewhat relevant.


Applications

- Languages: English, Italian, Japanese, Spanish, German, French, Russian, Chinese, Danish.

- Domains: medical, legal, meteorology, food, nautical terminology, (e-)commerce...

- Uses: technical translation, interpreting tasks, resources for LSP teaching, populating ontologies, expanding a lexicon in systematic ways, general corpus construction (Sharoff submitted).


Ongoing and planned work

- Special queries.

- Better character set handling.

- Better pdf/doc conversion.

- Better integration with UCS and other tools.

- Multi-term extraction.

- Yahoo API?


Pros

- We still rely on commercial search engines, but less so.

- We only use the most basic query function, which is less likely to change.

- Language filtering and good relevance-ranking are crucial characteristics of successful search engines.

- We are less likely to bother the engine by over-querying, since with one query we can obtain MBs of data.

- We have full control over the data (e.g. frequency counts, parsing, manual URL filtering) because we download them.


Cons

- We still rely on commercial search engines:
  - What happens if Google discontinues the API service?
  - What happens if Google does something too smart or too commercial with the page ranks?

- Good for content-driven corpus building; problems with syntax/style/genre-based filtering.

- Good for building small, targeted corpora (but see Sharoff’s – and Ciaramita’s? – work).

- Not for exploiting the vastness of the web-as-corpus directly.

Thinking Big: The “real” Linguist’s Search Engine

Biting the bullet...

- Crawling, cleaning, annotating, managing and maintaining your own indexed version of the web.
- Obviously, the “ideal” solution.
- But obviously a lot of work!


Build your own search engine

- Crawling.
- Post-processing (HTML/boilerplate stripping, language recognition, duplicate detection, “connected prose” recognition...).
- Linguistic processing.
- Categorization, meta-data.
- Indexing.
- Interfaces (the full pipeline is sketched below).
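As a skeleton, the whole pipeline might look like the following Python sketch. Every stage function is a toy placeholder standing in for a real tool (e.g. Heritrix for crawling, a POS tagger for linguistic processing); only the order of the stages is the point.

    def crawl(seeds):
        """Fetch raw pages starting from seed URLs (stubbed here)."""
        return ["<html>page crawled from %s</html>" % u for u in seeds]

    def postprocess(pages):
        """Strip HTML/boilerplate, filter by language, drop duplicates
        (here reduced to naive tag removal)."""
        return [p.replace("<html>", "").replace("</html>", "").strip()
                for p in pages]

    def annotate(docs):
        """Linguistic processing; here just whitespace tokenization."""
        return [doc.split() for doc in docs]

    def index(tokenized):
        """Toy inverted index: token -> set of document ids."""
        inv = {}
        for doc_id, tokens in enumerate(tokenized):
            for tok in tokens:
                inv.setdefault(tok, set()).add(doc_id)
        return inv

    seeds = ["http://example.org/a", "http://example.org/b"]
    inv = index(annotate(postprocess(crawl(seeds))))
    print(sorted(inv)[:5])  # an interface layer would query this index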


The huge web-corpus of Clarke and collaborators

- Terabyte crawl of the web in 2001.
- From an initial seed set of 2,392 (English?) educational URLs.
- No duplicates, not too many pages from the same site.
- No language filtering (a simple filter of the omitted kind is sketched below).
- 53 billion words, 77 million documents.
- (For comparison: the BNC has 100 million words, so this is roughly 530 BNCs; Google indexes 8 billion documents.)
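For contrast with the “no language filtering” choice, here is a minimal sketch of the kind of filter a linguist might want: score each page by the share of tokens that are high-frequency English function words. The stopword list and threshold are illustrative assumptions, not anyone’s published recipe.

    # Crude language recognition: English pages should be full of
    # English function words; pages in other languages should not be.
    ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "a", "is",
                         "that", "it", "for", "was", "on", "with"}

    def looks_english(text, threshold=0.25):
        words = text.lower().split()
        if not words:
            return False
        hits = sum(w in ENGLISH_STOPWORDS for w in words)
        return hits / len(words) >= threshold

    print(looks_english("the cat is on the mat and it is happy"))  # True
    print(looks_english("il gatto è sul tappeto ed è felice"))     # False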


The TOEFL synonym match test, again

- Target: levied; candidates: imposed, believed, requested, correlated.


WMI takes the TOEFL again (Terra and Clarke 2003)

- Performance on the TOEFL synonym match task:
  - Average foreign test taker: 64.5%
  - Latent Semantic Analysis: 65.4%
  - WMI: 72.5%
  - Terra & Clarke’s WMI: 81.25%
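To make the mutual-information approach concrete: score each candidate by its pointwise MI with the target and take the argmax. In this sketch the counts are invented toy numbers; in the web setting they would come from page-hit estimates or, as in Terra and Clarke’s work, from co-occurrence counts over a terabyte crawl.

    import math

    N = 1_000_000  # hypothetical total number of documents/windows

    def pmi(f_xy, f_x, f_y, n=N):
        """Pointwise MI: log P(x,y) / (P(x) * P(y)) from raw counts."""
        return math.log((f_xy * n) / (f_x * f_y))

    target_freq = 900  # f("levied") -- invented
    # (candidate, f(candidate), f(target, candidate)) -- invented
    candidates = [("imposed", 5000, 120), ("believed", 40000, 60),
                  ("requested", 30000, 45), ("correlated", 8000, 10)]

    best = max(candidates, key=lambda c: pmi(c[2], target_freq, c[1]))
    print(best[0])  # -> "imposed", the correct synonym for these counts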


Pros

- Independence from commercial search engines.
- Precious, multi-purpose resource.
- In principle, you can do what you want with it.


Cons

- A lot of work.
- Resource-intensive.
- In principle, you can do what you want with it...
- In practice, almost anything you want to do with a terabyte corpus will be extremely complicated.
- Forget about the “do it yourself with a Perl script” approach.


Outline

Introduction

Frequency estimates from search engines
  Web-based Mutual Information

The “linguists’ friendly” interfaces

Building your own web corpus
  Small corpora via search engine queries
  Thinking Big: The “real” Linguist’s Search Engine

Enter WaCky!


Enter WaCky!

- The Web-as-Corpus kool ynitiative.
- http://wacky.sslmit.unibo.it/
- The WaCky crowd: Marco, Massi, Silvia Bernardini, Stefan Evert, Bill Fletcher, Adam Kilgarriff...
- Yet Another Linguist’s Search Engine proposal (see also Kilgarriff 2003, Fletcher 2004).
- The WaCky philosophy: try to get something concrete out there very soon, so that others will feel motivated to contribute.
- Three 1-billion-word corpora (English, German, Italian) by spring 2006.
- Web interface(s) and an open-source toolkit.


Enter WaCky! (cont.)

- We must learn from IR and massive-dataset studies (e.g., near-duplicate detection, fast retrieval)...
- ...but there are important differences, for example:
  - We probably want all data, or perhaps random data, or even linguistically interesting data, not necessarily the most relevant data.
  - We care about (linguistic) form at least as much as about content.
- A new challenge in computational linguistics: the data are not given.


Enter WaCky! (cont.)

- Emphasis on:
  - Transparency;
  - Stability;
  - Pre-processing;
  - Categorization and annotation;
  - (Also) automated access;
  - Sophisticated query options.
- Not so important:
  - Access speed;
  - Updating;
  - Size;
  - Content-driven relevance.


The WaCkodules: Where We Are At

- Seeding the crawls: BNC/Google seeding experiments and Massi’s measures of randomness.
- Crawling: with Heritrix, the Internet Archive crawler.
- Post-processing: current focus on duplicate detection (sketched below).
- Linguistic annotation, meta-data: nothing yet.
- Indexing: Lucene vs. the newly open-source (!) IMS Corpus WorkBench.
- Interfaces: work by Stefan Evert.
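Since duplicate detection is the current post-processing focus, here is a minimal sketch of the textbook shingling approach from the IR literature: represent each document as its set of word n-grams and compare sets by Jaccard overlap. This illustrates the technique; it is not necessarily the exact method adopted for WaCky.

    def shingles(text, n=5):
        """Set of overlapping n-word sequences in a text."""
        words = text.lower().split()
        return {" ".join(words[i:i + n])
                for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        """Overlap between two shingle sets, in [0, 1]."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def near_duplicates(docs, threshold=0.8):
        """Index pairs of documents whose shingle overlap is high."""
        sets = [shingles(d) for d in docs]
        return [(i, j)
                for i in range(len(docs))
                for j in range(i + 1, len(docs))
                if jaccard(sets[i], sets[j]) >= threshold]

    docs = ["the cat sat on the mat and purred all day long",
            "the cat sat on the mat and purred all night long",
            "a completely unrelated text about crawling the web"]
    print(near_duplicates(docs, threshold=0.5))  # -> [(0, 1)]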


A few references

M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.
M. Baroni and S. Bisi. 2004. Using cooccurrence statistics and the web to discover synonyms in a specialized language. LREC 2004.
M. Banko and E. Brill. 2001. Scaling to very very large corpora for natural language disambiguation. ACL 2001.
W. Fletcher. 2004. Facilitating the compilation and dissemination of ad-hoc web corpora. Papers from TALC 2002.
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001.
F. Keller and M. Lapata. 2003. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 29.
A. Kilgarriff. 2003. Linguistic search engine. Corpus Linguistics 2003.
E. Terra and C. L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. HLT-NAACL 2003.
P. Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. ECML 2001.