INFO 4300 / CS4300 Information Retrieval

slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 3: Term Statistics and Discussion 1

Paul Ginsparg

Cornell University, Ithaca, NY

1 Sep 2010




Administrativa (tentative)

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/

Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11

Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452

Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment

Teaching Assistant: Saeed Abdullah, [email protected]

Course text at: http://informationretrieval.org/ Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307


Overview

1 Recap

2 Term Statistics

3 Discussion


Type/token distinction

Token – An instance of a word or term occurring in a document.

Type – An equivalence class of tokens.

In June, the dog likes to chase the cat in the barn.

12 word tokens, 9 word types
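The counts for the example sentence can be reproduced with a short Python sketch. It assumes a tokenizer that splits on any run of non-letter characters and uses case folding as the only equivalence-classing step; both choices, and the function names, are illustrative assumptions rather than anything prescribed by the slides.

```python
import re

def tokenize(text):
    """Split text into word tokens on any run of non-letter characters."""
    return re.findall(r"[A-Za-z]+", text)

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) under case folding."""
    tokens = tokenize(text)
    # Assumption: two tokens belong to the same type if they match after lowercasing.
    types = {token.lower() for token in tokens}
    return len(tokens), len(types)

sentence = "In June, the dog likes to chase the cat in the barn."
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, "word tokens,", n_types, "word types")
```

Without case folding, "In" and "in" would count as separate types, so the type count depends on how the equivalence classes are defined.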


Problems in tokenization

What are the delimiters? Space? Apostrophe? Hyphen?

For each of these: sometimes they delimit, sometimes they don’t.

No whitespace in many languages! (e.g., Chinese)

No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)


Problems in “equivalence classing”

A term is an equivalence class of tokens.

How do we define equivalence classes?

Numbers (3/20/91 vs. 20/3/91)

Case folding

Stemming, Porter stemmer

Morphological analysis: inflectional vs. derivational

Equivalence classing problems in other languages

More complex morphology than in English

Finnish: a single verb may have 12,000 different forms

Accents, umlauts


How big is the term vocabulary?

That is, how many distinct words are there?

Can we assume there is an upper bound?

Not really: At least $70^{20} \approx 10^{37}$ different words of length 20.

The vocabulary will keep growing with collection size.

Heaps’ law: $M = kT^b$

M is the size of the vocabulary, T is the number of tokens in the collection.

Typical values for the parameters k and b are: $30 \le k \le 100$ and $b \approx 0.5$.

Heaps’ law is linear in log-log space.

It is the simplest possible relationship between collection size and vocabulary size in log-log space.

Empirical law


Power Laws in log-log space

$y = c x^k$ ($k = 1/2, 1, 2$): $\log_{10} y = k \log_{10} x + \log_{10} c$

[Figure: sqrt(x), x, and x**2 plotted on linear axes (0 to 100) and on log-log axes (1 to 100).]


Model collection: The Reuters collection

symbol   statistic                                              value
N        documents                                              800,000
L        avg. # word tokens per document                        200
M        word types                                             400,000
         avg. # bytes per word token (incl. spaces/punct.)      6
         avg. # bytes per word token (without spaces/punct.)    4.5
         avg. # bytes per word type                             7.5
T        non-positional postings                                100,000,000

1 GB of text sent over Reuters newswire 20 Aug ’96 – 19 Aug ’97


Heaps’ law for Reuters

[Figure: log-log plot of vocabulary size against collection size for Reuters-RCV1; x-axis $\log_{10} T$, y-axis $\log_{10} M$.]

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line $\log_{10} M = 0.49 \log_{10} T + 1.64$ is the best least squares fit. Thus, $M = 10^{1.64}\, T^{0.49}$, with $k = 10^{1.64} \approx 44$ and $b = 0.49$.

$M = kT^b = 44\, T^{0.49}$
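The least squares fit in log-log space can be sketched as follows. This is a minimal illustration, not the procedure used to produce the Reuters figure: the function name `fit_heaps` is hypothetical, and the (T, M) pairs are synthetic values generated from $M = 44\, T^{0.49}$, not real Reuters-RCV1 measurements.

```python
import math

def fit_heaps(tokens_seen, vocab_sizes):
    """Fit a least-squares line to (log10 T, log10 M) and return (k, b)."""
    xs = [math.log10(t) for t in tokens_seen]
    ys = [math.log10(m) for m in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the line = Heaps exponent b
    intercept = mean_y - b * mean_x    # intercept = log10 k
    return 10 ** intercept, b

# Synthetic data (assumption): points lying exactly on M = 44 * T^0.49.
T_values = [10 ** e for e in range(2, 9)]
M_values = [44 * t ** 0.49 for t in T_values]

k, b = fit_heaps(T_values, M_values)
print(f"k = {k:.2f}, b = {b:.2f}")
```

Because the slope of the fitted line is b and the intercept is $\log_{10} k$, exact power-law data returns the parameters it was generated with; real collections scatter around the line.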


Empirical fit for Reuters

Good, as we just saw in the graph.

Example: for the first 1,000,020 tokens Heaps’ law predicts 38,323 terms:

$44 \times 1{,}000{,}020^{0.49} \approx 38{,}323$

The actual number is 38,365 terms, very close to the prediction.

Empirical observation: fit is good in general.
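The prediction can be checked numerically. The sketch below uses the fitted Reuters parameters k = 44 and b = 0.49 and the observed count of 38,365 terms from the slide; the function name `heaps_vocabulary` is a hypothetical helper.

```python
def heaps_vocabulary(num_tokens, k=44.0, b=0.49):
    """Vocabulary size predicted by Heaps' law, M = k * T^b."""
    return k * num_tokens ** b

predicted = heaps_vocabulary(1_000_020)
actual = 38_365  # observed number of terms, as quoted on the slide
relative_error = abs(predicted - actual) / actual

print(f"predicted = {predicted:.0f}, actual = {actual}, "
      f"relative error = {relative_error:.2%}")
```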


Exercise

1 What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps’ law?

2 Compute vocabulary size M

Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.

Assume a search engine indexes a total of 20,000,000,000 ($2 \times 10^{10}$) pages, containing 200 tokens on average.

What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law?
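One way to approach part 2 is to solve for k and b from the two observations and then extrapolate. The sketch below assumes both observations lie exactly on a single Heaps curve $M = kT^b$; the function name is hypothetical.

```python
import math

def heaps_from_two_points(t1, m1, t2, m2):
    """Solve M = k * T^b for (k, b) given two (T, M) observations."""
    # Dividing the two equations eliminates k: m2 / m1 = (t2 / t1)^b.
    b = math.log(m2 / m1) / math.log(t2 / t1)
    k = m1 / t1 ** b
    return k, b

k, b = heaps_from_two_points(10_000, 3_000, 1_000_000, 30_000)

# Total tokens in the indexed collection: pages * average tokens per page.
total_tokens = 20_000_000_000 * 200
predicted_vocab = k * total_tokens ** b

print(f"k = {k:.1f}, b = {b:.2f}, predicted M = {predicted_vocab:.3g}")
```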


Zipf’s law

Now we have characterized the growth of the vocabulary in collections.

We also want to know how many frequent vs. infrequent terms we should expect in a collection.

In natural language, there are a few very frequent terms and very many very rare terms.

Zipf’s law (linguist/philologist George Zipf, 1935): The i-th most frequent term has frequency proportional to 1/i.

$\mathrm{cf}_i \propto \frac{1}{i}$

$\mathrm{cf}_i$ is collection frequency: the number of occurrences of the term $t_i$ in the collection.


http://en.wikipedia.org/wiki/Zipf’s_law

Zipf’s law: the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc.

Brown Corpus:

“the”: 7% of all word occurrences (69,971 of slightly over 1M).

“of”: ∼3.5% of words (36,411)

“and”: 2.9% (28,852)

Only 135 vocabulary items account for half the Brown Corpus.

The Brown University Standard Corpus of Present-Day American English is a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources . . . for many years among the most-cited resources in the field.
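The three Brown Corpus counts quoted above can be compared with what a pure 1/i law would predict from the top-ranked word. This is only a rough sketch on three data points; the function name `zipf_ratios` is hypothetical.

```python
# (word, count) pairs for ranks 1-3, as quoted above for the Brown Corpus.
brown_counts = [("the", 69_971), ("of", 36_411), ("and", 28_852)]

def zipf_ratios(ranked_counts):
    """Return observed / predicted count per rank, with predicted = cf_1 / rank."""
    top_count = ranked_counts[0][1]
    return [count / (top_count / rank)
            for rank, (_, count) in enumerate(ranked_counts, start=1)]

ratios = zipf_ratios(brown_counts)
for (word, count), ratio in zip(brown_counts, ratios):
    print(f"{word:>4}: observed {count}, observed/predicted = {ratio:.2f}")
```

A ratio of 1 means the observed count matches the 1/i prediction exactly; values above 1 mean the word is more frequent than the law predicts.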


Zipf’s law

Zipf’s law: The i-th most frequent term has frequency proportional to 1/i.

$\mathrm{cf}_i \propto \frac{1}{i}$

cf is collection frequency: the number of occurrences of the term in the collection.

So if the most frequent term (the) occurs $\mathrm{cf}_1$ times, then the second most frequent term (of) has half as many occurrences, $\mathrm{cf}_2 = \frac{1}{2}\mathrm{cf}_1$ . . .

. . . and the third most frequent term (and) has a third as many occurrences, $\mathrm{cf}_3 = \frac{1}{3}\mathrm{cf}_1$, etc.

Equivalent: $\mathrm{cf}_i = c\, i^k$ and $\log \mathrm{cf}_i = \log c + k \log i$ (for $k = -1$)

Example of a power law


Power Laws in log-log space

$y = c x^{-k}$ ($k = 1/2, 1, 2$): $\log_{10} y = -k \log_{10} x + \log_{10} c$

[Figure: 100/sqrt(x), 100/x, and 100/x**2 plotted on linear axes (0 to 100) and on log-log axes (1 to 100).]


Zipf’s law for Reuters

[Figure: log-log plot of collection frequency against rank for Reuters; x-axis $\log_{10}$ rank, y-axis $\log_{10}$ cf.]

Fit far from perfect, but nonetheless key insight: Few frequent terms, many rare terms.


more from http://en.wikipedia.org/wiki/Zipf’s_law

“A plot of word frequency in Wikipedia (27 Nov 2006). The plot is in log-log coordinates. x is rank of a word in the frequency table; y is the total number of the word’s occurrences. Most popular words are “the”, “of” and “and”, as expected. Zipf’s law corresponds to the upper linear portion of the curve, roughly following the green (1/x) line.”


Power laws more generally

E.g., consider power law distributions of the form $c\, r^{-k}$, describing the number of book sales versus sales-rank r of a book, or the number of Wikipedia edits made by the r-th most frequent contributor to Wikipedia.

Amazon book sales: $c\, r^{-k}$, $k \approx 0.87$

number of Wikipedia edits: $c\, r^{-k}$, $k \approx 1.7$

(More on power laws and the long tail here: Networks, Crowds, and Markets: Reasoning About a Highly Connected World by David Easley and Jon Kleinberg, Chpt 18: http://www.cs.cornell.edu/home/kleinber/networks-book/networks-book-ch18.pdf)


[Figure: Wikipedia edits/month and Amazon sales/week plotted against user or book rank r, on linear axes and on log-log axes; curves $40916 / r^{0.87}$ and $1258925 / r^{1.7}$.]

Normalization given by the roughly 1 sale/week for the 200,000th ranked Amazon title:

$40916\, r^{-0.87}$

and by the 10 edits/month for the 1000th ranked Wikipedia editor:

$1258925\, r^{-1.7}$

Long tail: about a quarter of Amazon book sales are estimated to come from the long tail, i.e., those outside the top 100,000 bestselling titles.
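The two normalization constants follow from solving $v = c\, r^{-k}$ for c at the quoted reference points. The sketch below reproduces them; the function name `power_law_constant` is a hypothetical helper.

```python
def power_law_constant(rank, value, k):
    """Solve value = c * rank^(-k) for the normalization constant c."""
    return value * rank ** k

# Amazon: roughly 1 sale/week at rank 200,000, exponent k = 0.87.
c_amazon = power_law_constant(200_000, 1.0, 0.87)

# Wikipedia: 10 edits/month at rank 1000, exponent k = 1.7.
c_wikipedia = power_law_constant(1_000, 10.0, 1.7)

print(f"Amazon c = {c_amazon:.0f}, Wikipedia c = {c_wikipedia:.0f}")
```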


Another Wikipedia count (15 May 2010)

http://imonad.com/seo/wikipedia-word-frequency-list/

All articles in the English version of Wikipedia, 21GB in XML format (five hours to parse entire file, extract data from markup language, filter numbers, special characters, extract statistics):

Total tokens (words, no numbers): T = 1,570,455,731

Unique tokens (words, no numbers): M = 5,800,280


“Word frequency distribution follows Zipf’s law”


rank 1–50 (86M-3M), stop words (the, of, and, in, to, a, is, . . .)

rank 51–3K (2.4M-56K), frequent words (university, January, tea, sharp, . . .)

rank 3K–200K (56K-118), words from large comprehensive dictionaries (officiates, polytonality, neologism, . . .)

above rank 50K mostly Long Tail words

rank 200K–5.8M (117-1), terms from obscure niches, misspelled words, transliterated words from other languages, new words and non-words (euprosthenops, eurotrochilus, lokottaravada, . . .)


Some selected words and associated counts

Google 197920

Twitter 894

domain 111850

domainer 22

Wikipedia 3226237

Wiki 176827

Obama 22941

Oprah 3885

Moniker 4974

GoDaddy 228


Project Gutenberg (per billion)

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg

Over 36,000 items (Jun 2011), average of > 50 new e-books / week

http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000

the 56271872

of 33950064

and 29944184

to 25956096

in 17420636

I 11764797

that 11073318

was 10078245

his 8799755

he 8397205

it 8058110

with 7725512

is 7557477

for 7097981

as 7037543

had 6139336

you 6048903

not 5741803

be 5662527

her 5202501

. . . 100, 000th
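The twenty counts listed above are enough for a rough check of Zipf's law: fit a line to (log rank, log count) and compare the slope with the value of -1 that the law predicts. This is a sketch on a very small sample, and the function name `loglog_slope` is hypothetical.

```python
import math

# Per-billion counts for ranks 1-20, copied from the Project Gutenberg list above.
gutenberg_counts = [
    56271872, 33950064, 29944184, 25956096, 17420636,
    11764797, 11073318, 10078245, 8799755, 8397205,
    8058110, 7725512, 7557477, 7097981, 7037543,
    6139336, 6048903, 5741803, 5662527, 5202501,
]

def loglog_slope(counts):
    """Least-squares slope of log10(count) against log10(rank), ranks from 1."""
    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

slope = loglog_slope(gutenberg_counts)
top_two_ratio = gutenberg_counts[0] / gutenberg_counts[1]
print(f"log-log slope = {slope:.2f}, cf_1 / cf_2 = {top_two_ratio:.2f}")
```

A slope near -1 and a cf_1 / cf_2 ratio near 2 would indicate close agreement with the pure 1/i form; the head of a frequency list often deviates from it.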


Discussion 1

Objective: explore three information retrieval systems (Bing, LOC, PubMed), and use each for the discovery task: “What is the medical evidence that vaccines can cause autism?”

Some general questions and observations:

How to authenticate the information?

Is the information up to date? (how to find updated info?)

In what order are items returned? (by “relevance”, but how is relevance determined: link analysis? tf.idf?)

Use results of Bing search to refine vocabulary

Assignment: as a test of CMS, everyone uploads the best reference found and an outline of the strategy used to find it
