CS 572: Information Retrieval Web 2.0+: Social Media/CGC

CS 572: Information Retrieval - Emory University (eugene/cs572/lectures/lecture19-social_media.pdf)


Page 1:

CS 572: Information Retrieval

Web 2.0+: Social Media/CGC

Page 2:

Status

• Final project: updated proposal due by today/tomorrow

• Questions? Discuss at end of class.

3/28/2016 CS 572: Information Retrieval. Spring 2016 2

Page 3:

Eugene Agichtein, Emory University, IR Lab

Page 4:

• Collaboratively Generated Content: CGC

– Creation models, quality

• Twitter/Weibo

• Searching:

– Indexing

– Ranking


Page 5:

Estimates of daily content creation [Ramakrishnan & Tomkins, IEEE Computer 2007]

Content type                                                Amount produced / day
Published (books, magazines, newspapers)                    3-4 GB
Professional Web (paid creation, e.g., corporate Web site)  2 GB
User-generated (reviews, blogs, personal Web sites)         8-10 GB
Private text (instant messages, email)                      3000 GB (= 3 TB)

Page 6:

The value of CGC

• Extrinsic value – impact on AI, IR and NLP applications
• Intrinsic value – resources for the people
  – How to make the data more useful (e.g., increase user engagement)
• Synergistic view of CGC
  – AI techniques can be used to improve the quality of CGC repositories,
  – which can in turn improve the performance of other AI applications

Page 7:

Extrinsic value: sources of knowledge

Collaboratively Generated Content (After):
• 3,600,000 articles (since 2001)
• 2,500,000 entries (since 2002)
• 1,000,000,000 Q&A (since 2005)
• 4,000,000,000 images (since 2004)

Before:
• 65,000 articles (since 1768)
• 150,000 entries (since 1985) – WordNet
• 4,600,000 assertions (since 1984) – CYC

Page 8:

Extrinsic value: what we can do with all this knowledge

• Under the hood
  – CGC as an additional (huge!) corpus
    • Better term statistics, observe the document authoring process
  – Repositories of knowledge
    • Concepts as features (BOW Bag of Words + Concepts)
    • Concepts as word senses (way beyond WordNet)
      – More comprehensive lexicons, gazetteers
      – Extending existing knowledge repositories (e.g., WordNet)
• Higher-level tasks
  – (Concept-based) information retrieval
  – Question answering

Page 9:

(Text) Social Media Today

Published: 4 GB/day; Social Media: 10 GB/day
Yahoo! Answers: 120M users, 40M questions, 1B answers

Yes, we could read your blog. Or, you could tell us about your day.

Page 10:

Models for CGC generation: Overview

• Wikipedia: contribution, curation, and coordination

– http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth

• Community forums: Yahoo! Answers, focused forums

• Sharing: Twitter

10

Page 11:

Models of CGC Generation: Link Creation
“Rich-Get-Richer” (for a while ...)

Preferential attachment: more popular Wikipedia pages are more likely to acquire links (up to a point!) [Capocci et al. ‘06]

– Flickr and other sites also show strong evidence of preferential attachment [Mislove et al. ‘08]

[Figure: new links added daily vs. popularity]

Page 12:

Models of CGC Generation: Edits
“Rich-Get-Richer” (again, up to a point)

Popular Wiki pages (> 1000 edits): editing activity initially increases, then drops off.

Later on, arriving visitors are less able to contribute novel information. [Das and Magdon-Ismail ‘10]

Page 13:

Models of CGC generation: Slowing Growth [Suh et al., WikiSym 2009]

• Article growth (articles per month) extrapolated to slow down dramatically by 2013
• Increased coordination and overhead, resistance to new edits

Page 14:

Models for CGC generation: Wikipedia conflict and coordination [Kittur & Kraut, 2008]

Explicit coordination: discussion page
Implicit coordination: leading the effort

Example: “Music of Italy” page – the bulk of edits are made by TUF-KAT and Jeffmatt.

Page quality increases with the number of editors iff effort is concentrated; otherwise, more editors mean higher overhead and lower quality.

Page 15:

Models of CGC Generation: Content Visibility and Evolution

• Dynamics of votes of select Digg stories over a period of 4 days [Lerman ‘07]
• Dynamics of Y! Answers posts over a period of 4 days [Aji et al. ‘10]

Page 16:

Models for CGC Generation: Forums
Info seekers vs. info providers

• In Yahoo! Answers, question and answer quality are strongly related: bad questions attract bad or mediocre answers [Agichtein et al. ’08]
• Askers (Java forums) prefer help from users with just slightly more expertise [Zhang et al. ‘07]

Page 17:

Lots of content... Do we trust it?

3/28/2016 CS190: Web Science and Technology, 2010 17

On the Internet, nobody knows you are a dog.

Page 18:

CGC Quality Filtering (Overview)

• Multiple dimensions of CGC content to exploit:

– Content persistence (Wikipedia)

– Surface quality (language-wise): all CGC

– Metadata: Author reliability, geo-spatial edit patterns

– Citation/link-based estimation of authority

• Most effective systems use a combination of all of the above

18

Page 19:

Content-Driven Author Reputation [Adler & Alfaro, WWW 2007]

Page 20:

Automatic Vandalism Detection

• Hoaxes

• Libelous content

• Malware distribution

• Errors picked up by mainstream media!

• Wikipedia “truthiness”

Vandalized Ben Franklin Wikipedia page

20

Page 21:

Assigning Trust to Wikipedia Text

• Estimating text trust from edit history information.

• Trust is computed as:

1) Insertion: trust value for newly inserted text (E)

2) Edge effect: the text at the edge of blocks has the same trust as newly inserted text.

3) Revision: text trust may increase, if author reputation gets higher

[Adler et al., WikiSym 2008]
Online demo + code available: http://wikitrust.soe.ucsc.edu/

[Figure: “Before” and “After” views of text trust]

Page 22:

NLP-based Vandalism Detection

• Use language modeling to predict edit quality

– Lexical: vocabulary, regular expression/lists

– Syntactic: N-gram analysis using part-of-speech tags

– Semantic: N-gram analysis with hand-picked words

[Wang et al., COLING 2010]

Figure adapted from Andrew G. West, 2010

22
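The lexical feature class above can be sketched as follows; the word list and regular expressions here are hypothetical stand-ins, not the actual features of Wang et al.

```python
import re

# Hypothetical lexical cues in the spirit of vocabulary/regular-expression
# features: vulgarity lists, shouting, and exaggerated character repetition.
VULGARITY = re.compile(r"\b(stupid|dumb|sucks)\b", re.IGNORECASE)
SHOUTING = re.compile(r"\b[A-Z]{5,}\b")
CHAR_REPEAT = re.compile(r"(.)\1{4,}")  # e.g. "loooool"

def lexical_features(inserted_text):
    """Feature vector for the text inserted by an edit."""
    return {
        "vulgar_hits": len(VULGARITY.findall(inserted_text)),
        "shouting_hits": len(SHOUTING.findall(inserted_text)),
        "char_repeat": bool(CHAR_REPEAT.search(inserted_text)),
        "upper_ratio": sum(c.isupper() for c in inserted_text) / max(len(inserted_text), 1),
    }

print(lexical_features("WIKIPEDIA SUCKS loooool"))
```

A real detector would feed such features, together with syntactic and semantic n-gram scores, into a classifier.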

Page 23:

Comparison of Vandalism Detection Approaches: Shared tasks [Potthast et al. ‘10]

• Shared vandalism corpus, 32,000 edits
• Best single approach: NLP-based
• Meta-detector: best
  – Task requires different classes of features.
• CLEF 2010 Competition on Vandalism Detection:
  – Best result: AUC 0.92, Precision = 100% at Recall = 20%

Page 24:

Finding High Quality Content in SM

• Well-written
• Interesting
• Relevant (answer)
• Factually correct
• Popular?
• Provocative?
• Useful?

As judged by professional editors.

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media, in WSDM 2008

Page 25:


Page 26:

How do Question and Answer Quality relate?

Page 27:


Page 28:


Page 29:


Page 30:


Page 31:

Community

Page 32:

Link Analysis for Authority Estimation [Jurzcyk & Agichtein, CIKM 2007]

[Figure: bipartite graphs linking Users 1-6 through Questions 1-3 and Answers 1-6]

Hub (asker): H(i) = Σ_{j=0..K} A(j)
Authority (answerer): A(j) = Σ_{i=0..M} H(i)
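The hub/authority updates on the asker-answerer graph can be sketched with a standard HITS iteration (a minimal version on a toy graph; the user names are illustrative):

```python
import math
from collections import defaultdict

def hits(edges, iterations=50):
    """HITS on a directed asker -> answerer graph: an edge (u, v) means user u
    asked a question that user v answered. Hub score ~ asker quality,
    authority score ~ answerer quality."""
    nodes = {u for edge in edges for u in edge}
    hub = {n: 1.0 for n in nodes}
    in_edges, out_edges = defaultdict(list), defaultdict(list)
    for u, v in edges:
        out_edges[u].append(v)
        in_edges[v].append(u)
    auth = {}
    for _ in range(iterations):
        # authority: sum of hub scores of users whose questions you answered
        auth = {n: sum(hub[u] for u in in_edges[n]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub: sum of authority scores of users who answered your questions
        hub = {n: sum(auth[v] for v in out_edges[n]) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# user3 answers questions from two different askers, so it tops the authorities
hub, auth = hits([("user1", "user3"), ("user2", "user3"), ("user1", "user4")])
print(max(auth, key=auth.get))
```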

Page 33:

Qualitative Observations

[Figure: examples where HITS is effective vs. ineffective]

Page 34:

Random forest classifier

Page 35:

Result 1: Identifying High Quality Questions


Page 36:

Top Features for Question Classification

• Asker popularity (“stars”)

• Punctuation density

• Topical category

• Page views

• KL Divergence from reference corpus LM

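The last feature can be sketched with a toy unigram language model (the smoothing constant and the tiny reference text here are illustrative; the real feature used a large reference corpus):

```python
import math
from collections import Counter

def kl_divergence(text, reference, eps=1e-9):
    """KL divergence between a unigram LM of `text` and one estimated from a
    reference corpus. Higher values mean the question's language deviates
    more from typical well-formed text."""
    p, q = Counter(text.lower().split()), Counter(reference.lower().split())
    p_total, q_total = sum(p.values()), sum(q.values())
    return sum(
        (c / p_total) * math.log((c / p_total) / (q.get(w, 0) / q_total + eps))
        for w, c in p.items()
    )

reference = "how do i install python on windows and how do i run a script"
print(kl_divergence("how do i install python", reference))
print(kl_divergence("lol wut ???", reference))  # deviates far from the reference LM
```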

Page 37:

Identifying High Quality Answers


Page 38:

Top Features for Answer Classification

• Answer length

• Community ratings

• Answerer reputation

• Word overlap

• Kincaid readability score

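The readability feature can be computed with the standard Flesch-Kincaid grade-level formula; the syllable counter below is a crude vowel-group heuristic, so treat the numbers as approximate:

```python
import re

def count_syllables(word):
    # crude heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(round(kincaid_grade("The cat sat on the mat."), 2))
```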

Page 39:

CGC as a corpus

• Better statistics on a larger or cleaner text collection

– Compute IDF on the entire Wikipedia corpus

• Beware of different style / topic distribution

– Cf. Web as a corpus (Mihalcea ’04, Gonzalo et al. ’03)

• Web corpus statistics, such as counts of search results

– Cf. using very large corpora in NLP (Banko and Brill ‘01a, ‘01b)

39

Page 40:

CGC as a corpus (cont’d)

• Extend existing knowledge repositories (e.g., WordNet)

• Create new datasets, labeled by construction

– WSD datasets using OpenMind – Chklovski & Mihalcea ‘02, ‘03

– WSD datasets based on Wikipedia – Mihalcea’07; Bunescu & Pasca ‘06; Cucerzan ’07

– Text categorization datasets based on the ODP (Davidov et al. ‘04)

• Study the authoring process by observing multiple revisions of a document

40

Page 41:

CGC as corpus / Extending WordNet

Page 42:

Using corpus statistics for term weighting

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. WSDM 2010
  – Periodic crawls (starting from an unknown point in a document’s life) vs. formally tracked revision history
  – LM-specific vs. general term frequency model, which can be plugged into any retrieval model
  – Aji et al. ’10 also model bursts – “significant” events in the document’s life
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST 2010
  – Temporal behavior of terms as the entire collection changes over time

CGC as corpus / Document revisions

Page 43:

Extending term weighting models with Revision History Analysis (RHA) [Aji et al. ’10]

• RHA captures term importance from the document authoring process.

• Natural integration with state of the art retrieval models (BM25, LM)

• Consistent improvement over existing retrieval models

• Beyond Wikipedia

– Tracking changes in MS Word documents; version control

– Past snapshots of Web pages

(Elsas & Dumais ’10 constructed a new corpus)

CGC as corpus / Document revisions

Page 44:

Observable document generation process
Example: revisions of “Topology” in Wikipedia

[Figure: snapshots of the 1st, 250th, and current revisions]

CGC as corpus / Document revisions

Page 45:

Revision History Analysis redefines Term Frequency (TF)

• TF is a key indicator of document relevance
• TF can be naturally integrated into ranking models

BM25:
S(Q, D) = Σ_{t∈Q} IDF(t) · TF(t, D) · (k1 + 1) / (TF(t, D) + k1 · (1 − b + b · |D| / avgdl))

Language Model (KL divergence):
S(Q, D) = −D(Q ‖ D) = −Σ_{t∈V} P(t|Q) · log(P(t|Q) / P(t|D))

CGC as corpus / Document revisions
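The BM25 formula can be written out directly (a minimal in-memory sketch: `docs` stands in for the indexed collection, and the smoothed IDF variant used here is one common choice):

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score = sum over query terms t of
    IDF(t) * TF(t,D) * (k1 + 1) / (TF(t,D) + k1 * (1 - b + b * |D| / avgdl))."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in docs if t in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF variant
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

docs = [["wikipedia", "revision", "history"], ["topology", "math"], ["cat", "pet"]]
print(bm25_score(["revision"], docs[0], docs))
```

Swapping TF(t, D) for a revision-aware TF is exactly the integration point RHA uses.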

Page 46:

RHA global model

Define the term frequency based on the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions

TF_global(t, d) = Σ_{j=1..n} c(t, v_j) / j^α

where c(t, v_j) is the frequency of term t in revision v_j, α is the decay factor, and the sum iterates over the n revisions.

CGC as corpus / Document revisions
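The global model is a one-liner once revisions are available (revisions are represented as token lists; the page and terms below are illustrative):

```python
def tf_global(term, revisions, alpha=1.0):
    """TF_global(t, d) = sum over revisions j of c(t, v_j) / j^alpha,
    where c(t, v_j) counts `term` in the j-th revision. Early revisions
    (small j) decay least, so early terms look most important."""
    return sum(rev.count(term) / (j ** alpha)
               for j, rev in enumerate(revisions, start=1))

revisions = [
    ["topology", "mathematics"],              # revision 1
    ["topology", "mathematics", "geometry"],  # revision 2
    ["topology", "geometry", "manifold"],     # revision 3
]
# "topology" appears from the very first revision; "manifold" only in the last
print(tf_global("topology", revisions), tf_global("manifold", revisions))
```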

Page 47:

RHA global model: example

[Figure: decayed term importance across revisions, with decay factor α]

CGC as corpus / Document revisions

Page 48:

But some pages are different: “Avatar (2009 film)” – a very bursty editing process

[Figure: snapshots of the 1st, 500th, and current revisions]

CGC as corpus / Document revisions

Page 49:

RHA burst model

• A burst resets the decay clock for a term
• The weight decreases again after a burst

TF_burst(t, d) = Σ_{j=1..m} Σ_{k≥b_j} c(t, v_k) / (k − b_j + 1)^β

where the outer sum iterates over the m bursts, the inner sum over the revisions within each burst (burst j starts at revision b_j), c(t, v_k) is the frequency of term t in revision v_k, and β is the decay factor.

Burst detection:
– Content-based (relative content change)
– Activity-based (intensive edit activity)
– Both models combined

CGC as corpus / Document revisions
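A sketch of the burst model, assuming burst boundaries are already detected and supplied as 1-based starting revision indices; reading the inner sum as running until the next burst begins is my interpretation of the slide, not something it states explicitly:

```python
def tf_burst(term, revisions, burst_starts, beta=1.0):
    """TF_burst(t, d): each burst j starting at revision b_j resets the decay
    clock, so revision k in that burst contributes c(t, v_k) / (k - b_j + 1)^beta."""
    bounds = list(burst_starts) + [len(revisions) + 1]
    score = 0.0
    for j, b_j in enumerate(burst_starts):
        for k in range(b_j, bounds[j + 1]):   # revisions within the j-th burst
            c = revisions[k - 1].count(term)  # revisions are 1-indexed as v_k
            score += c / ((k - b_j + 1) ** beta)
    return score

revisions = [["avatar"], ["avatar", "film"], ["avatar", "box", "office"]]
# a single burst covering everything vs. a fresh burst starting at revision 3:
print(tf_burst("avatar", revisions, burst_starts=[1]))
print(tf_burst("avatar", revisions, burst_starts=[1, 3]))  # the reset raises late weights
```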

Page 50:

RHA burst model: example

[Figure: term importance over revisions, with a per-burst decay factor]

CGC as corpus / Document revisions

Page 51:

Putting it all together: RHA Term Frequency

Combining the global model, the burst model, and the original TF:

TF_rha(t, D) = λ1 · TF_g(t, D) + λ2 · TF_b(t, D) + λ3 · TF(t, D),  with λ1 + λ2 + λ3 = 1

CGC as corpus / Document revisions

Page 52:

Integrating RHA into retrieval models

BM25 + RHA: replace TF(t, D) with TF_rha(t, D):
S(Q, D) = Σ_{t∈Q} IDF(t) · TF_rha(t, D) · (k1 + 1) / (TF_rha(t, D) + k1 · (1 − b + b · |D| / avgdl))

Statistical Language Models + RHA: replace P(t|D) with the RHA term probability P_rha(t, D):
S(Q, D) = −D(Q ‖ D) = −Σ_{t∈V} P(t|Q) · log(P(t|Q) / P_rha(t, D))

Performance improvements: +2-7% (INEX), +1-4% (TREC)

CGC as corpus / Document revisions

Page 53:

Overview of Part 3

Use CGC as an additional (huge!) corpus
• Better term statistics
• Observing the authoring process

CGC as repositories of world knowledge
• Distilling knowledge from CGC structure and content
• Concepts as features (BOW Bag of Words + Concepts)
  – Semantic relatedness, and later IR
• Concepts as word senses
  – WSD

Page 54:

CGC as repositories of senses and concepts

• Articles / concepts as representation features
  – From the bag of words to the bag of words + concepts
  – Semantic relatedness of words (and texts)
  – Later: concept-based IR
• Articles / concepts as word senses
  – Word sense disambiguation

CGC as knowledge repositories

Page 55:

Semantic relatedness of words and texts

• How related are:
  – cat / mouse
  – Augean stables / golden apples of the Hesperides
  – 慕田峪 (Mutianyu) / 司马台 (Simatai)

• Used in many applications

– Information retrieval

– Word-sense disambiguation

– Error correction (“Web site” or “Web sight”)

CGC as knowledge / Semantic relatedness

Page 56:

Evaluation criterion: correlation with human judgments

Word pair              Human score [0 .. 10]
journey / voyage       9.29
Jerusalem / Israel     8.46
money / property       7.57
football / tennis      6.63
alcohol / chemistry    5.54
coast / hill           4.38
professor / cucumber   0.23

CGC as knowledge / Semantic relatedness
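Agreement with such judgments is typically reported as a rank correlation; a minimal Spearman implementation (no tie handling), run on the human scores above against a hypothetical system's scores:

```python
def spearman(xs, ys):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the per-item difference in rank (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [9.29, 8.46, 7.57, 6.63, 5.54, 4.38, 0.23]  # scores from the slide
system = [0.9, 0.8, 0.6, 0.7, 0.5, 0.3, 0.1]        # hypothetical measure
print(round(spearman(human, system), 3))
```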

Page 57:

Previous approaches

• WordNet-based measures
  – Using graph structure (path length, random walks), information content, glosses
  – Comprehensive review: Budanitsky & Hirst ‘06
  – Patwardhan et al. ‘03 – using semantic relatedness measures for WSD
    • Jiang-Conrath (WN structure + info content) performs on par with Adapted Lesk (gloss overlap)
    • All other WN-based measures are inferior to Adapted Lesk
• Co-occurrence based measures
  – Latent Semantic Analysis (Deerwester et al. ‘90)

CGC as knowledge / Semantic relatedness

Page 58:

The role of background knowledge

When humans judge semantic relatedness, they use massive amounts of background knowledge.

Can we endow computers with such a capability for knowledge-based reasoning?

CGC as knowledge / Semantic relatedness

Page 59:

Using Wikipedia for computing semantic relatedness

• Structure (links)
  – Strube & Ponzetto ’06 (WikiRelate)
  – Milne & Witten ’08 (WLM)
• Content (text)
  – Gabrilovich & Markovitch ’07 (ESA)
• Structure + content
  – Yeh et al. ’09 (WikiWalk)

CGC as knowledge / Semantic relatedness

Page 60:

Explicit Semantic Analysis (ESA) [Gabrilovich & Markovitch ‘06-’09]

• Wikipedia-based semantic representation
• Some ESA applications
  – Computing semantic relatedness of words and texts
  – Text categorization
  – Information retrieval
    • Cross-lingual retrieval utilizing cross-language links in Wikipedia

CGC as knowledge / Semantic relatedness

Page 61:

The AI Dream

[Diagram: Problem → Problem solver → Solution]

CGC concepts as features

Page 62:

The circular dependency problem

[Diagram: Problem → Problem solver → Solution, where the problem solver requires Natural Language Understanding]

CGC concepts as features

Page 63:

Circumventing the need for deep Natural Language Understanding

[Diagram: Problem → Problem solver → Solution, with Wikipedia-based semantics and statistical text processing in place of full Natural Language Understanding]

CGC concepts as features

Page 64:

Every Wikipedia article represents a concept

Example: Leopard

CGC concepts as features

Page 65:

Wikipedia can be viewed as an ontology – a collection of concepts

Examples: Leopard, Isaac Newton, Ocean

CGC concepts as features

Page 66:

The semantics of a word is a vector of its associations with Wikipedia concepts

(Cf. Bengio’s talk tomorrow on learning text and image representations)

CGC concepts as features

Page 67:

Word representation: 2D example

The semantics of a word is a point in the multi-dimensional space defined by the set of Wikipedia concepts.

[Figure: a word plotted in the plane spanned by Wikipedia concept A and Wikipedia concept B, with axes from 0.0 to 1.0]

CGC concepts as features

Page 68:

Text representation: 2D example

The semantics of a text fragment is the centroid of the vectors of the individual words.

[Figure: word 1 .. word 4 plotted in the plane spanned by Wikipedia concept A and Wikipedia concept B, with axes from 0.0 to 1.0]

CGC concepts as features
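The centroid construction can be sketched with sparse concept vectors; the two "Wikipedia concepts" and all weights below are toy values:

```python
def text_vector(text, word_vectors):
    """The vector of a text fragment is the centroid (average) of its words'
    concept vectors, each stored as a sparse dict: concept -> weight."""
    words = [w for w in text.lower().split() if w in word_vectors]
    centroid = {}
    for w in words:
        for concept, weight in word_vectors[w].items():
            centroid[concept] = centroid.get(concept, 0.0) + weight / len(words)
    return centroid

# toy concept space with hypothetical Wikipedia concepts
word_vectors = {
    "cat": {"Felis": 0.9, "Pet": 0.6},
    "mouse": {"Pet": 0.3, "Rodent": 0.8},
}
print(text_vector("cat mouse", word_vectors))
```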

Page 69:

“Cat”

[Figure: concepts associated with “Cat”: Leopard, Mouse, Veterinarian, Pets, Tennis ball, Elephant, House]

CGC concepts as features

Page 70:

Concept and word representation in Explicit Semantic Analysis [Gabrilovich & Markovitch ’09]

Concepts as weighted word profiles:
• Cat: Cat [0.95], Felis [0.74], Claws [0.61]
• Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
• Tom & Jerry: Animation [0.85], Mouse [0.82], Cat [0.67]

The word “Cat” as a vector of concepts: Cat [0.95], Panthera [0.92], Tom & Jerry [0.67]

CGC concepts as features

Page 71:

Computing semantic relatedness as cosine between two vectors

• One vector: Cat [0.95], Panthera [0.92], Tom & Jerry [0.67]
• The other: Leopard [0.84], Mac OS X v10.5 [0.61], Cat [0.52]

CGC concepts as features
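Relatedness is then the cosine of two such sparse vectors; the vector contents below are taken from the slide's example, so treat them as illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between sparse concept vectors (dict: concept -> weight)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

cat = {"Cat": 0.95, "Panthera": 0.92, "Tom & Jerry": 0.67}
other = {"Leopard": 0.84, "Mac OS X v10.5": 0.61, "Cat": 0.52}
print(round(cosine(cat, cat), 2))    # identical vectors
print(round(cosine(cat, other), 2))  # vectors overlap only on "Cat"
```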

Page 72:

Experimental results (individual words)

Algorithm                                       Correlation with human judgments
WordNet-based                                   0.33–0.35
Roget’s Thesaurus                               0.55
LSA                                             0.56
WikiRelate                                      0.49
WLM                                             0.69
ESA (ODP)                                       0.65
ESA (Wikipedia)                                 0.75
TSA [Radinsky et al. ‘11]                       0.80
EWC (ESA/Wikipedia + WordNet + collocations)    0.79–0.87
  [Haralambous & Klyuev, forthcoming at IJCNLP’11]

CGC concepts as features

Page 73:

Using Wikipedia for Named Entity Disambiguation [Bunescu & Pasca ’06]

• Both the method and the dataset are based on Wikipedia
• Dataset: 1.7M labeled examples
  – ''The [[Vatican City | Vatican]] is now an independent enclave ...''
    (the geographical location, not the Holy See)

The basic approach: compute the cosine similarity between the word context and the target articles.

CGC for WSD

Page 74: CS 572: Information Retrieval - Emory Universityeugene/cs572/lectures/lecture19-social_media.pdf• Concepts as word senses (way beyond WordNet) –More comprehensive lexicons, gazetteers

Problems with the basic approach

1. Short articles and stubs

2. Synonymy (lexical mismatch between the word context and the article)

Solution: use categories for generalization


Word-Category Correlations

"John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood."

Which sense: John Williams (composer) or John Williams (wrestler)?

[Figure: category chains for the two candidate senses:
  John Williams (composer): Film score composers → Composers → Musicians
  John Williams (wrestler): Professional wrestlers → Wrestlers → People known in connection with sports and hobbies → People by occupation]

When a Wikipedia article is too short, get additional context from its categories.

The full solution [Bunescu & Pasca ’06]

• ML-based approach (SVM) to rank candidate senses

• Features:

– Cosine(word context, article text)

– Word-category affinities (binary features indicating if a word has occurred in any article of a given category)

• Dimensionality (|V| x |C|) can be reduced by thresholding word frequencies

• Accuracy improvement vs. cosine-only: 3%-25.5%
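The feature construction can be sketched as follows. The helper names, the tiny category inventory, and the (word, category) training pairs are hypothetical stand-ins for statistics mined from Wikipedia; an SVM would then rank candidate senses by these feature vectors.

```python
import math

def bag(words):
    """Term-frequency bag of words."""
    vec = {}
    for w in words:
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def cosine(u, v):
    dot = sum(c * v.get(t, 0.0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def features(context, article, article_cats, cat_inventory, word_cat_pairs):
    """One feature vector per (context, candidate sense) pair:
    cosine(context, article text), then a binary feature per
    (context word, category) pair seen in training that also
    belongs to a category of this candidate."""
    feats = [cosine(bag(context), bag(article))]
    for cat in cat_inventory:
        for w in context:
            active = (w, cat) in word_cat_pairs and cat in article_cats
            feats.append(1.0 if active else 0.0)
    return feats
```

Thresholding word frequencies, as the slide notes, would shrink the |V| x |C| binary feature space before training.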


Results [Bunescu & Pasca ’06]

• The classification model can be trained on Wikipedia, then applied to any input text
– Senses can only be resolved to those defined in Wikipedia

• Another idea to overcome the lexical mismatch: augment Wikipedia articles with contexts of corresponding word senses
– Cf. anchor text aggregation [Metzler et al. ’09] and [Cucerzan ‘07]

[Figure: Taxonomy Kernel]

An alternative approach to named entity disambiguation [Cucerzan ‘07]

• Applicable to any text, not solely that from Wikipedia
– Sense inventory still limited to Wikipedia

• The system performed both NE identification and disambiguation

• World knowledge used:
– Known entities (Wikipedia articles)
– Their class (Person, Location, Organization)
– Known surface forms (redirection pages + anchor text)
– Words that describe or co-occur with the entity
– Categories, lists, enumerations, tables

CGC for WSD

Restricted document representation: aggregation of features of all identified surface forms, for all entity names

[Figure: entities that can be referred to as “Columbia”]

Key idea: joint disambiguation

• Maximize the agreement
– between the entity vectors and the document, and
– between the categories of the entities

• Example: {Columbia, Discovery}
– Space Shuttle Columbia & Space Shuttle Discovery
– Columbia Pictures & Discovery Investigations
– LIST_astronomical_topics, CAT_manned_spacecraft, CAT_space_shuttles

• If the “one sense per discourse” assumption is violated, shrink the context iteratively
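The joint step can be sketched as a search over candidate assignments that maximizes document agreement plus pairwise category overlap. This is a brute-force toy, not the paper's actual scoring: the candidate lists, category sets, and scoring callbacks are hypothetical, and the iterative context shrinking is omitted.

```python
from itertools import product

def joint_disambiguate(candidates, doc_score, category_overlap):
    """candidates: one list of candidate entities per surface form.
    Choose the assignment maximizing per-entity agreement with the
    document plus pairwise category agreement between chosen entities."""
    best, best_score = None, float("-inf")
    for assignment in product(*candidates):
        score = sum(doc_score(e) for e in assignment)
        score += sum(category_overlap(a, b)
                     for i, a in enumerate(assignment)
                     for b in assignment[i + 1:])
        if score > best_score:
            best, best_score = assignment, score
    return best

# The {Columbia, Discovery} example from the slide, with toy categories
cands = [["Space Shuttle Columbia", "Columbia Pictures"],
         ["Space Shuttle Discovery", "Discovery Investigations"]]
cats = {"Space Shuttle Columbia": {"CAT_space_shuttles", "CAT_manned_spacecraft"},
        "Space Shuttle Discovery": {"CAT_space_shuttles", "CAT_manned_spacecraft"},
        "Columbia Pictures": {"CAT_film_studios"},
        "Discovery Investigations": {"CAT_companies"}}
best = joint_disambiguate(cands, lambda e: 0.0,
                          lambda a, b: len(cats[a] & cats[b]))
# best == ('Space Shuttle Columbia', 'Space Shuttle Discovery')
```

Exhaustive enumeration is exponential in the number of surface forms; the real system uses the restricted document representation above to keep the search tractable.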

Results [Cucerzan ‘07]

• Baseline: most frequent Wikipedia sense

• Wikipedia dataset (labeled by construction)
– 86.2% → 88.3%

• MSNBC news stories (human labeling)
– 51.7% → 91.4%
– Surface forms common in news (e.g., “Blackberry”, “China”) are highly ambiguous
– Popularity in Wikipedia ≠ popularity in news

Case Study: Searching Tweets


Twitter Statistics (2014)

• 284 million monthly active users

• 500 million tweets are sent per day

• More than 300 billion tweets have been sent since the company’s founding in 2006

• Tweets-per-second record: one-second peak of 143,199 TPS.

• More than 2 billion search queries per day.


Twitter Search Architecture


Active Index Segments: Write Friendly


Twitter Lucene Extensions


Twitter Lucene Extensions (cont’d)


Lucene Extensions: Storage


In-Memory Real Time Index


Revisit Old Idea: Probabilistic Skip Lists
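A probabilistic skip list keeps a sorted list searchable in O(log n) expected time by giving each node a random tower of forward pointers. Below is a minimal, generic sketch of the data structure itself, not Twitter's in-memory posting-list implementation:

```python
import random

class _Node:
    __slots__ = ("key", "next")
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] = successor at level i

class SkipList:
    """Probabilistic skip list: each node's height is geometric (p = 0.5),
    so higher levels skip over exponentially more nodes on average."""
    MAX_HEIGHT = 16

    def __init__(self):
        self.head = _Node(None, self.MAX_HEIGHT)  # sentinel, full height

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Record, per level, the rightmost node with key < new key
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = _Node(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def __contains__(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key
```

Because searches descend from sparse to dense levels, lookups and insertions touch O(log n) nodes in expectation, which is what makes the structure attractive for append-heavy real-time indexes.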


Top Entities (cont’d)


Resources

• CGC tutorial at WSDM 2012: http://www.mathcs.emory.edu/~eugene/wsdm2012-cgc-tutorial/

• Real-time Search at Twitter (talk): https://www.youtube.com/watch?v=cbksU0qkFm0

• Busch et al., ICDE 2012: http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

• Search at Twitter (slides): http://www.slideshare.net/lucidworks/search-at-twitter-presented-by-michael-busch-twitter