CS 572: Information Retrieval
Web 2.0+: Social Media/CGC
Status
• Final project: updated proposal due by today/tomorrow
• Questions? Discuss at end of class.
3/28/2016 CS 572: Information Retrieval. Spring 2016 2
• Collaboratively Generated Content: CGC
– Creation models, quality
• Twitter/Weibo
• Searching:
– Indexing
– Ranking
3/28/2016 CS 572: Information Retrieval. Spring 2016 4
5
Estimates of daily content creation [Ramakrishnan & Tomkins, IEEE Computer 2007]
Content type — Amount produced / day
Published (books, magazines, newspapers) — 3-4 Gb
Professional Web (paid creation, e.g., corporate Web site) — 2 Gb
User-generated (reviews, blogs, personal Web sites) — 8-10 Gb
Private text (instant messages, email) — 3000 Gb (= 3 Tb)
The value of CGC
• Extrinsic value – impact on AI, IR and NLP applications
• Intrinsic value – resources for the people
– How to make the data more useful (e.g., increase user engagement)
• Synergistic view of CGC
– AI techniques can be used to improve the quality of CGC repositories,
– which can in turn improve the performance of other AI applications
6
Extrinsic value: sources of knowledge
Before – traditional knowledge resources:
– 65,000 articles since 1768
– WordNet: 150,000 entries since 1985
– CYC: 4,600,000 assertions since 1984
After – Collaboratively Generated Content:
– 3,600,000 articles since 2001
– 2,500,000 entries since 2002
– 1,000,000,000 Q&A since 2005
– 4,000,000,000 images since 2004
7
Extrinsic value: what we can do with all this knowledge
• Under the hood
– CGC as an additional (huge!) corpus
• Better term statistics; observe the document authoring process
– Repositories of knowledge
• Concepts as features (bag of words + concepts)
• Concepts as word senses (way beyond WordNet)
– More comprehensive lexicons, gazetteers
– Extending existing knowledge repositories (e.g., WordNet)
• Higher-level tasks
– (Concept-based) information retrieval
– Question answering
8
(Text) Social Media Today
Published: 4 Gb/day; Social Media: 10 Gb/day
Yahoo! Answers: 120M users, 40M questions, 1B answers
Yes, we could read your blog. Or, you could tell us about your day.
9
Models for CGC generation: Overview
• Wikipedia: contribution, curation, and coordination
– http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth
• Community forums: Yahoo! Answers, focused forums
• Sharing: twitter
10
Models of CGC Generation: Link Creation
"Rich-Get-Richer" (for a while ...)
• Preferential attachment: more popular Wikipedia pages are more likely to acquire links (up to a point!) [Capocci et al. '06]
• Flickr and other sites also show strong evidence of preferential attachment [Mislove et al. '08]
[Figure: new links added daily vs. page popularity]
11
Models of CGC Generation: Edits
"Rich-Get-Richer" (again, up to a point)
• Popular Wiki pages (> 1000 edits): editing activity initially increases, then drops off
• Later on, arriving visitors are less able to contribute novel information
[Das and Magdon-Ismail '10]
12
Models of CGC Generation: Slowing Growth [Suh et al., WikiSym 2009]
• Article growth (#) per month, extrapolated to slow down dramatically by 2013
• Increased coordination and overhead, resistance to new edits
13
Models for CGC Generation: Wikipedia conflict and coordination [Kittur & Kraut, 2008]
• Explicit coordination: discussion page
• Implicit coordination: leading the effort
– Example: "Music of Italy" page – the bulk of edits are made by TUF-KAT and Jeffmatt
• Page quality increases with the number of editors iff the effort is concentrated; otherwise, more editors mean higher overhead and lower quality
14
Models of CGC Generation: Content Visibility and Evolution
• Dynamics of votes of select Digg stories over a period of 4 days [Lerman '07]
• Dynamics of Y! Answers posts over a period of 4 days [Aji et al. '10]
15
Models for CGC Generation: Forums – info seekers vs. info providers
• In Yahoo! Answers, question and answer quality are strongly related: bad questions attract bad or mediocre answers [Agichtein et al. '08]
• Askers (Java forums) prefer help from users with just slightly more expertise [Zhang et al. '07]
16
Lots of content... Do we trust it?
3/28/2016 CS190: Web Science and Technology, 2010 17
On the Internet, nobody knows you are a dog.
CGC Quality Filtering (Overview)
• Multiple dimensions of CGC content to exploit:
– Content persistence (Wikipedia)
– Surface quality (language-wise): all CGC
– Metadata: Author reliability, geo-spatial edit patterns
– Citation/link-based estimation of authority
• Most effective systems use a combination of all of the above
18
Content-Driven Author Reputation [Adler & de Alfaro, WWW 2007]
19
Automatic Vandalism Detection
• Hoaxes
• Libelous content
• Malware distribution
• Errors picked up by mainstream media!
• Wikipedia “truthiness”
Vandalized Ben Franklin Wikipedia page
20
Assigning Trust to Wikipedia Text
• Estimating text trust from edit history information.
• Trust is computed as:
1) Insertion: newly inserted text receives a trust value derived from its author's reputation
2) Edge effect: the text at the edge of edited blocks has the same trust as newly inserted text
3) Revision: text trust may increase if the author revising the page has a higher reputation
[Adler et al., WikiSym 2008]
Online demo and code are available: http://wikitrust.soe.ucsc.edu/
[Figure: text trust coloring before and after an edit]
21
NLP-based Vandalism Detection
• Use language modeling to predict edit quality
– Lexical: vocabulary, regular expressions/lists (see the sketch below)
– Syntactic: N-gram analysis using part-of-speech tags
– Semantic: N-gram analysis with hand-picked words
[Wang et al., COLING 2010]
Figure adapted from Andrew G. West, 2010
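A minimal sketch of the lexical feature class above; the word list, regular expressions, and the crude inserted-text diff are illustrative assumptions, not the features used by Wang et al.

    import re

    # Illustrative lexical cues only; real systems use curated vocabularies and regex lists.
    BAD_WORDS = {"stupid", "dumb", "hahaha"}

    def lexical_features(old_text, new_text):
        """Surface features of an edit, computed on (a crude approximation of) the inserted text."""
        inserted = new_text.replace(old_text, "")
        tokens = re.findall(r"[a-z']+", inserted.lower())
        n = max(len(inserted), 1)
        return {
            "bad_word_count": sum(t in BAD_WORDS for t in tokens),
            "upper_ratio": sum(c.isupper() for c in inserted) / n,
            "repeated_chars": 1 if re.search(r"(.)\1{4,}", inserted) else 0,  # e.g. "!!!!!"
            "size_change": len(new_text) - len(old_text),
        }

    print(lexical_features("Ben Franklin was a printer.",
                           "Ben Franklin was a printer. HAHAHA he was soooooo dumb!!!!!"))

Such surface features are then fed to a classifier alongside syntactic and semantic features.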
22
Comparison of Vandalism Detection Approaches: Shared tasks [Potthast et al. ‘10]
• Shared vandalism corpus, 32,000 edits
• Best single approach: NLP-based
• Meta-detector: best
– Task requires different classes of features
• CLEF 2010 Competition on Vandalism Detection:
– Best result: AUC 0.92, Precision = 100% at Recall = 20%
23
Finding High Quality Content in SM
• Well-written
• Interesting
• Relevant (answer)
• Factually correct
• Popular?
• Provocative?
• Useful?
As judged by professional editors
E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media, WSDM 2008
24
How do Question and Answer Quality relate?
Community
[Figures, slides 25–31]
Link Analysis for Authority Estimation
32
[Figure: bipartite graph linking Users 1–6 to Questions 1–3 and Answers 1–6]
$H(i) = \sum_{j=0}^{K} A(j)$   Hub (asker)
$A(j) = \sum_{i=0}^{M} H(i)$   Authority (answerer)
[Jurczyk & Agichtein, CIKM 2007]
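A minimal sketch of HITS-style hub/authority scoring on a toy asker-to-answerer graph; the edges and user names are invented for illustration and do not come from the paper.

    # Edges: asker -> answerer (one edge per answered question).
    edges = [("u1", "u4"), ("u2", "u4"), ("u2", "u5"), ("u3", "u5"), ("u4", "u6")]
    users = {u for e in edges for u in e}

    hub = {u: 1.0 for u in users}        # hub score ~ asker
    auth = {u: 1.0 for u in users}       # authority score ~ answerer

    for _ in range(50):                  # power iteration
        auth = {v: sum(hub[u] for u, w in edges if w == v) for v in users}
        hub = {u: sum(auth[w] for v, w in edges if v == u) for u in users}
        # normalize to keep the scores bounded
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {u: x / a_norm for u, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}

    print(sorted(auth.items(), key=lambda kv: -kv[1]))   # best answerers first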
Qualitative Observations
HITS effective
HITS ineffective
33
Random forest classifier
[Figure, slide 34]
Result 1: Identifying High Quality Questions
35
Top Features for Question Classification
• Asker popularity (“stars”)
• Punctuation density
• Topical category
• Page views
• KL divergence from a reference corpus language model (see the sketch below)
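A minimal sketch of the last feature above, computing the KL divergence between a question's unigram language model and a reference-corpus model; the texts, vocabulary, and smoothing are toy assumptions.

    import math
    from collections import Counter

    def unigram_lm(text, vocab, alpha=0.01):
        """Maximum-likelihood unigram model with add-alpha smoothing over vocab."""
        counts = Counter(text.lower().split())
        total = sum(counts.values()) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def kl_divergence(p, q):
        return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

    reference = "how do i install python on windows and set the path variable"
    question = "yo anyone kno how 2 make python work lol"
    vocab = set((reference + " " + question).lower().split())

    p = unigram_lm(question, vocab)
    q = unigram_lm(reference, vocab)
    print(round(kl_divergence(p, q), 3))   # higher divergence ~ less 'typical' language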
36
Identifying High Quality Answers
37
Top Features for Answer Classification
• Answer length
• Community ratings
• Answerer reputation
• Word overlap
• Kincaid readability score
38
CGC as a corpus
• Better statistics on a larger or cleaner text collection
– Compute IDF on the entire Wikipedia corpus (a minimal IDF sketch follows this list)
• Beware of different style / topic distribution
– Cf. Web as a corpus (Mihalcea '04, Gonzalo et al. '03)
• Web corpus statistics, such as counts of search results
– Cf. using very large corpora in NLP (Banko and Brill ‘01a, ‘01b)
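A minimal IDF sketch over a toy in-memory corpus standing in for a Wikipedia dump; the documents and terms are invented.

    import math
    from collections import Counter

    docs = [
        "the leopard is a large cat of the genus panthera",
        "topology studies properties preserved under continuous deformations",
        "the cat is a small domesticated carnivorous mammal",
    ]

    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d.split()))

    N = len(docs)
    idf = {t: math.log(N / df[t]) for t in df}
    print(idf["cat"], idf["topology"])      # common terms get lower IDF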
39
CGC as a corpus (cont’d)
• Extend existing knowledge repositories (e.g., WordNet)
• Create new datasets, labeled by construction
– WSD datasets using OpenMind – Chklovski & Mihalcea ‘02, ‘03
– WSD datasets based on Wikipedia – Mihalcea’07; Bunescu & Pasca ‘06; Cucerzan ’07
– Text categorization datasets based on the ODP (Davidov et al. ‘04)
• Study the authoring process by observing multiple revisions of a document
40
41 CGC as corpus / Extending WordNet
Using corpus statistics for term weighting
• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. WSDM 2010
– Periodic crawls (starting from an unknown point in a document's life) vs. formally tracked revision history
– LM-specific vs. general term frequency model, which can be plugged into any retrieval model
– Aji et al. '10 also model bursts – "significant" events in the document's life
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST 2010
– Temporal behavior of terms as the entire collection changes over time
42 CGC as corpus / Document revisions
Extending term weighting models with Revision History Analysis (RHA) [Aji et al. ’10]
• RHA captures term importance from the document authoring process.
• Natural integration with state of the art retrieval models (BM25, LM)
• Consistent improvement over existing retrieval models
• Beyond Wikipedia
– Tracking changes in MS Word documents; version control
– Past snapshots of Web pages
(Elsas & Dumais ’10 constructed a new corpus)
CGC as corpus / Document revisions43
Observable document generation process
Example: revisions of "Topology" in Wikipedia
[Figure: the 1st, 250th, and current revisions]
CGC as corpus / Document revisions 44
Revision History Analysis redefines Term Frequency (TF)
• TF is a key indicator of document relevance
• TF can be naturally integrated into ranking models
BM25:
$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t, D) \cdot (k_1 + 1)}{TF(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$
Language Model:
$S(Q, D) = D(Q \,\|\, D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P(t|D)}$
CGC as corpus / Document revisions45
RHA global model
Define the term frequency based on the whole document generation process
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions
$TF_{global}(t, d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$
where $c(t, v_j)$ is the frequency of term $t$ in revision $v_j$, $j^{\alpha}$ is the decay factor, and the sum iterates over the document's revisions $v_1 \ldots v_n$.
CGC as corpus / Document revisions 46
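A minimal sketch of the $TF_{global}$ computation above; the revision texts and the decay parameter are made up for illustration.

    def tf_global(term, revisions, alpha=1.0):
        """Sum the term's count in each revision v_j, decayed by j**alpha (j = 1 is the first revision)."""
        return sum(rev.split().count(term) / (j ** alpha)
                   for j, rev in enumerate(revisions, start=1))

    revisions = [
        "topology is the study of shapes",                       # early revisions weigh more
        "topology is the study of shapes and spaces",
        "topology studies properties preserved under deformation",
    ]
    print(tf_global("topology", revisions))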
RHA global model: example
[Figure: per-revision term counts weighted by the decay factor, giving decayed term importance]
CGC as corpus / Document revisions 47
But some pages are different: "Avatar (2009 film)" – a very bursty editing process
[Figure: the 1st, 500th, and current revisions]
CGC as corpus / Document revisions 48
RHA burst model
• A burst resets the decay clock for a term
• The weight will decrease after a burst
$TF_{burst}(t, d) = \sum_{j=1}^{m} \sum_{k=b_j}^{n_j} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$
where $c(t, v_k)$ is the frequency of term $t$ in revision $v_k$ and $(k - b_j + 1)^{\beta}$ is the decay factor for the $j$th burst; the outer sum iterates over the $m$ bursts, and the inner sum iterates over the revisions within each burst, from its first revision $b_j$ to its last revision $n_j$.
• Burst detection:
– Content-based (relative content change)
– Activity-based (intensive edit activity)
– Both models combined
CGC as corpus / Document revisions 49
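A minimal sketch of $TF_{burst}$ under the formula above, assuming burst start positions are already known; real burst detection is content- or activity-based, as listed, and the revisions here are invented.

    def tf_burst(term, revisions, burst_starts, beta=1.0):
        """Within each burst, the decay clock restarts at the burst's first revision."""
        bursts = sorted(burst_starts) + [len(revisions) + 1]      # 1-based revision indices
        total = 0.0
        for j in range(len(bursts) - 1):
            b_j, b_next = bursts[j], bursts[j + 1]
            for k in range(b_j, b_next):                          # revisions within burst j
                count = revisions[k - 1].split().count(term)
                total += count / ((k - b_j + 1) ** beta)
        return total

    revisions = ["avatar film", "avatar film by james cameron",
                 "avatar box office record", "avatar sequels announced"]
    print(tf_burst("avatar", revisions, burst_starts=[1, 3]))     # bursts start at revisions 1 and 3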
RHA burst model: example
[Figure: per-burst decay factors and the resulting term importance]
CGC as corpus / Document revisions 50
Putting it all together: RHA Term Frequency
Combining the global model with the burst model:
$TF_{rha}(t, D) = \lambda_1 \cdot TF_{g}(t, D) + \lambda_2 \cdot TF_{b}(t, D) + \lambda_3 \cdot TF(t, D)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$
(the three components: global, burst, and original TF)
CGC as corpus / Document revisions 51
Integrating RHA into retrieval models
BM25 + RHA: replace $TF(t, D)$ with $TF_{rha}(t, D)$:
$S(Q, D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF_{rha}(t, D) \cdot (k_1 + 1)}{TF_{rha}(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$
Statistical Language Models + RHA: replace $P(t|D)$ with the RHA term probability $P_{rha}(t|D)$:
$S(Q, D) = D(Q \,\|\, D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P_{rha}(t|D)}$
Performance improvements: +2-7% (INEX), +1-4% (TREC)
CGC as corpus / Document revisions52
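A minimal sketch of the BM25 substitution shown above, combining global, burst, and current-revision term frequencies as on the previous slides; the lambda values, IDF, and document lengths are illustrative assumptions, not values from the paper.

    def tf_rha(tf_g, tf_b, tf_cur, lambdas=(0.4, 0.3, 0.3)):
        """TF_rha = l1*TF_global + l2*TF_burst + l3*TF(current revision); the lambdas sum to 1."""
        l1, l2, l3 = lambdas
        return l1 * tf_g + l2 * tf_b + l3 * tf_cur

    def bm25_term_score(tf, idf, doc_len, avgdl, k1=1.2, b=0.75):
        """One term's BM25 contribution, with the classic TF replaced by TF_rha."""
        return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

    # Toy numbers: the TF_global / TF_burst values echo the earlier sketches; IDF and lengths are made up.
    tf = tf_rha(tf_g=1.83, tf_b=3.0, tf_cur=1)
    print(bm25_term_score(tf, idf=2.1, doc_len=8, avgdl=10.0))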
Overview of Part 3
Use CGC as an additional (huge!) corpus
• Better term statistics
• Observing the authoring process
CGC as repositories of world knowledge
• Distilling knowledge from CGC structure and content
• Concepts as features (bag of words + concepts)
– Semantic relatedness, and later IR
• Concepts as word senses
– WSD
53
CGC as repositories of senses and concepts
• Articles / concepts as representation features
– From the bag of words to
the bag of words + concepts
– Semantic relatedness of words (and texts)
– Later: concept-based IR
• Articles / concepts as word senses
– Word sense disambiguation
CGC as knowledge repositories54
Semantic relatedness of words and texts
• How related are
cat mouse
Augean stables golden apples of the Hesperides
慕田峪 (Mutianyu) 司马台 (Simatai)
• Used in many applications
– Information retrieval
– Word-sense disambiguation
– Error correction (“Web site” or “Web sight”)
55 CGC as knowledge / Semantic relatedness
Evaluation criterion: correlation with human judgments
Word pair, human score [0 .. 10]:
journey / voyage: 9.29
Jerusalem / Israel: 8.46
money / property: 7.57
football / tennis: 6.63
alcohol / chemistry: 5.54
coast / hill: 4.38
professor / cucumber: 0.23
56 CGC as knowledge / Semantic relatedness
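A minimal sketch of the evaluation criterion, correlating hypothetical system scores against the human scores from the table above; Spearman rank correlation (here via scipy) is one common choice, and the system scores are invented.

    from scipy.stats import spearmanr

    pairs = ["journey/voyage", "Jerusalem/Israel", "money/property", "football/tennis",
             "alcohol/chemistry", "coast/hill", "professor/cucumber"]
    human = [9.29, 8.46, 7.57, 6.63, 5.54, 4.38, 0.23]
    system = [0.91, 0.78, 0.70, 0.55, 0.40, 0.52, 0.05]   # hypothetical relatedness scores

    rho, _ = spearmanr(human, system)
    print(f"Spearman correlation with human judgments: {rho:.2f}")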
Previous approaches
• WordNet-based measures
– Using graph structure (path length, random walks), information content, glosses
– Comprehensive review: Budanitsky & Hirst ‘06
– Patwardhan et al. ‘03 – using semantic relatedness measures for WSD
• Jiang-Conrath (WN structure + info content) performs on par with Adapted Lesk (gloss overlap)
• All other WN-based measures are inferior to Adapted Lesk
• Co-occurrence based measures
– Latent Semantic Analysis (Deerwester et al. ‘90)
CGC as knowledge / Semantic relatedness57
The role of background knowledge
When humans judge semantic relatedness, they use massive amounts of background knowledge.
Can we endow computers with such a capability for knowledge-based reasoning?
CGC as knowledge / Semantic relatedness58
Using Wikipedia for computing semantic relatedness
• Structure (links)
– Strube & Ponzetto ’06 (WikiRelate)
– Milne & Witten ’08 (WLM)
• Content (text)
– Gabrilovich & Markovitch ’07 (ESA)
• Structure + content
– Yeh et al. ’09 (WikiWalk)
CGC as knowledge / Semantic relatedness59
Explicit Semantic Analysis (ESA) [Gabrilovich & Markovitch '06-'09]
• Wikipedia-based semantic representation
• Some ESA applications
– Computing semantic relatedness of words and texts
– Text categorization
– Information retrieval
• Cross-lingual retrieval utilizing cross-language links in Wikipedia
CGC as knowledge / Semantic relatedness60
The AI Dream
[Diagram: Problem → Problem solver → Solution]
CGC concepts as features 61
The circular dependency problem
[Diagram: Problem → Problem solver → Solution, where the problem solver needs Natural Language Understanding]
CGC concepts as features 62
Circumventing the need for deep Natural Language Understanding
[Diagram: Wikipedia-based semantics + statistical text processing in place of deep Natural Language Understanding]
CGC concepts as features 63
Every Wikipedia article represents a concept
Leopard
64 CGC concepts as features
Wikipedia can be viewed as an ontology – a collection of concepts
Examples: Leopard, Isaac Newton, Ocean
65 CGC concepts as features
The semantics of a word is a vector of its associations with Wikipedia concepts
[Figure: a word mapped to a weighted vector of Wikipedia concepts]
Cf. Bengio's talk tomorrow on learning text and image representations
66 CGC concepts as features
Word representation: 2D example
The semantics of a word is a point in the multi-dimensional space defined by the set of Wikipedia concepts.
[Figure: a word plotted in the plane spanned by Wikipedia concept A and Wikipedia concept B, axes from 0.0 to 1.0]
67 CGC concepts as features
Text representation: 2D example
The semantics of a text fragment is the centroid of the vectors of the individual words.
[Figure: words 1–4 plotted in the plane spanned by Wikipedia concept A and Wikipedia concept B, together with their centroid]
68 CGC concepts as features
[Figure: the word "Cat" linked to Wikipedia concepts such as Leopard, Mouse, Veterinarian, Pets, Tennis ball, Elephant, House]
CGC concepts as features 69
Concept and word representation in Explicit Semantic Analysis [Gabrilovich & Markovitch '09]
Concepts described by their most strongly associated words:
– Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
– Cat: Cat [0.95], Felis [0.74], Claws [0.61]
– Tom & Jerry: Animation [0.85], Mouse [0.82], Cat [0.67]
The word "Cat" described by its most strongly associated concepts: Cat (0.95), Panthera (0.92), Tom & Jerry (0.67)
70 CGC concepts as features
Computing semantic relatedness as cosine between two vectors
Example concept vectors:
– "Cat": Cat (0.95), Panthera (0.92), Tom & Jerry (0.67)
– "Leopard": Leopard (0.84), Mac OS X v10.5 (0.61), Cat (0.52)
Relatedness is the cosine between the two concept vectors.
71 CGC concepts as features
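A minimal sketch of the cosine computation above over sparse concept vectors, plus the centroid used for text fragments; the vectors are the illustrative ones from this slide, not real ESA output.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse concept vectors (dict: concept -> weight)."""
        dot = sum(w * v.get(c, 0.0) for c, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def centroid(vectors):
        """A text fragment's vector: the centroid of its words' concept vectors."""
        out = {}
        for vec in vectors:
            for c, w in vec.items():
                out[c] = out.get(c, 0.0) + w / len(vectors)
        return out

    cat = {"Cat": 0.95, "Panthera": 0.92, "Tom & Jerry": 0.67}
    leopard = {"Leopard": 0.84, "Mac OS X v10.5": 0.61, "Cat": 0.52}
    print(cosine(cat, leopard))                     # the two vectors share only the "Cat" dimension
    print(cosine(centroid([cat, leopard]), cat))    # a two-word "text" compared with one of its words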
Experimental results (individual words)
Algorithm: correlation with human judgments
WordNet-based: 0.33–0.35
Roget's Thesaurus: 0.55
LSA: 0.56
WikiRelate: 0.49
WLM: 0.69
ESA (ODP): 0.65
ESA (Wikipedia): 0.75
TSA [Radinsky et al. '11]: 0.80
EWC (ESA/Wikipedia + WordNet + collocations) [Haralambous & Klyuev, forthcoming at IJCNLP '11]: 0.79–0.87
CGC concepts as features72
Using Wikipedia for Named Entity Disambiguation [Bunescu & Pasca ’06]
• Both the method and the dataset are based on Wikipedia
• Dataset: 1.7M labeled examples
– "The [[Vatican City | Vatican]] is now an independent enclave ..."
– Geographical location, not the Holy See
• The basic approach: compute the cosine similarity between the word context and the target articles
CGC for WSD 73
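A minimal sketch of the basic approach above; the context and candidate article texts are tiny invented stand-ins, and real systems use full Wikipedia articles with TF-IDF weighting.

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    context = Counter("the vatican is an independent enclave inside rome".split())
    candidates = {
        "Vatican City": Counter("independent city state enclave rome pope territory".split()),
        "Holy See":     Counter("episcopal jurisdiction catholic church pope governance".split()),
    }

    best = max(candidates, key=lambda name: cosine(context, candidates[name]))
    print(best)   # pick the article whose text is most similar to the word's context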
Problems with the basic approach
1. Short articles and stubs
2. Synonymy (lexical mismatch between the word context and the article)
Solution: use categories for generalization
CGC for WSD74
Word-Category Correlations
Example: "John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood."
Which John Williams?
– John Williams (composer): categories Film score composers → Composers → Musicians
– John Williams (wrestler): categories Professional wrestlers → Wrestlers → People known in connection with sports and hobbies → People by occupation
When a Wikipedia article is too short, get additional context from its categories.
CGC for WSD 75
The full solution [Bunescu & Pasca ’06]
• ML-based approach (SVM) to rank candidate senses
• Features:
– Cosine(word context, article text)
– Word-category affinities (binary features indicating if a word has occurred in any article of a given category)
• Dimensionality (|V| x |C|) can be reduced by thresholding word frequencies
• Accuracy improvement vs. cosine-only: 3%-25.5%
CGC for WSD76
Results [Bunescu & Pasca ’06]
• The classification model can be trained on Wikipedia, then applied to any input text
– Senses can only be resolved to those defined in Wikipedia
• Another idea to overcome the lexical mismatch: augment Wikipedia articles with contexts of corresponding word senses
– Cf. anchor text aggregation [Metzler et al. '09] and [Cucerzan '07]
[Figure: accuracy of the cosine baseline vs. the Taxonomy Kernel]
CGC for WSD 77
An alternative approach to named entity disambiguation [Cucerzan ‘07]
• Applicable to any text, not solely that from Wikipedia
– Sense inventory still limited to Wikipedia
• The system performed both NE identification and disambiguation
• World knowledge used
– Known entities (Wikipedia articles)
– Their class (Person, Location, Organization)
– Known surface forms (redirection pages + anchor text)
– Words that describe or co-occur with the entity
– Categories, lists, enumeration, tables
CGC for WSD78
Restricted document representation: aggregation of features of all identified surface forms for all entity names
[Figure: entities that can be referred to as "Columbia"]
CGC for WSD 79
Key idea: joint disambiguation
• Maximize the agreement
– between the entity vectors and the document, and
– between the categories of the entities
• Example: {Columbia, Discovery}
– Space Shuttle Columbia & Space Shuttle Discovery
– Columbia Pictures & Discovery Investigations
– LIST_astronomical_topics, CAT_manned_spacecraft, CAT_space_shuttles
• If the "one sense per discourse" assumption is violated, shrink the context iteratively
CGC for WSD80
Results [Cucerzan ‘07]
• Baseline: most frequent Wikipedia sense
• Wikipedia dataset (labeled by construction)
– 86.2% → 88.3%
• MSNBC news stories (human labeling)
– 51.7% → 91.4%
– Surface forms common in news are highly ambiguous
– Popularity in Wikipedia ≠ popularity in news
[Examples: Blackberry, China]
CGC for WSD
81
Case Study: Searching Tweets
3/28/2016 CS 572: Information Retrieval. Spring 2016 82
Twitter Statistics (2014)
• 284 million monthly active users
• 500 million tweets are sent per day
• More than 300 billion tweets have been sent since company founding in 2006
• Tweets-per-second record: one-second peak of 143,199 TPS.
• More than 2 billion search queries per day.
3/28/2016 CS 572: Information Retrieval. Spring 2016 83
Twitter Search Architecture
3/28/2016 CS 572: Information Retrieval. Spring 2016 84
Twitter Search Architecture
3/28/2016 CS 572: Information Retrieval. Spring 2016 85
Twitter Search Architecture
3/28/2016 CS 572: Information Retrieval. Spring 2016 86
Active Index Segments: Write Friendly
3/28/2016 CS 572: Information Retrieval. Spring 2016 87
Twitter Search Architecture
3/28/2016 CS 572: Information Retrieval. Spring 2016 88
Twitter Lucene Extensions
3/28/2016 CS 572: Information Retrieval. Spring 2016 89
Twitter Lucene Extensions (cont’d)
3/28/2016 CS 572: Information Retrieval. Spring 2016 90
Lucene Extensions: Storage
3/28/2016 CS 572: Information Retrieval. Spring 2016 91
In-Memory Real Time Index
3/28/2016 CS 572: Information Retrieval. Spring 2016 92
Revisit Old Idea: Probabilistic Skip Lists
3/28/2016 CS 572: Information Retrieval. Spring 2016 93
3/28/2016 CS 572: Information Retrieval. Spring 2016 94
Top Entities (cont’d)
3/28/2016 CS 572: Information Retrieval. Spring 2016 95
Resources
• CGC tutorial at WSDM 2012: http://www.mathcs.emory.edu/~eugene/wsdm2012-cgc-tutorial/
• Real-time Search at Twitter: https://www.youtube.com/watch?v=cbksU0qkFm0
http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
http://www.slideshare.net/lucidworks/search-at-twitter-presented-by-michael-busch-twitter
3/28/2016 CS 572: Information Retrieval. Spring 2016 96