1
I256: Applied Natural Language Processing
Marti Hearst
Nov 8, 2006
2
Today
Comparing term clustering and category output
Clustering in Weka
Data mining from blogs
3
LDA
Latent Dirichlet Allocation
Blei, Ng, Jordan, JMLR 03.
LDA is a hierarchical probabilistic model of documents.
“LDA allows you to analyze a corpus, and extract the topics that combined to form its documents.”
http://www.cs.princeton.edu/~blei/lda-c/
Not really clustering, but in the “soft clustering” ballpark.
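To make the “soft clustering” idea concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler. This is an illustration under stated assumptions, not the lda-c code linked above (which uses variational EM); the function name `lda_gibbs` and the hyperparameter defaults are hypothetical choices for the example.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists.

    Returns (theta, topic_word): theta[d] is the topic mixture of document d,
    topic_word[k] counts the tokens currently assigned to topic k.
    alpha and beta are illustrative smoothing values, not tuned settings.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | everything else) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Each document gets a distribution over topics rather than one label.
    theta = [
        [
            (doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
            for t in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word
```

Each row of `theta` sums to 1, so a recipe that mixes baking and grilling vocabulary can belong partly to both topics, which is what separates this from a hard clustering.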
4
LDA on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/
Flamenco
5
LDA on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/
Flamenco
6
CastaNet
(Semi)automated facet creation
Stoica & Hearst
Build up from WordNet
Algorithm is fully automatic but we think you can improve results manually afterwards.
7
CastaNet on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/
Flamenco
8
CastaNet on Recipes
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/
Flamenco
9
TopicSeek on Enron Email
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/enron.html
10
TopicSeek on Medline
Technique: pLSI (probabilistic LSI, Hofmann 99)
Hand-picked example for website
http://topicseek.com/pubmed.html
11
CastaNet on Medline Journal Titles
http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/
Flamenco
12
Clustering in Weka
13
14
15
16
Looking at Clustering Results
Weka lets you save cluster results to an ARFF file.
I wrote some Python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.
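The original script is not shown on the slide, so the following is only a sketch of how such processing might look. It assumes a dense ARFF file in which each posting has a string attribute named `Subject` and a nominal attribute named `Cluster`; both attribute names, the helper `subjects_by_cluster`, and the sample rows are assumptions for illustration.

```python
import csv
from collections import defaultdict


def subjects_by_cluster(arff_lines, subject_attr="Subject", cluster_attr="Cluster"):
    """Group subject headings by cluster label from a dense ARFF file.

    Simplifications (assumptions): attribute names contain no spaces,
    and the data section is not in sparse {index value} format.
    """
    attributes = []
    data_rows = []
    in_data = False
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            data_rows.append(line)
        elif line.lower().startswith("@attribute"):
            # The second whitespace-separated field is the attribute name.
            attributes.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True

    subject_idx = attributes.index(subject_attr)
    cluster_idx = attributes.index(cluster_attr)

    clusters = defaultdict(list)
    # String values are assumed to be single-quoted with backslash escapes.
    for row in csv.reader(data_rows, quotechar="'", escapechar="\\"):
        clusters[row[cluster_idx]].append(row[subject_idx])
    return dict(clusters)


# Invented sample input, only to show the expected file shape.
EXAMPLE_ARFF = """@relation newsgroups_clustered
@attribute Instance_number numeric
@attribute Subject string
@attribute Cluster {cluster0,cluster1}
@data
0,'Re: Mac vs. PC',cluster0
1,'Hockey playoffs, round 2',cluster1
2,'Re: SCSI drives',cluster0
""".splitlines()

if __name__ == "__main__":
    for cluster, subjects in sorted(subjects_by_cluster(EXAMPLE_ARFF).items()):
        print(cluster)
        for subject in subjects:
            print("   ", subject)
```

Reading the subject lines cluster by cluster is a quick way to judge by eye whether a cluster lines up with a newsgroup.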
17
15-way clustering
18
19
Cobweb clustering
20
21
Blog Analysis
What’s special about blogs?
22
Blog analysis sites
http://dijest.com/bc/
Called blogcount; lots of stats and news about blogs
http://blogcensus.net/?page=tools
Language, location, market share
http://www.perseus.com/blogsurvey/
Stats about biggest blogs, demographics
http://www.weblogs.com/
Notify when new content posted
http://blogpulse.com/
Trends and recent popular topics
23
Blogs vs. Newsgroups
Posting about products … what can we tell?
Blog:
Newsgroup:
Example from Glance, Hurst, and Tomokiyo ‘04
24
Analyzing Blogs for Market Data
Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05
Idea: examine comments about a product (or a product’s competition or market) in an automated fashion.
Application area: handheld electronic devices.
25
Analyzing Blogs for Market Data
Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05
26
Technology used
Post segmentation
Important phrases
Foreground vs. background corpus
– Background: text about product
– Foreground: certain negative paragraphs about product
Sentiment classification
What do people talk about when saying negative things about product X?
Social network analysis (on discussion boards)
What does this group of people talk about when saying negative things about product X?
Author dispersion
– Many people talking about it, or just a few?
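The foreground vs. background idea can be sketched as a phrase-ranking step: score each two-word phrase by how much more likely it is in the foreground (negative paragraphs) than in the background (all text about the product). The smoothed log-ratio used here is a simple stand-in chosen for illustration, not necessarily the scoring used by Glance et al.; the function names and the smoothing value are assumptions.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


def distinctive_phrases(foreground_docs, background_docs, top_n=5, smoothing=1.0):
    """Rank foreground bigrams by log(P(phrase | fg) / P(phrase | bg)).

    Add-one style smoothing (an assumed choice) keeps the ratio finite
    for phrases that are rare in the background corpus.
    """
    fg_counts = Counter(p for doc in foreground_docs for p in bigrams(doc))
    bg_counts = Counter(p for doc in background_docs for p in bigrams(doc))
    fg_total = sum(fg_counts.values())
    bg_total = sum(bg_counts.values())
    num_phrases = len(set(fg_counts) | set(bg_counts))

    scores = {}
    for phrase, count in fg_counts.items():
        p_fg = (count + smoothing) / (fg_total + smoothing * num_phrases)
        p_bg = (bg_counts[phrase] + smoothing) / (bg_total + smoothing * num_phrases)
        scores[phrase] = math.log(p_fg / p_bg)

    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(" ".join(phrase), score) for phrase, score in ranked[:top_n]]
```

Phrases with a high score are over-represented in the negative text, which is one way to approach the question of what people talk about when saying negative things about product X.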
27
Example
What common phrases do people use when saying negative things about product X?
28
Example
What do people in this group say when saying negative things about product X?
29
Example
What do people in this group say when saying negative things about product X?
30
Predicting Film Sales
Idea: Use discussion before a film opens to predict its opening weekend box office scores
Use discussion afterwards to predict longer-term sales
Separate out topic labels from sentiment labels
Outcome:
Good predictor for opening weekend, but not for longer term
Observation: the nature of the discussion changes (and thus gets harder to analyze) after the film has been out a while.
Example from Mishne & Glance, 2006
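A simple way to test whether a blog-derived measure “predicts” sales is to correlate it with box office figures across films. The sketch below computes Pearson's r with the standard library; the choice of Pearson's r and the function name are assumptions for illustration, and no data from the study is reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    xs might be a per-film count of positive pre-release posts and ys the
    opening weekend gross; both interpretations are hypothetical.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Computing the correlation once with raw post counts and once with sentiment-filtered counts shows whether separating topic from sentiment adds predictive signal.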
31
Predicting Film Sales
Example from Mishne & Glance, 2006
32
Predicting Film Sales
Example from Mishne & Glance, 2006
33
Predicting Film Sales
Example from Mishne & Glance, 2006
34
Analyzing Political Blogs
Analyze:
Who links to whom
What the popularity profile looks like
– A power law/Zipf/Pareto distribution, of course
Look at structure of topic-specific blogs
– By # of inbound links
Image from blogosphere ecosystem via Shirky
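A quick check for a power-law popularity profile: sort blogs by inbound link count, then fit a straight line to log(count) against log(rank). A roughly linear fit with negative slope is the Zipf-like signature. This least-squares fit is a rough diagnostic only, not a rigorous test, and the function name is an assumption.

```python
import math


def loglog_slope(inbound_link_counts):
    """Least-squares slope of log(count) against log(rank).

    Blogs with zero inbound links are dropped because log(0) is undefined.
    """
    counts = sorted((c for c in inbound_link_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two blogs with inbound links")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

If link counts fall off as 1/rank, the slope comes out close to -1.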
35
Analyzing Political Blogs
Earlier work examined books bought together in pairs at major retailers
Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html
In other domains the groupings are more distributed.
36
http://www.orgnet.com/booknet.html
37
http://www.orgnet.com/leftright.html (from Jan 2003)
38
http://www.orgnet.com/divided.html (from 2004 election)
39
Analyzing Political Blogs
Study by Adamic and Glance, 2005
Analyzed the 40 most popular political blogs
2 months preceding the 2004 US presidential election
Also studied 1000 political blogs in a one-day snapshot
Findings for the latter:
Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news
– Used labels from aggregator sources
Linking patterns were indeed pretty internal (91% stayed within political leaning)
More, and more frequent, linking among conservatives
– 82% of conservative blogs linked out vs. 74% of liberal blogs
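The “stayed within political leaning” figure is a ratio over the link graph: of all links between labeled blogs, what share connects two blogs with the same label? A minimal sketch, assuming links are given as (source, target) pairs and labels as a dictionary; the function name and data layout are hypothetical, not the study's.

```python
def within_leaning_fraction(links, leaning):
    """Share of links whose source and target blogs have the same label.

    links: iterable of (source_blog, target_blog) pairs.
    leaning: dict mapping blog name to a label such as "liberal".
    Links touching an unlabeled blog are ignored.
    """
    labeled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labeled:
        raise ValueError("no links between labeled blogs")
    same = sum(1 for s, t in labeled if leaning[s] == leaning[t])
    return same / len(labeled)
```

A value near 1.0 means the two communities rarely link across the divide.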
40
Analyzing Political Blogs
For the 40 most popular blogs:
Looked for “echo chamber” effect
The conservative blogs are more tightly interlinked.
Question: do they repeat the same concepts more?
– Measured textual similarity among blog posts
– Slightly stronger within a political leaning than between, but not one orientation more than the other.
Looked for interaction with “mainstream” media
Found strong distinctions between which sources were cited
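The echo chamber question can be framed as comparing average textual similarity for pairs of posts within the same leaning against pairs across leanings. The sketch below uses cosine similarity over raw term counts, which is an assumed measure for illustration; the slide does not say which similarity measure the study used.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def within_between_similarity(posts):
    """posts: list of (leaning, tokens) pairs.

    Returns (mean similarity within a leaning, mean similarity between
    leanings); a mean is NaN when there are no pairs of that kind.
    """
    vectors = [(leaning, Counter(tokens)) for leaning, tokens in posts]
    within, between = [], []
    for (label_a, vec_a), (label_b, vec_b) in combinations(vectors, 2):
        target = within if label_a == label_b else between
        target.append(cosine(vec_a, vec_b))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)
```

To compare one orientation against the other, the same function can be run on the posts of each leaning separately.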
41
Image from Adamic & Glance 200
42
Image from Adamic & Glance 200
43
Image from Adamic & Glance 200
44
Image from Adamic & Glance 200
45
Image from Adamic & Glance 200
46
Image from Adamic & Glance 200
47
Next Time
Sentiment and Opinion Analysis