HICSS 45 Tutorial on Mining and Analyzing Social Media Part 1. David King. Jan 4, 2012
Mining and Analyzing Social Media
HICSS 45 Tutorial – Part 1
Dave King
January 4, 2012
Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
Agenda: This is how the slides are organized
• Part 1
– Introduction – Bio, Resources, Social Media
– Data Mining – Processes and Example
– Text Mining – General Processes and Example
– Predicting the Future – The Portmanteaus
• Part 2
– Sentiment Analysis
– Social Network Analysis - Introduction
Biography: Dave King
• Currently, EVP of Product Development
and Management at JDA Software
• 30 years in enterprise package
software business
• 15 years as university professor
• 14 years as Co-Chair of the Internet &
Digital Economy Track (HICSS)
• Long time interest in various aspects of
E-Commerce & Business Intelligence
• Tutorial topic primarily reflects a
personal interest and tangentially a
job(s) related interest.
Personal Experiences with Analytics
• Taught applied statistics, math modeling & mathematical sociology
• In software R&D for 30 years
– Optimization in the 80s
– Natural Language Frontends
• NLI Query & CMU Robotics Lab
– EIS Competitive Analysis
• Dow Jones and Reuters
• Verity Topics
• NewsAlert
– InXight’s Hyperbolic Tree
– Supply Chain Analytics
• In the case of text analysis and its practical application, audiences
have often been small, bewildered, and fleeting
Mining and Analytics Resources
Mining and Analytics Resources: Web Sites, Online Books & Tutorials
• DM/Blog -- abbottanalytics.blogspot.com
• DM/Blog – blog.data-miners.com
• DM/Blog -- bx.businessweek.com/data-mining/blogs
• DM/Blog -- bytemining.com
• DM/Blog – data-mining.alltop.com
• DM/Blog -- dataminingblog.com
• DM/Blog – dataminingdownunder.com
• DM/Blog -- datamining.typepad.com
• DM/Blog -- datawrangling.com
• DM/Blog -- timmanns.blogspot.com
• DM/General -- kdnuggets.com
• DM/General -- mydatamine.com
• DM/General -- the-data-mine.com
• DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm
• DM/Tutorial -- autonlab.org/tutorials/
• TA/General -- social.textanalyticsnews.com
• TA/General -- textanalysis.info
• TM/Blog -- blogs.sas.com/text-mining
• TM/Blog -- lingpipe-blog.com
• TM/Blog -- texttechnologies.com
• TM & TA/Blog -- informationweek.com/authors/showAuthor.jhtml?authorID=1331
• TA Tutorial -- slideshare.net/SethGrimes/text-analytics-overview-2011
• TM & DM/Online Book -- statsoft.com/textbook/text-mining/
• TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html
• TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial
• TM/Wiki -- textanalytics.wikidot.com
• SNA/Blog – iq.harvard.edu/blog/netgov/2011/10/
• SNA/Blog – thenetworkthinkers.com
• SNA/Blog – blog.echen.me/tag/social-network-analysis/
• SNA/Blog – lithosphere.lithium.com/t5/user/viewprofilepage/user-id/151
• SNA/Tutorial -- cs.stanford.edu/people/jure/icml09networks/
• DA/Blog – dataists.com
• DA/Blog – drewconway.com
• Visualization/Blog – abeautifulwww.com/
• Visualization/Blog – benfry.com/writing/
• Visualization/Blog -- blog.blprnt.com
• Visualization/Blog – chrisharrison.net/index.php/visualization.com
• Visualization/Blog – datavisualization.ch/
• Visualization/Blog – eagereyes.com
• Visualization/Blog – informationandvisualization.de/
• Visualization/Blog – infosthetics.com
• Visualization/Blog – junkcharts.typepad.com/junk_charts/
• Visualization/Blog – neoformix.com
• Visualization/Blog – perpetualedge.com/blog
• Visualization/Blog – processing.org
• Visualization/Blog – visualcomplexity.com
• Visualization/Blog – well-formed-data.net/
Social Media Defined
Marta Kagan
Social Media Defined: …Sort of …
Social Media Defined: Actually, it’s 33 Definitions
1. Media for social interaction, using highly accessible and scalable
communication techniques.
2. Various user-driven (inbound marketing) channels (e.g., Facebook, Twitter,
blogs, YouTube).
3. Most transparent, engaging and interactive form of public relations
4. What we do and say together, worldwide, to communicate in all directions at
any time, by any possible (digital) means.
5. New marketing tool that allows you to get to know your customers and
prospects in ways that were previously not possible.
6. Platforms that enable the interactive web by engaging users to participate in,
comment on and create content as means of communicating
7. Consists of any online platform or channel for user generated content.
8. Digital content and interaction that is created by and between people.
9. Shift in how we get our information. Social media allows us to network, to find
people with like interests, and to meet people who can become friends or
customers.
10. Platforms for interaction and relationships, not content and ads.
11. Online platforms and locations that provide a way for people to participate in
these conversations.
12. People’s conversations and actions online that can be mined by advertisers
for insights but not coerced to pass along marketing messages.
13. Tools, services, and communication facilitating connection between peers
with common interests.
14. Online technologies and practices that people use to share content, opinions,
insights, experiences, perspectives, and media themselves.
15. Ever-growing and evolving collection of online tools and toys, platforms and
applications that enable all of us to interact with and share information.
Increasingly, it’s both the connective tissue and neural net of the Web.
16. Reflection of conversations happening every day, whether at the supermarket,
a bar, the train, the watercooler or the playground.
17. Online text, pictures, videos and links, shared amongst people and
organizations.
18. Not one thing. It’s five distinct things:
19. Digital, content-based communications based on the interactions enabled by a
plethora of web technologies
20. Collection of online platforms and tools that people use to share content,
profiles, opinions, insights, experiences, perspectives and media itself,
facilitating conversations and interactions online between groups of people.
21. Platform/tools.
22. Act of connecting on social media platforms.
23. How businesses join the conversation in an authentic and transparent way to
build relationships.
24. The notion that social media is about the technology that facilitates individuals
and groups of people to connect and interact, create and share.
25. Any of a number of individual web-based applications aggregating users who
are able to conduct one-to-one and one-to-many two-way conversations.
26. Media channel that relies on listening and conversation, as opposed to a
monologue, to get your point across, make a connection and build a
relationship.
27. Social media is all about leveraging online tools that promote sharing and
conversations, which ultimately lead to engagement with current and future
customers and influencers in your target market.
28. Social media: Evolution, Revolution and Contribution -by the ability of
everybody to share and contribute as a publisher
29. Social media is communication channels or tools used to store, aggregate,
share, discuss or deliver information within online communities.
30. Social Media is simply another arrow to be shot in a company’s marketing
quiver.
31. Social media platforms make it easier to share information–usually online.
32. Any object or tool, that connects people in dialogue or interaction — in
person, in print, or online.
33. Wild, Wild West of Marketing, with brands, businesses, and organizations
jostling with individuals to make news, friends, connections and build
communities in the virtual space.
Social Media Defined: If a Picture isn’t
worth a 1000 words, then …
Social Media Defined
Online technologies and practices
for social interaction
enabling the sharing of opinions, insights,
experiences, perspectives and media itself
Social Media Defined: Categories
Social Media Defined: Unanimous Agreement
Marta Kagan
Social Media is Huge: Users
200 Million: Twitter
100 Million: LinkedIn
750 Million: Facebook
Marta Kagan
Social Media is Huge!
If Facebook
were a country,
it would be the
3rd largest in
the world
Marta Kagan
Social Media Data: Research Opportunity
“Every day, Twitter
generates more
social network
data than the
entire field of SNA
possessed 10
years ago.”
Social Media is Huge:
Usage and Content
Name (Symbol)     10**N   Name (Symbol)     Value
kilobyte (kB)     3       kibibyte (KiB)    2^10 = 1.024 × 10^3
megabyte (MB)     6       mebibyte (MiB)    2^20 ≈ 1.049 × 10^6
gigabyte (GB)     9       gibibyte (GiB)    2^30 ≈ 1.074 × 10^9
terabyte (TB)     12      tebibyte (TiB)    2^40 ≈ 1.100 × 10^12
petabyte (PB)     15      pebibyte (PiB)    2^50 ≈ 1.126 × 10^15
exabyte (EB)      18      exbibyte (EiB)    2^60 ≈ 1.153 × 10^18
zettabyte (ZB)    21      zebibyte (ZiB)    2^70 ≈ 1.181 × 10^21
yottabyte (YB)    24      yobibyte (YiB)    2^80 ≈ 1.209 × 10^24
Social Media Data: Part of a Bigger Picture
Social Media Data:
Ways in which big data is creating value
• Makes information
transparent and usable at
much higher frequency.
• Provides more transactional
data in digital form, that can
be used to improve
performance across the
board.
• Allows ever-narrower
segmentation of customers to
tailor products or services.
• Improves decision-making
through sophisticated analytics.
• Improves the development of
the next generation of
products and services
Data Mining: Defined
Discovering meaningful
patterns from large data
sets using pattern
recognition technologies.
Data Mining: CRISP-DM
Cross-Industry Standard Process for Data Mining
• Phases: Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment
• From Real-World Data to Well-Formed Data: Data Consolidation, Data Transformation, Data Cleaning, Data Reduction
Data Mining: General Data Assumptions
Structured
Transformed
Well-Formed
Data Mining: Example
Affinity Analysis
Data Mining: Example
1. Market Basket Analysis: Items for Sale: Apples, Bananas, Cherries
2. Possible Transactions: With one item or a collection of items selected as
the Driver or Independent Variable
3. Objective is to empirically determine those groups of items that occur
frequently together in a set of transactions, producing a set of rules of the
form X -> Y.
No   X     Y
1    A     B
2    A     C
3    A     B,C
4    B     A
5    B     C
6    B     A,C
7    C     A
8    C     B
9    C     A,B
10   A,B   C
11   A,C   B
12   B,C   A
Data Mining: Example
Standard Market Basket Measures:
Support: Rule’s coverage (% match antecedents)
N(X & Y)/N(T) Example: N(A & B)/N(T) = 2/7 = 29%
Confidence: Rule’s predictive ability (% consequent | antecedent)
N(X & Y)/N(X) Example: N(A & B)/N(A) = 2/4 = 50%
Lift: Predictive improvement (ratio of observed support for X & Y to the support expected if X & Y were independent)
S(XuY)/(S(X)S(Y)) Example: (2/7)/((4/7)(5/7)) = .7 or 70%
Transaction ID   Items
1                Apple, Banana, Cherry
2                Apple
3                Banana, Cherry
4                Banana, Cherry
5                Apple, Banana
6                Apple, Banana
7                Apple, Cherry
8                Apple, Banana
9                Apple, Banana, Cherry
10               Apple, Banana
Transaction ID   Apple   Banana   Cherry
1                1       1        1
2                1       0        0
3                0       1        1
4                0       1        1
5                1       1        0
6                1       1        0
7                1       0        1
8                1       1        0
9                1       1        1
10               1       1        0
Sum              8       8        5
Data Mining: Example
Rule selection usually based
on minimum support & confidence
No   X     Y     N(XuY)   N(T)   S(XuY)   N(X)   Conf   N(Y)   S(X)   S(Y)   Lift   Rule
1    A     B     6        10     60%      8      75%    8      80%    80%    94%    Ok
2    A     C     3        10     30%      8      38%    5      80%    50%    75%
3    A     B,C   2        10     20%      8      25%    4      80%    40%    63%
4    B     A     6        10     60%      8      75%    8      80%    80%    94%    Ok
5    B     C     4        10     40%      8      50%    5      80%    50%    100%
6    B     A,C   2        10     20%      8      25%    3      80%    30%    83%
7    C     A     3        10     30%      5      60%    8      50%    80%    75%
8    C     B     4        10     40%      5      80%    8      50%    80%    100%   Ok
9    C     A,B   2        10     20%      5      40%    6      50%    60%    67%
10   A,B   C     2        10     20%      6      33%    5      60%    50%    67%
11   A,C   B     2        10     20%      3      67%    8      30%    80%    83%
12   B,C   A     2        10     20%      4      50%    8      40%    80%    63%
Parameters
Min. Support 40%
Min. Confidence 75%
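The measures and the rule selection above can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the slides: the baskets are transcribed from the ten-transaction table, lift is computed as S(XuY)/(S(X)S(Y)), and the function names (`rule_measures`, `select_rules`) are hypothetical.

```python
from itertools import combinations

# Baskets transcribed from the slide's transaction table (IDs 1-10);
# A = Apple, B = Banana, C = Cherry.
BASKETS = [
    {"A", "B", "C"}, {"A"}, {"B", "C"}, {"B", "C"}, {"A", "B"},
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def rule_measures(x, y, baskets):
    """Support, confidence and lift for the rule X -> Y."""
    n = len(baskets)
    n_xy = count(x | y, baskets)
    support = n_xy / n
    confidence = n_xy / count(x, baskets)
    lift = support / ((count(x, baskets) / n) * (count(y, baskets) / n))
    return support, confidence, lift

def nonempty_subsets(items):
    """All non-empty subsets of items, in a stable order."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def select_rules(baskets, min_support, min_confidence):
    """Keep rules X -> Y (X, Y disjoint) that meet both thresholds."""
    items = set().union(*baskets)
    selected = []
    for x in nonempty_subsets(items):
        for y in nonempty_subsets(items - x):
            s, c, l = rule_measures(x, y, baskets)
            if s >= min_support and c >= min_confidence:
                selected.append((sorted(x), sorted(y), s, c, l))
    return selected

if __name__ == "__main__":
    for x, y, s, c, l in select_rules(BASKETS, 0.40, 0.75):
        print(x, "->", y, f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

With three items the enumeration yields the same 12 candidate rules listed on the earlier slide; real market-basket tools prune this search (for example with the Apriori algorithm) rather than enumerating every subset.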
Data Mining: Simple Example
But, what if the baskets were described in the
following manner:
– Jane bought a handful of maraschinos and a couple of
granny smiths.
– Harold purchased a bag of appls and 2 bananas.
– Bill paid for a pound of cherries but decided not to buy
the three durians because of their odor.
How could we automate the analysis?
Social Media Data:
Social Media Data: Commonality?
Text Mining: Defined
Using data mining to discover patterns
in a collection of documents
Text Mining: CRISP-Like Processes
• Phases: Business Understanding -> Document Understanding -> Document Preparation -> Modeling -> Evaluation -> Deployment
• From Real-World Text Data (Documents) to Term-Doc-Matrix*: Document Consolidation, Establish the Corpus, Corpus Refinement (Token, Stem, Stop…), Feature Selection & Weighting
Text Mining Process: Sample Corpora
• Brown Corpus – first million word corpus compiled in 60s at
Brown U., 500 samples across 15 genres, each ~2000 words with
POS tags (Lancaster-Oslo-Bergen Corpus – British equivalent)
• Linguistic Data Consortium Treebanks – collections of manually
tagged and parsed sentences (tree structures) from a variety of
sources (includes well-known Penn Treebank collection)
• Reuters-21578, RCV1 & V2, TRC2 -- collections (1000s) of
Reuters English & multi-lingual news stories classified into topics and
grouped into training & test sets
• Pang & Lee’s Sentiment Analysis – 1000 positive and 1000
negative movie reviews
• MEDLINE – An extensive collection of articles and abstracts
(18M+) used in a variety of biomedical and linguistic text mining
applications
• WordNet® -- large lexical database of English grouped into sets of
cognitive synonyms (synsets) and interlinked by means of
conceptual-semantic and lexical relations.
• 20 Newsgroups -- collection of approximately 20,000 newsgroup
documents, partitioned (nearly) evenly across 20 different
newsgroups each representing a different topic.
Text Mining Process: Corpus Refinement
• Tokenization —Parse the text to generate terms. Sophisticated
analyzers can also extract phrases from the text.
• Normalize — Convert them to lowercase.
• Eliminate stop words — Eliminate terms that appear very often (e.g.
the, and, …).
• Stemming — Convert the terms into their stemmed form—remove
plurals and different word forms (e.g. achieve, achieves, achieved –
achiev) [note: word about synonyms – WordNet Synset]
Tokenization -> Normalize -> Eliminate Stop Words -> Stemming
Common representation of tokens within and between documents
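The four refinement steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `naive_stem` is a toy suffix stripper written for this example, not the Porter stemmer or any library routine.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "will"}

def tokenize(text):
    """Split raw text into lowercase word tokens (tokenization + normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop very frequent function words."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Toy stemmer: strip one common suffix, then a trailing 'e'."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    if token.endswith("e") and len(token) > 3:
        token = token[:-1]
    return token

def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

if __name__ == "__main__":
    print(refine("The manager achieves, achieved and will achieve the goals"))
```

The three forms of "achieve" collapse to the same stem, which is the point of the step; a production pipeline would swap in a tested stemmer and a fuller stop-word list.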
Text Mining: Feature Extraction & Weighting
Feature
Extraction
Vector Representation ->
Word, Term, Token or Pairs-Triplets
x Doc Matrix
“Bag of Words, Terms
or Tokens”
Words or Tokens are
attributes and documents
are examples
Token1 Token2 Token3 Token4 …
Doc1 1 2 2 4
Doc2 4 2 3 0
Doc3 1 1 1 0
Doc4 1 1 1 2
…
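A bag-of-words document-term matrix like the one above can be built from tokenized documents with a counter per document. The sketch below is a minimal illustration; the three sample documents and the function name `document_term_matrix` are made up for the example.

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a bag-of-words matrix.

    docs: list of documents, each a list of tokens.
    Returns (vocabulary, rows), where rows[d][t] is the count of
    vocabulary[t] in document d.
    """
    vocabulary = sorted({token for doc in docs for token in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

if __name__ == "__main__":
    # Hypothetical, already-refined documents.
    docs = [
        ["feel", "good", "good"],
        ["feel", "bad"],
        ["good", "day"],
    ]
    vocabulary, matrix = document_term_matrix(docs)
    print(vocabulary)
    for row in matrix:
        print(row)
```

Word order is discarded, which is why the representation is called a bag of words; pairs or triplets can be counted the same way by passing n-grams instead of single tokens.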
Text Mining: Transforming Frequencies
• Binary Frequencies: tf =1 for tf>0; otherwise 0
• Term Frequencies: tf(i,j) / Sum of tf(k,j) over all terms k in Doc j
• Log Frequencies: 1 + log(tf) for tf>0; otherwise 0
• Normalized Frequencies: Divide each frequency by SQRT of Sum of Squares of the frequencies within the vector (column)
• Term Frequency–Inverse Document Frequency – TF * IDF
– Inverse Document Frequency: log(N/(1+D)) where N is total number of docs and D is number with term
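The transformations listed above can be sketched as follows, using the 4 x 4 count matrix from the feature-extraction slide as sample input. The function names are hypothetical, and the IDF follows the slide's formula log(N/(1+D)) with a natural logarithm, which is an assumption since the slide does not name the base.

```python
import math

def binary_frequency(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0

def log_frequency(tf):
    """1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def relative_frequencies(counts):
    """Divide each count by the total count in the same document."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def l2_normalize(vector):
    """Divide each value by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm if norm else 0.0 for v in vector]

def inverse_document_frequency(n_docs, n_docs_with_term):
    """log(N / (1 + D)), as given on the slide."""
    return math.log(n_docs / (1 + n_docs_with_term))

def tf_idf(matrix):
    """Multiply each raw count by the IDF of its column (term)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [inverse_document_frequency(n_docs, d) for d in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

if __name__ == "__main__":
    counts = [[1, 2, 2, 4], [4, 2, 3, 0], [1, 1, 1, 0], [1, 1, 1, 2]]
    print(l2_normalize(counts[0]))
    print(tf_idf(counts)[0])
```

One consequence of the 1 + D in the denominator: a term that appears in every document gets a slightly negative IDF, so it is pushed below zero rather than just to zero.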
Text Mining: Simple Example
Listening Post is an art installation by Mark
Hansen and Ben Rubin that culls text
fragments in real time from thousands of
unrestricted Internet chat rooms, bulletin
boards and other public forums.
Text Mining: Simple Example
• Blogs scanned every 10 mins for sentences with “I feel” or “I’m feeling”
• Each sentence contains 1 of 5000 pre-determined feelings
• 15-20K feelings per day
• Fields: sentence, imageid, feeling, posttime, postdate, posturl, gender, born, country, state, city, lat, lon, conditions
Text Mining: Simple Example
API
http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=Sentence&postdate=2010-11-25&limit=500
Query Result
<?xml version="1.0" ?>
<feelings>
<feeling imageid="-mZmybPrOGTZ+xukpcU7jg"
feeling="better"
sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
posttime="1321633467"
postdate="2010-11-25"
posturl="http://jenngreenleaf.blogspot.com/2011/11/im-coming-down-with-cold-or-am-i.html"
gender="0" country="united states"
state="maine" city="richmond"
lat="44.091522" lon="-69.801787"
conditions="4" />
…
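A query like the one above can be assembled and its XML result parsed with the Python standard library. The sketch below makes no network call: it builds the URL from the parameters shown on the slide and parses a shortened copy of the sample response. The helper names are hypothetical, and whether the service still answers at this address is not something the slides establish.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint as shown on the slide.
BASE_URL = "http://api.wefeelfine.org:8080/ShowFeelings"

def build_query(postdate, returnfields="Sentence", limit=500):
    """Assemble the query URL from the parameters used on the slide."""
    params = {
        "display": "xml",
        "returnfields": returnfields,
        "postdate": postdate,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)

def parse_feelings(xml_text):
    """Return one dict of attributes per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("feeling")]

# Shortened copy of the slide's sample result, closed so it is well-formed.
SAMPLE_RESPONSE = """<?xml version="1.0" ?>
<feelings>
<feeling feeling="better"
 sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
 gender="0" country="united states" state="maine" city="richmond"
 lat="44.091522" lon="-69.801787" conditions="4" />
</feelings>"""

if __name__ == "__main__":
    print(build_query("2010-11-25"))
    for record in parse_feelings(SAMPLE_RESPONSE):
        print(record["feeling"], "|", record["sentence"])
```

Fetching live data would add a call such as `urllib.request.urlopen(build_query(...))`, with the usual error handling for timeouts and malformed responses.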
Text Mining: Simple Example
• i'm done believing you don't know what i'm feeling
• i feel so out of place
• i'm feeling healthy
• i never feel down when i'm with her
• i love the feeling
• i feel like i've been run over by a truck
• i feel so positive today
• i feel like a poor man's pin up girl
Text Mining: Simple Example
• Input String (128925 chars; 24282 spaces)
– "i have found to be helpful especially during those times when i am feeling discouraged\ni have a 50km commute and just the lack of the sense of freedom that driving brings just leaves me feeling scared\ni seem to be feeling better mostly…"
• Tokenize (26465 tokens)
– ['i', 'have', 'found', 'to', 'be', 'helpful', 'especially', 'during', 'those', 'times', 'when', 'i', 'am', 'feeling', 'discouraged', 'i', 'have', 'a', '50km', 'commute', 'and', 'just', 'the', 'lack', 'of', 'the', 'sense', 'of', 'freedom', 'that', 'driving', 'brings', 'just', 'leaves', 'me', 'feeling', 'scared', 'i', 'feel', 'noone', 'know', 'if', 'you', 'were', 'me', 'you', 'will', 'feel', 'the', 'same', 'way', …]
• Set of Tokens (3045 distinct tokens)
– ["'", "'believe", "'d", "'en", "'encoding", "'feedlinks", "'forever", "'gets", "'http", "'ismobile", "'isprivate", "'item", "'languagedirection", "'ll", "'locale", "'ltr", "'m", "'mefaked", "'mobileclass", "'mr", "'no", "'okay", "'on", "'pagetitle", "'pagetype", "'re", "'s", "'t", "'toned", "'url", "'us", "'utf", "'ve", "'yes", '0', '034', '039', '0aeverytime', '0d', '10', '100', '101',…]
Text Mining: Simple Example
Corpus                    Word Length   Sentence Length   Lexical Diversity
We Feel Fine              4             17                8
Gutenberg Corpus
Austen-persuasion.txt     4             23                16
Bible-kjv.txt             4             33                79
Blake-poems.txt           4             18                5
Carroll-alice.txt         4             16                12
Melville-moby.txt         4             24                15
Milton-paradise.txt       4             52                15
Shakespeare-caesar.txt    4             12                8
Shakespeare-hamlet.txt    4             13                7
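Summary statistics of this kind can be computed from a tokenized corpus as sketched below. The slides do not give the formulas, so the definitions here are assumptions: average word length as characters per token, average sentence length as tokens per sentence, and lexical diversity as tokens per distinct token, each truncated to an integer. The last definition is at least consistent with the We Feel Fine counts shown earlier (26465 tokens over 3045 distinct tokens truncates to 8).

```python
def corpus_statistics(sentences):
    """Return (avg word length, avg sentence length, lexical diversity).

    sentences: list of sentences, each a list of word tokens.
    All three values are truncated to integers; the definitions are
    assumptions, not formulas taken from the slides.
    """
    words = [word for sentence in sentences for word in sentence]
    distinct = {word.lower() for word in words}
    avg_word_length = sum(len(word) for word in words) // len(words)
    avg_sentence_length = len(words) // len(sentences)
    lexical_diversity = len(words) // len(distinct)
    return avg_word_length, avg_sentence_length, lexical_diversity

if __name__ == "__main__":
    # Hypothetical two-sentence corpus.
    sample = [
        ["i", "feel", "fine"],
        ["i", "feel", "so", "happy", "today"],
    ]
    print(corpus_statistics(sample))
```

Counting characters of the raw text (spaces included) per token, rather than characters inside tokens, would give a slightly larger word length; either convention should be stated when comparing corpora.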
Text Mining: Simple Example
• Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …)
– Set of tokens (12827) with stopwords eliminated ['ab', 'abit', 'able', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', 'add', …]
– Content (11896 or 45% of tokens not stopwords – 4053 with tokens starting with apostrophes and #s eliminated)
• Stemming
– Stemmed tokens (11896) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom', 'achi', 'achiev', 'acknowledg', 'across', 'action', 'activ', …]
– Set of tokens in stemmed content (2283) ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom', 'achi', 'achiev',…]
Text Mining: Simple Example
Column sums: 416 94 90 89 83 80 80 76 76 75 … 16 16 16 16 16 16 16 16 16
Row sum, document, then counts for: like know time go think better way get good love … hear didn place almost comfort everyon sinc babi actual
3 comment1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 comment2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment3 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 comment4 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 comment5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment6 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 comment7 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
7 comment8 2 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment9 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 comment10 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
… …
2 comment1490 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1491 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
6 comment1492 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 comment1493 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 comment1494 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 comment1495 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 comment1496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1497 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 comment1498 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 comment1499 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Document-Term Matrix
Text Mining: Simple Example
Madness Murmerings Montage
Mounds Metrics Mobs
Prediction
Collective, macroscopic
trends which can be
scientifically inferred by
harnessing publicly
accessible data from
the Internet.
Prediction: Characteristics
Public
Practical
Big
Prediction: Sources
Easily accessible digital traces:
What we surf
Whom we “friend”
What we say
Where we go
What we buy
How we play
Prediction: Sample Studies
Prediction: Sample Studies
Infodemiology
Nowcasting
Culturomics
Prediction: Infodemiology
Information + Epidemiology:
Science of distribution and
determinants of information
in an electronic medium,
specifically the Internet, or
in a population, with the
ultimate aim to inform public
health and public policy
Coined by Gunther Eysenbach, Univ. of Toronto
Prediction: Infodemiology A Major Application - Practical
Regional, Weekly Syndromic Surveillance
Prediction: Infodemiology An Alternative Approach
Text Mining of Worldwide Newswires, Web Sites
and Various Offline Reports
Prediction: Infodemiology Utilizing Aggregate Search Data
Monitoring and analyzing
queries from Internet search
engines or people's status
updates on microblogs for
syndromic surveillance to
predict disease outbreaks
Prediction: Infodemiology Utilizing Aggregate Search Data
Standard Linear Prediction Model
Y(t) = b0 + b1 * X + b2 * S + b3 * Y(t-n) + e
• Y(t): Dependent Variable at Time t (Standard Publicly Available Measure)
• X: Traditional, Publicly Available Explanatory Variable
• S: Aggregate Search Index or Social Media Freq. Count
• Y(t-n): Dependent Variable at Time t - n (Standard Publicly Available Measure)
Prediction: Infodemiology Utilizing Aggregate Search Data
“Detecting Influenza Epidemics Using Search
Engine Query Data” (Ginsberg et al.), 2/19/09
• Aggregating historical logs of search queries
from 2003-2008, computing weekly time series
• Logit(P) = b0 + b1 * logit(Q) + e
– P – percentage of ILI physician visits
– Q – query fraction for the 45 highest-scoring influenza queries
• r is between .80 and .96 for 9 regions
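The model on this slide is a one-variable linear regression on logit-transformed fractions, which can be sketched in plain Python. The data below are synthetic values generated for illustration from assumed coefficients, not the Ginsberg et al. data, and the helper names are hypothetical. P and Q must be expressed as fractions strictly between 0 and 1 for the logit to be defined.

```python
import math

def logit(p):
    """Log-odds of a fraction p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1 / (1 + math.exp(-z))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def fit_logit_model(query_fractions, visit_fractions):
    """Fit logit(P) = b0 + b1 * logit(Q)."""
    return fit_line([logit(q) for q in query_fractions],
                    [logit(p) for p in visit_fractions])

def predict_visit_fraction(b0, b1, query_fraction):
    """Predicted ILI visit fraction for a given query fraction."""
    return inv_logit(b0 + b1 * logit(query_fraction))

if __name__ == "__main__":
    # Synthetic weekly data generated from assumed b0 = 0.5, b1 = 1.2.
    queries = [0.001, 0.002, 0.004, 0.008]
    visits = [inv_logit(0.5 + 1.2 * logit(q)) for q in queries]
    b0, b1 = fit_logit_model(queries, visits)
    print(round(b0, 3), round(b1, 3))
    print(predict_visit_fraction(b0, b1, 0.005))
```

Because the synthetic visits are generated without noise, the fit recovers the assumed coefficients; with real surveillance data the residual term e would be non-zero and the fit would be evaluated on held-out weeks.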
Prediction: Infodemiology Utilizing Aggregate Search Data
http://www.google.org/flutrends/about/how.html
Prediction: Infodemiology A Similar Application
http://www.google.org/denguetrends/
Prediction: Infodemiology Utilizing Tweets
?
Prediction: Infodemiology Utilizing Tweets
“Nowcasting Events from the Social Web
with Statistical Learning,” Lampos and
Cristianini, ACM IS&T, 9/11
• Text analysis of 50M tweets for 3 regions of UK
from 6/09-4/10 (303 days)
• HPA weekly reports of GP consultations with ILI
diagnosis correlated with number of “hybrid
grams”
• Average “r” of .911
Prediction: Infodemiology A Major Application – Text Analysis
50M Tweets, 3 Region UK, 6/09-4/10
• Corpus
• Corpus Refinement: Tokens -> Lower Case -> Stop Words -> Stems
• Feature Selection: 1-Grams, 2-Grams, Hybrid Grams
• N-Gram Freqs
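The n-gram frequency step can be sketched as follows. This is a minimal illustration on made-up token lists; treating "hybrid grams" as the pooled set of 1-grams and 2-grams is an assumption made for the example, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(docs, sizes=(1, 2)):
    """Count n-grams of the given sizes across all documents.

    docs: list of documents, each a list of refined tokens.
    sizes=(1, 2) pools 1-grams and 2-grams into one frequency table.
    """
    counts = Counter()
    for tokens in docs:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts

if __name__ == "__main__":
    # Hypothetical refined tweets.
    docs = [["sore", "throat", "fever"], ["fever", "cough"]]
    for gram, freq in ngram_frequencies(docs).most_common():
        print(gram, freq)
```

Daily or weekly counts per n-gram, computed this way, form the feature columns that a sparse regression can then select from.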
Prediction: Infodemiology Utilizing Tweets
Discarded when n<50
BoLasso - Bootstrap LASSO (least absolute shrinkage and selection operator)
Prediction: Nowcasting
Now + Forecasting:
Predicting the present
by analyzing large
volumes of data that
can be used to
"forecast" current
events for which
official analysis has
not been released
Prediction: Nowcasting Weather Envy
Within the next 6 hours …
Prediction: Sample Studies with Search
• Song, Pan, Ng (Apr-10)
– Dependent Variables: Weekly Hotel Bookings in Charleston, SC
– Explanatory Variables: Indexed Search Volumes from Google Trends/Insights Jan 2008-Aug 2009
– Model: Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
– Results: Tested various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.
• Kholodilin, Podstawski, Siliverstovs (Apr-10)
– Dependent Variables: Year-on-Year Growth Rate of Monthly US Real Private Consumption, ALFRED db of Fed Rsrv of St. Louis
– Explanatory Variables: 220 Google Trend/Insights Search terms related to Priv Consumption reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009
– Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
– Results: Query term principal components outperform standard Sentiment and Financial Indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.
• Choi, Varian (Apr-09)
– Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (Visitor arrival in Hong Kong)
– Explanatory Variables: Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel
– Model: Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
– Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases small gains, in others substantial.
• McLaren, Shanbhogue (Q2-11)
– Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
– Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
– Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth with query term, Home Builders and Royal Instit. of Chartered Surveyors price growth balances as exp vars.
– Results: For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
Prediction: Sample Studies with Social Media
79
Authors Date (Mnth-Year) Dependent
Variables
Explanatory Variables Model Results
Asur,
Huberman
Mar-10 Box-office
revenues for (24)
movies
Promotion tweets-retweets for a particular movie,
tweet rates for particular movie per hour, ratio of
positive to negative sentiments for the movie
Regression of 1st weekend box
office revenues by promotional
tweets-retweets, by tweet rates
vs. Hollywood Stock Exchange
prices, and 2nd weekend
revenues by tweet rates and the
sentiment ratio.
Promotional tweets are weakly
correlated 1st weekend revs. Tweet
rates are very strongly correlated
(min .9) and a stronger predictor than
HSX. Finally, tweet rates are strongly
correlated with 2nd weekend
revenues and sentiments improve
the forecasts slightly.
Gruhl, Guha, Kumar, Novak, Tomkins
Date: Aug-05
Dependent variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
Explanatory variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
Model: Cross-correlation of the time series for sales rank and mentions.
Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts a future spike in sales rank quite well.
Sadikov, Parameswaran, Venetis
Date: Aug-09
Dependent variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
Explanatory variables: Basic features that count movie references in blogs; counts that take into account the ranking and indegree of the blogs where the references appear; counts restricted to references made within a time window before or after a movie's release date; features that consider positive sentiment; and combinations of these. References are based on the spinn3r.com blog data set, 11/07-11/08.
Model: Linear regression of weekly rankings and sales data on blog references and sentiment.
Results: Minimal correlation between rankings and either references or sentiment. Strong correlation between references and gross sales, but weak with sentiment. Strongest relationships with the timing of references in the weeks after release.
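The cross-correlation approach listed for Gruhl et al. can be illustrated with a small lagged-correlation sketch. It is not their implementation: the series are synthetic, the follower series is assumed to be oriented so that higher means more sales (a raw sales rank would need to be inverted first), and the names `pearson`, `cross_correlation`, `mentions`, and `sales_signal` are hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def cross_correlation(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        paired_leader = leader[: len(leader) - lag]
        paired_follower = follower[lag:]
        result[lag] = pearson(paired_leader, paired_follower)
    return result


# Synthetic daily series (an assumption for illustration): the sales signal
# repeats the mention spikes two days later.
mentions = [1, 1, 1, 8, 2, 1, 1, 1, 9, 3, 1, 1, 1, 1]
sales_signal = [1, 1] + mentions[:-2]

correlations = cross_correlation(mentions, sales_signal, max_lag=4)

# The lag with the highest correlation suggests how far mentions lead sales.
best_lag = max(correlations, key=correlations.get)
```

A peak at a positive lag is what "a prior spike in mentions predicts a future spike" looks like in this kind of analysis; on real data the peak would be far less clean than in this constructed example.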
Prediction: Any Guesses?
Prediction: Idiom, a Sculpture of 10s of 1000s of Books
Prediction: It comes in many Shapes but not Sizes
Omphalos
Gravity Mixer
Book Cell
Matej Krén
Prediction: Culturomics
Culture + Genomics: Application of high-throughput data collection and analysis to the study of human culture.
Prediction: Culturomics
“Quantitative Analysis of Culture Using
Millions of Digitized Books,” Science, 12/16/10.
Prediction: Culturomics 2.0
http://www.youtube.com/watch?v=61qn7S9NCOs
Prediction: Culturomics 2.0
• The tone of real-time consciousness reflected in the media can
be used to forecast broad social behavior.
• Combined three massive news archives totaling more than 100
million articles worldwide to explore the global consciousness
of the news media.
• Employs a large shared-memory supercomputer (the University of
Tennessee's SGI Altix supercomputer Nautilus, with 1024
processors and 4 TB of memory)
• Using the tone and location of the reports, (claims to have)
predicted the outcome of the Arab Spring and the location of Bin
Laden within a radius of 125 miles
"Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space," Kalev Leetaru, 9/11
Prediction: Culturomics 2.0 Based on Carbon Capture Report
Prediction: Culturomics 2.0 Features of Stories or Tweets
• Tone/Positivity/Negativity. Ratio of positive to negative tone (-100 to 100)
• Polarity. Emotional charge (0 to 100)
• Activity. Intensity of "active language" (0 to 100)
• Personalization. Degree to which the writer attempts to bring the reader into the fold (0 to 100)
• Questions/Exclamations. Tweet tone indicators of non-word items
• Geocoding. Location of story content
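To make the feature list concrete, here is a toy scorer for three of these features. It is a sketch under stated assumptions, not the scoring used in Culturomics 2.0: the word lists are tiny made-up lexicons, tone is assumed to be the positive-minus-negative word share scaled to -100..100, and polarity is assumed to be the share of emotionally charged words scaled to 0..100. The names `tone_features`, `POSITIVE_WORDS`, and `NEGATIVE_WORDS` are hypothetical.

```python
import re

# Tiny illustrative lexicons (assumptions); a real system would use a
# published sentiment dictionary with thousands of entries.
POSITIVE_WORDS = {"good", "great", "hope", "peace", "win"}
NEGATIVE_WORDS = {"bad", "crisis", "fear", "war", "loss"}


def tone_features(text):
    """Score one story or tweet on tone, polarity, and punctuation cues."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(w in POSITIVE_WORDS for w in words)
    negative = sum(w in NEGATIVE_WORDS for w in words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        # Net positive share of words, scaled to the range -100..100.
        "tone": 100.0 * (positive - negative) / total,
        # Share of charged words of either sign, scaled to 0..100.
        "polarity": 100.0 * (positive + negative) / total,
        # Non-word indicators counted directly from the raw text.
        "questions": text.count("?"),
        "exclamations": text.count("!"),
    }


mixed = tone_features("Great win! But is there fear of war?")
upbeat = tone_features("Good news")
```

In the mixed example the positive and negative words cancel, so tone is neutral while polarity is high, which is the distinction the Tone and Polarity bullets draw.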
Prediction: Culturomics 2.0 Features of Stories or Tweets
Sources: 100M articles from the New York Times (1945-2005), Summary of World Broadcasts (1979-2010), and Google News (2006-2011)
Processing: sentiment mining, geocoding, and entity extraction on the Nautilus supercomputer
Outputs: feature scores, geocoding, and a 2.4-petabyte network with over 10M entities
Prediction: Culturomics 2.0 Predicting Unrest
Prediction: Culturomics 2.0 NY Times View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-1000x1000.gif
Prediction: Culturomics 2.0 SWB View of Tone
http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-1000x1000.gif