Mining and Analyzing Social Media HICSS 45 Tutorial – Part 1 Dave King January 4, 2012

Page 1: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analyzing Social Media

HICSS 45 Tutorial – Part 1

Dave King

January 4, 2012

Page 2: Mining and analyzing social media hicss 45 tutorial – part 1

Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL

Agenda: This is how the slides are organized

2

• Part 1

– Introduction – Bio, Resources, Social Media

– Data Mining – Processes and Example

– Text Mining – General Processes and Example

– Predicting the Future – The Portmanteaus

• Part 2

– Sentiment Analysis

– Social Network Analysis - Introduction

Page 3: Mining and analyzing social media hicss 45 tutorial – part 1

Biography: Dave King

• Currently, EVP of Product Development and Management at JDA Software

• 30 years in enterprise package software business

• 15 years as university professor

• 14 years as Co-Chair of the Internet & Digital Economy Track (HICSS)

• Long time interest in various aspects of E-Commerce & Business Intelligence

• Tutorial topic primarily reflects a personal interest and tangentially a job(s) related interest.

Page 4: Mining and analyzing social media hicss 45 tutorial – part 1

Personal Experiences with Analytics

• Taught applied statistics, math modeling & mathematical sociology

• In software R&D for 30 years

  – Optimization in the 80s

  – Natural Language Frontends
    • NLI Query & CMU Robotics Lab

  – EIS Competitive Analysis
    • Dow Jones and Reuters
    • Verity Topics
    • NewsAlert

  – InXight’s Hyperbolic Tree

  – Supply Chain Analytics

• In the case of text analysis and its practical application, audiences have often been small, bewildered, and fleeting

Page 5: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources

5

Page 6: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources

6

Page 7: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources

7

Page 8: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources

8

Page 9: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources: Web Sites, Online Books & Tutorials

9

• DM/Blog -- abbottanalytics.blogspot.com

• DM/Blog – blog.data-miners.com

• DM/Blog -- bx.businessweek.com/data-mining/blogs

• DM/Blog -- bytemining.com

• DM/Blog – data-mining.alltop.com

• DM/Blog -- dataminingblog.com

• DM/Blog – dataminingdownunder.com

• DM/Blog -- datamining.typepad.com

• DM/Blog -- datawrangling.com

• DM/Blog -- timmanns.blogspot.com

• DM/General -- kdnuggets.com

• DM/General -- mydatamine.com

• DM/General -- the-data-mine.com

• DM/Online Book -- chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm

• DM/Tutorial -- autonlab.org/tutorials/

Page 10: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources: Web Sites, Online Books & Tutorials

10

• TA/General -- social.textanalyticsnews.com

• TA/General -- textanalysis.info

• TM/Blog -- blogs.sas.com/text-mining

• TM/Blog -- lingpipe-blog.com

• TM/Blog -- texttechnologies.com

• TM & TA/Blog -- informationweek.com/authors/showAuthor.jhtml?authorID=1331

• TA Tutorial -- slideshare.net/SethGrimes/text-analytics-overview-2011

• TM & DM/Online Book -- statsoft.com/textbook/text-mining/

• TM & DM/Tutorial -- alias-i.com/lingpipe/demos/tutorial/db/read-me.html

• TM Tutorial -- scienceforseo.com/tutorials/text-mining-tutorial

• TM/Wiki -- textanalytics.wikidot.com

• SNA/Blog – iq.harvard.edu/blog/netgov/2011/10/

• SNA/Blog – thenetworkthinkers.com

• SNA/Blog – blog.echen.me/tag/social-network-analysis/

• SNA/Blog – lithosphere.lithium.com/t5/user/viewprofilepage/user-id/151

• SNA/Tutorial -- cs.stanford.edu/people/jure/icml09networks/

Page 11: Mining and analyzing social media hicss 45 tutorial – part 1

Mining and Analytics Resources: Web Sites, Online Books & Tutorials

11

• DA/Blog – dataists.com

• DA/Blog – drewconway.com

• Visualization/Blog – abeautifulwww.com/

• Visualization/Blog – benfry.com/writing/

• Visualization/Blog -- blog.blprnt.com

• Visualization/Blog – chrisharrison.net/index.php/visualization.com

• Visualization/Blog – datavisualization.ch/

• Visualization/Blog – eagereyes.com

• Visualization/Blog – informationandvisualization.de/

• Visualization/Blog – infosthetics.com

• Visualization/Blog – junkcharts.typepad.com/junk_charts/

• Visualization/Blog – neoformix.com

• Visualization/Blog – perpetualedge.com/blog

• Visualization/Blog – processing.org

• Visualization/Blog – visualcomplexity.com

• Visualization/Blog – well-formed-data.net/

Page 12: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined

Marta Kagan

Page 13: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined: …Sort of …

13

Page 14: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined: Actually, it’s 33 Definitions

1. Media for social interaction, using highly accessible and scalable

communication techniques.

2. Various user-driven (inbound marketing) channels (e.g., Facebook, Twitter,

blogs, YouTube).

3. Most transparent, engaging and interactive form of public relations

4. What we do and say together, worldwide, to communicate in all directions at

any time, by any possible (digital) means.

5. New marketing tool that allows you to get to know your customers and

prospects in ways that were previously not possible.

6. Platforms that enable the interactive web by engaging users to participate in,

comment on and create content as means of communicating

7. Consists of any online platform or channel for user generated content.

8. Digital content and interaction that is created by and between people.

9. Shift in how we get our information. Social media allows us to network, to find

people with like interests, and to meet people who can become friends or

customers.

10. Platforms for interaction and relationships, not content and ads.

11. Online platforms and locations that provide a way for people to participate in

these conversations.

12. People’s conversations and actions online that can be mined by advertisers

for insights but not coerced to pass along marketing messages.

13. Tools, services, and communication facilitating connection between peers

with common interests.

14. Online technologies and practices that people use to share content, opinions,

insights, experiences, perspectives, and media themselves.

15. Ever-growing and evolving collection of online tools and toys, platforms and

applications that enable all of us to interact with and share information.

Increasingly, it’s both the connective tissue and neural net of the Web.

16. Reflection of conversations happening every day, whether at the supermarket,

a bar, the train, the watercooler or the playground.

17. Online text, pictures, videos and links, shared amongst people and

organizations.

18. Not one thing. It’s five distinct things:

19. Digital, content-based communications based on the interactions enabled by a

plethora of web technologies

20. Collection of online platforms and tools that people use to share content,

profiles, opinions, insights, experiences, perspectives and media itself,

facilitating conversations and interactions online between groups of people.

21. Platform/tools.

22. Act of connecting on social media platforms.

23. How businesses join the conversation in an authentic and transparent way to

build relationships.

24. The notion that social media is about the technology that facilitates individuals

and groups of people to connect and interact, create and share.

25. Any of a number of individual web-based applications aggregating users who

are able to conduct one-to-one and one-to-many two-way conversations.

26. Media channel that relies on listening and conversation, as opposed to a

monologue, to get your point across, make a connection and build a

relationship.

27. Social media is all about leveraging online tools that promote sharing and

conversations, which ultimately lead to engagement with current and future

customers and influencers in your target market.

28. Social media: Evolution, Revolution and Contribution -by the ability of

everybody to share and contribute as a publisher

29. Social media is communication channels or tools used to store, aggregate,

share, discuss or deliver information within online communities.

30. Social Media is simply another arrow to be shot in a company’s marketing

quiver.

31. Social media platforms make it easier to share information–usually online.

32. Any object or tool, that connects people in dialogue or interaction — in

person, in print, or online.

33. Wild, Wild West of Marketing, with brands, businesses, and organizations

jostling with individuals to make news, friends, connections and build

communities in the virtual space.

14

Page 15: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined: If a Picture isn’t

worth a 1000 words, then …

15

Page 16: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined

16

Online technologies and practices

for social interaction

enabling the sharing of opinions, insights,

experiences, perspectives and media itself

Page 17: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined: Categories

17

Page 18: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Defined: Unanimous Agreement

18

Marta Kagan

Page 19: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media is Huge: Users

19

200 Million: Twitter

100 Million: LinkedIn

750 Million: Facebook

Marta Kagan

Page 20: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media is Huge!

20

If Facebook

were a country,

it would be the

3rd largest in

the world

Marta Kagan

Page 21: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Data: Research Opportunity

21

“Every day, Twitter

generates more

social network

data than the

entire field of SNA

possessed 10

years ago.”

Page 22: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media is Huge:

Usage and Content

22

Name (Symbol)     10**N    Name (Symbol)     Value
kilobyte (kB)     3        kibibyte (KiB)    2^10 = 1.024 × 10^3
megabyte (MB)     6        mebibyte (MiB)    2^20 ≈ 1.049 × 10^6
gigabyte (GB)     9        gibibyte (GiB)    2^30 ≈ 1.074 × 10^9
terabyte (TB)     12       tebibyte (TiB)    2^40 ≈ 1.100 × 10^12
petabyte (PB)     15       pebibyte (PiB)    2^50 ≈ 1.126 × 10^15
exabyte (EB)      18       exbibyte (EiB)    2^60 ≈ 1.153 × 10^18
zettabyte (ZB)    21       zebibyte (ZiB)    2^70 ≈ 1.181 × 10^21
yottabyte (YB)    24       yobibyte (YiB)    2^80 ≈ 1.209 × 10^24

Page 23: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Data: Part of a Bigger Picture

23

Page 24: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Data: Ways in which big data is creating value

24

• Makes information transparent and usable at much higher frequency.

• Provides more transactional data in digital form, that can be used to improve performance across the board.

• Allows ever-narrower segmentation of customers to tailor products or services.

• Improves decision-making through sophisticated.

• Improves the development of the next generation of products and services

Page 25: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Defined

25

Discovering meaningful

patterns from large data

sets using pattern

recognition technologies.

Page 26: Mining and analyzing social media hicss 45 tutorial – part 1

26

Data Mining: CRISP-DM

Cross-Industry Standard Process for Data Mining

Process phases: Business Understanding -> Data Understanding -> Data Preparation -> Modeling -> Evaluation -> Deployment

Data Preparation (from Real-World Data to Well-Formed Data): Data Consolidation, Data Transformation, Data Cleaning, Data Reduction

Page 27: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: General Data Assumptions

Structured

Transformed

Well-Formed

27

Page 28: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Example

28

Affinity Analysis

Page 29: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Example

1. Market Basket Analysis: Items for Sale:

Apples Bananas Cherries

2. Possible Transactions: With one item or a collection of items selected as

the Driver or Independent Variable

3. Objective is to empirically determine those groups of items that occur

frequently together in a set of transactions, producing a set of rules of the

form X -> Y.

No    X       Y
1     A       B
2     A       C
3     A       B, C
4     B       A
5     B       C
6     B       A, C
7     C       A
8     C       B
9     C       A, B
10    A, B    C
11    A, C    B
12    B, C    A

Page 30: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Example

30

Standard Market Basket Measures:

Support: Rule’s coverage (% of transactions containing both X and Y)

N(X & Y) / N(T)    Example: N(A & B) / N(T) = 2/7 = 29%

Confidence: Rule’s predictive ability (% consequent | antecedent)

N(X & Y) / N(X)    Example: N(A & B) / N(A) = 2/4 = 50%

Lift: Predictive improvement (ratio of the observed support for X & Y to the support expected if X and Y were independent)

S(XuY) / (S(X) S(Y))    Example: (2/7) / ((4/7)(5/7)) = .7 or 70%

(The worked examples use N(T) = 7, N(A) = 4, N(B) = 5 and N(A & B) = 2, not the 10 transactions listed in the table.)

Transaction ID    Items
1                 Apple, Banana, Cherry
2                 Apple
3                 Banana, Cherry
4                 Banana, Cherry
5                 Apple, Banana
6                 Apple, Banana
7                 Apple, Cherry
8                 Apple, Banana
9                 Apple, Banana, Cherry
10                Apple, Banana

Transaction ID    Apple    Banana    Cherry
1                 1        1         1
2                 1        0         0
3                 0        1         1
4                 0        1         1
5                 1        1         0
6                 1        1         0
7                 1        0         1
8                 1        1         0
9                 1        1         1
10                1        1         0
Sum               8        8         5

Page 31: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Example

31

Rule selection usually based

on minimum support & confidence

No   X      Y      N(XuY)  N(T)  S(XuY)  N(X)  Conf  N(Y)  S(X)  S(Y)  Lift   Rule
1    A      B      6       10    60%     8     75%   8     80%   80%   94%    Ok
2    A      C      3       10    30%     8     38%   5     80%   50%   75%
3    A      B, C   2       10    20%     8     25%   4     80%   40%   63%
4    B      A      6       10    60%     8     75%   8     80%   80%   94%    Ok
5    B      C      4       10    40%     8     50%   5     80%   50%   100%
6    B      A, C   2       10    20%     8     25%   3     80%   30%   83%
7    C      A      3       10    30%     5     60%   8     50%   80%   75%
8    C      B      4       10    40%     5     80%   8     50%   80%   100%   Ok
9    C      A, B   2       10    20%     5     40%   6     50%   60%   67%
10   A, B   C      2       10    20%     6     33%   5     60%   50%   67%
11   A, C   B      2       10    20%     3     67%   8     30%   80%   83%
12   B, C   A      2       10    20%     4     50%   8     40%   80%   63%

Lift = S(XuY) / (S(X) S(Y))

Parameters

Min. Support 40%

Min. Confidence 75%
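The rule table can be recomputed directly from the 10 transactions. The following is a minimal sketch, not the author's spreadsheet: the function names are illustrative, items are abbreviated to A, B and C, and exact fractions are used so that the 75% confidence threshold is compared without floating-point rounding.

```python
from fractions import Fraction
from itertools import combinations

# The 10 baskets from the transaction table (A = Apple, B = Banana, C = Cherry).
TRANSACTIONS = [
    {"A", "B", "C"}, {"A"}, {"B", "C"}, {"B", "C"}, {"A", "B"},
    {"A", "B"}, {"A", "C"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in TRANSACTIONS if itemset <= basket)
    return Fraction(hits, len(TRANSACTIONS))


def measures(x, y):
    """Support, confidence and lift for the rule X -> Y."""
    s_xy, s_x, s_y = support(x | y), support(x), support(y)
    return {"support": s_xy, "confidence": s_xy / s_x, "lift": s_xy / (s_x * s_y)}


def candidate_rules(items):
    """Yield every (X, Y) split of every itemset with two or more items."""
    items = sorted(items)
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            for k in range(1, size):
                for x in combinations(combo, k):
                    yield frozenset(x), frozenset(combo) - frozenset(x)


def select_rules(min_support=Fraction(40, 100), min_confidence=Fraction(75, 100)):
    """Keep the rules that meet both minimum thresholds."""
    selected = []
    for x, y in candidate_rules("ABC"):
        m = measures(x, y)
        if m["support"] >= min_support and m["confidence"] >= min_confidence:
            selected.append(("".join(sorted(x)), "".join(sorted(y)), m))
    return selected


if __name__ == "__main__":
    for x, y, m in select_rules():
        print(f"{x} -> {y}: support={float(m['support']):.0%}, "
              f"confidence={float(m['confidence']):.0%}, lift={float(m['lift']):.0%}")
```

With three items the generator produces the same 12 candidate rules as the slide; only the thresholds decide which ones are kept.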

Page 32: Mining and analyzing social media hicss 45 tutorial – part 1

Data Mining: Simple Example

But, what if the baskets were described in the

following manner:

– Jane bought a handful of maraschinos and a couple of

granny smiths.

– Harold purchased a bag of appls and 2 bananas.

– Bill paid for a pound of cherries but decided not to buy

the three durians because of their odor.

How could we automate the analysis?

Page 33: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Data:

33

Page 34: Mining and analyzing social media hicss 45 tutorial – part 1

Social Media Data: Commonality?

34

Page 35: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Defined

35

Using data mining to discover patterns

in a collection of documents

Page 36: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: CRISP-Like Processes

Process phases: Business Understanding -> Document Understanding -> Document Preparation -> Modeling -> Evaluation -> Deployment

Document Preparation (from Real-World Text Data / Documents to Term-Doc-Matrix*): Document Consolidation, Establish the Corpus, Corpus Refinement (Token, Stem, Stop…), Feature Selection & Weighting

36

Page 37: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining Process: Sample Corpora

• Brown Corpus – first million-word corpus, compiled in the 60s at Brown U.; 500 samples across 15 genres, each ~2000 words with POS tags (Lancaster-Oslo-Bergen Corpus – British equivalent)

• Linguistic Data Consortium Treebanks – collections of manually tagged and parsed (tree-structured) sentences from a variety of sources (includes the well-known Penn Treebank collection)

• Reuters 21578, RCV1 & V2, TRC2 – collections of 1000s of Reuters English & multi-lingual news stories classified into topics and grouped into training & test sets

• Pang & Lee’s Sentiment Analysis – 1000 positive and 1000 negative movie reviews

• MEDLINE – an extensive collection of articles and abstracts (18M+) used in a variety of biomedical and linguistic text mining applications

• WordNet® – large lexical database of English, with words grouped into sets of cognitive synonyms (synsets) and interlinked by means of conceptual-semantic and lexical relations.

• 20 Newsgroups – collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each representing a different topic.

Page 38: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining Process: Corpus Refinement

• Tokenization – Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.

• Normalize – Convert the terms to lowercase.

• Eliminate stop words – Eliminate terms that appear very often (e.g. the, and, …).

• Stemming – Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved -> achiev) [note: word about synonyms – WordNet Synset]

Tokenization -> Normalize -> Eliminate Stop Words -> Stemming

Common representation of tokens within and between documents
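The four refinement steps can be chained in a few lines. This is only a sketch under stated assumptions: the stop-word list is a tiny illustrative one, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, so it will not match a real stemmer on most words.

```python
import re

# Illustrative stop words only; real lists are far longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it"}

# Suffixes tried in this order by the crude stemmer below.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def tokenize(text):
    """Split text into runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def refine(text):
    """Tokenize -> normalize -> eliminate stop words -> stem."""
    tokens = remove_stop_words(normalize(tokenize(text)))
    return [crude_stem(token) for token in tokens]


if __name__ == "__main__":
    print(refine("Achieve, achieves and ACHIEVED the goal"))
```

The order matters: lowercasing before the stop-word lookup lets "The" and "the" match the same list entry.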

Page 39: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Feature Extraction & Weighting

Feature

Extraction

Vector Representation ->

Word, Term, Token or Pairs-Triplets

x Doc Matrix

“Bag of Words, Terms 

or Tokens”

Words or Tokens are

attributes and documents

are examples

Token1 Token2 Token3 Token4 …

Doc1 1 2 2 4

Doc2 4 2 3 0

Doc3 1 1 1 0

Doc4 1 1 1 2
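A bag-of-words matrix like this one can be built by counting tokens per document. The sketch below assumes each document is already a list of refined tokens; the function name and the alphabetical column order are illustrative choices.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document, one column per token."""
    counts = [Counter(doc) for doc in docs]
    vocabulary = sorted(set().union(*docs))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


if __name__ == "__main__":
    docs = [["feel", "good", "feel"], ["good", "day"]]
    vocabulary, matrix = document_term_matrix(docs)
    print(vocabulary)
    for row in matrix:
        print(row)
```

Word order is discarded on purpose: each document becomes a vector of counts, which is what makes it usable as input to standard data mining methods.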

Page 40: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Transforming Frequencies

• Binary Frequencies: 1 for tf > 0; otherwise 0

• Term Frequencies: tf(i,j) divided by the sum of tf(k,j) over all terms k in doc j

• Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0

• Normalized Frequencies: divide each frequency by the SQRT of the sum of squares of the frequencies within the vector (column)

• Term Frequency–Inverse Document Frequency: TF * IDF

– Inverse Document Frequency: log(N/(1+D)), where N is the total number of docs and D is the number of docs containing the term
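Each transformation is a one-liner once the counts are in a matrix. The sketch below uses the 4 x 4 count matrix from the feature-extraction slide, with rows as documents and columns as tokens; it assumes natural logarithms, and the function names are illustrative. Note that with the log(N/(1+D)) form, a token that appears in every document gets a negative weight.

```python
import math

# Rows are Doc1..Doc4, columns are Token1..Token4.
COUNTS = [
    [1, 2, 2, 4],
    [4, 2, 3, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 2],
]


def binary(tf):
    """1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_frequency(tf):
    """1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def relative_frequencies(doc_row):
    """Each count divided by the total count in the document."""
    total = sum(doc_row)
    return [tf / total if total else 0.0 for tf in doc_row]


def l2_normalize(vector):
    """Each value divided by the square root of the sum of squares."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm if norm else 0.0 for v in vector]


def idf(matrix, col):
    """log(N / (1 + D)) for the token in column col."""
    n_docs = len(matrix)
    docs_with_term = sum(1 for row in matrix if row[col] > 0)
    return math.log(n_docs / (1 + docs_with_term))


def tf_idf(matrix):
    """Multiply every raw count by the IDF weight of its column."""
    weights = [idf(matrix, col) for col in range(len(matrix[0]))]
    return [[tf * w for tf, w in zip(row, weights)] for row in matrix]


if __name__ == "__main__":
    for row in tf_idf(COUNTS):
        print([round(value, 3) for value in row])
```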

Page 41: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

41

Listening Post is an art installation by Mark

Hansen and Ben Rubin that culls text

fragments in real time from thousands of

unrestricted Internet chat rooms, bulletin

boards and other public forums.

Page 42: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

42

Page 43: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

43

Blogs “I feel”

“I’m feeling”

Every

10 Mins

Contains

1 of 5000

Pre-Determined

Feelings

15-20K

Feelings

Per Day

sentence

imageid

feeling

posttime

postdate

posturl

gender

born

country

state

city

lat

lon

conditions

Page 44: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

44

API

http://api.wefeelfine.org:8080/ShowFeelings?display=xml&returnfields=Sentence&postdate=2010-11-25&limit=500

Query Result

<?xml version="1.0" ?>
<feelings>
<feeling imageid="-mZmybPrOGTZ+xukpcU7jg"
feeling="better"
sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
posttime="1321633467"
postdate=2010-11-25="0"
posturl="http://jenngreenleaf.blogspot.com/2011/11/im-coming-down-with-cold-or-am-i.html"
gender="0" country="united states"
state="maine" city="richmond"
lat="44.091522" lon="-69.801787"
conditions="4" />
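Building the query string and reading the attributes back out can be sketched with the standard library alone. Assumptions are flagged in the comments: the parameter names are taken from the URL on this slide, the sample XML is an abbreviated, well-formed stand-in for the result shown (the `postdate` attribute is left out because its value is not legible), and no network call is made because the service's availability is not known.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint as it appears on the slide.
BASE_URL = "http://api.wefeelfine.org:8080/ShowFeelings"

# Abbreviated stand-in for a query result; not a verbatim API response.
SAMPLE_XML = """<?xml version="1.0" ?>
<feelings>
<feeling feeling="better"
 sentence="i feel almost 100 better aside from that weird sandy feeling in my throat"
 country="united states" state="maine" city="richmond"
 lat="44.091522" lon="-69.801787" conditions="4" />
</feelings>"""


def build_query(postdate, limit=500, returnfields="Sentence"):
    """Assemble the request URL from the parameters shown on the slide."""
    params = {
        "display": "xml",
        "returnfields": returnfields,
        "postdate": postdate,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


def parse_feelings(xml_text):
    """Return one attribute dictionary per <feeling> element."""
    root = ET.fromstring(xml_text)
    return [dict(node.attrib) for node in root.iter("feeling")]


if __name__ == "__main__":
    print(build_query("2010-11-25"))
    for record in parse_feelings(SAMPLE_XML):
        print(record["feeling"], "|", record["sentence"])
```

The `sentence` values collected this way are what feed the tokenization step in the slides that follow.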

Page 45: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

• i'm done believing you don't know what i'm feeling

• i feel so out of place

• i'm feeling healthy

• i never feel down when i'm with her

• i love the feeling

• i feel like i've been run over by a truck

• i feel so positive today

• i feel like a poor man's pin up girl

45

Page 46: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

• Input String (128925 chars; 24282 spaces) – "i have found to be helpful especially during those times when i am feeling discouraged\ni have a 50km commute and just the lack of the sense of freedom that driving brings just leaves me feeling scared\ni seem to be feeling better mostly…"

• Tokenize (26465 tokens) – ['i', 'have', 'found', 'to', 'be', 'helpful', 'especially', 'during', 'those', 'times', 'when', 'i', 'am', 'feeling', 'discouraged', 'i', 'have', 'a', '50km', 'commute', 'and', 'just', 'the', 'lack', 'of', 'the', 'sense', 'of', 'freedom', 'that', 'driving', 'brings', 'just', 'leaves', 'me', 'feeling', 'scared', 'i', 'feel', 'noone', 'know', 'if', 'you', 'were', 'me', 'you', 'will', 'feel', 'the', 'same', 'way', …]

• Set of Tokens (3045 distinct tokens) – ["'", "'believe", "'d", "'en", "'encoding", "'feedlinks", "'forever", "'gets", "'http", "'ismobile", "'isprivate", "'item", "'languagedirection", "'ll", "'locale", "'ltr", "'m", "'mefaked", "'mobileclass", "'mr", "'no", "'okay", "'on", "'pagetitle", "'pagetype", "'re", "'s", "'t", "'toned", "'url", "'us", "'utf", "'ve", "'yes", '0', '034', '039', '0aeverytime', '0d', '10', '100', '101', …]

Page 47: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

Corpus                    Word Length    Sentence Length    Lexical Diversity
We Feel Fine              4              17                 8
Gutenberg Corpus
Austen-persuasion.txt     4              23                 16
Bible-kjv.txt             4              33                 79
Blake-poems.txt           4              18                 5
Carroll-alice.txt         4              16                 12
Melville-moby.txt         4              24                 15
Milton-paradise.txt       4              52                 15
Shakespeare-caesar.txt    4              12                 8
Shakespeare-hamlet.txt    4              13                 7
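The three columns look like truncated integer ratios. That reading is an assumption rather than something the slides state, but it is consistent with the counts reported earlier for We Feel Fine: 128925 characters / 26465 tokens truncates to 4, and 26465 tokens / 3045 distinct tokens truncates to 8. A sketch under that assumption, treating each line of the input as one sentence:

```python
import re


def corpus_stats(raw):
    """Return (word length, sentence length, lexical diversity) as truncated ratios.

    Assumes one sentence per line and counts raw characters, spaces included.
    """
    sentences = [line for line in raw.split("\n") if line.strip()]
    words = re.findall(r"[a-z0-9']+", raw.lower())
    word_length = len(raw) // len(words)
    sentence_length = len(words) // len(sentences)
    lexical_diversity = len(words) // len(set(words))
    return word_length, sentence_length, lexical_diversity


if __name__ == "__main__":
    print(corpus_stats("i feel fine\ni feel so good today"))
```

Under this definition a higher lexical diversity score means each distinct word is reused more often, so a long text with a small vocabulary scores highest.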

Page 48: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

• Eliminate Stopwords (175 words - 'a', 'about', 'above', 'after', …)

– Set of tokens (12827) with stopwords eliminated: ['ab', 'abit', 'able', 'abs', 'absolute', 'absolutely', 'absorb', 'abuse', 'accomplished', 'accomplishment', 'achieve', 'achieved', 'across', 'acted', 'action', 'activities', 'activity', 'actually', 'acura', 'add', …]

– Content (11896 or 45% of tokens not stopwords – 4053 with tokens starting with apostrophes and #s eliminated)

• Stemming

– Stemmed tokens (11896): ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom', 'achi', 'achiev', 'acknowledg', 'across', 'action', 'activ', …]

– Set of tokens in stemmed content (2283): ['abdomen', 'abdul', 'abil', 'abl', 'abrupt', 'absolut', 'abstract', 'academ', 'accept', 'accid', 'accomplish', 'accur', 'accus', 'accustom', 'achi', 'achiev', …]

Page 49: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

Page 50: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

50

Sum  WeFeel       like know time go think better way get good love …  hear didn place almost comfort everyon sinc babi actual
     Sum          416  94   90   89 83    80     80  76  76   75   …  16   16   16    16     16      16      16   16   16
3    comment1     0    0    1    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
0    comment2     0    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment3     0    1    0    0  0     0      1   0   0    0    …  0    0    0     0      0       0       0    0    0
1    comment4     0    0    0    0  0     1      0   0   0    0    …  0    0    0     0      0       0       0    0    0
1    comment5     1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment6     0    1    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
7    comment7     1    1    0    0  0     0      0   1   0    0    …  0    0    0     0      0       0       0    0    0
7    comment8     2    1    0    1  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment9     1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
6    comment10    0    0    2    0  0     0      0   1   0    0    …  0    0    0     0      0       0       0    0    0
…    …
2    comment1490  1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment1491  0    0    0    0  0     0      1   0   0    0    …  0    0    0     0      0       0       0    0    0
6    comment1492  0    1    0    0  0     0      1   0   0    0    …  0    0    0     0      0       0       0    0    0
3    comment1493  1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
4    comment1494  1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
4    comment1495  2    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
1    comment1496  0    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment1497  0    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
2    comment1498  1    0    0    0  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0
3    comment1499  1    0    0    1  0     0      0   0   0    0    …  0    0    0     0      0       0       0    0    0

Document-Term Matrix

Page 51: Mining and analyzing social media hicss 45 tutorial – part 1

Text Mining: Simple Example

Madness Murmurings Montage

Mounds Metrics Mobs

Page 52: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction

52

Collective, macroscopic

trends which can be

scientifically inferred by

harnessing publicly

accessible data from

the Internet.

Page 53: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Characteristics

53

Public

Practical

Big

Page 54: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Sources

54

What we surf

Whom we “friend”

What we say

Where we go

What we buy

How we play

Easily accessible digital traces:

Page 55: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Sample Studies

55

Page 56: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Sample Studies

56

Infodemiology

Nowcasting

Culturomics

Page 57: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology

57

Information + Epidemiology:

Science of distribution and

determinants of information

in an electronic medium,

specifically the Internet, or

in a population, with the

ultimate aim to inform public

health and public policy

Coined by Gunther Eysenbach, Univ. of Toronto

Page 58: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology A Major Application - Practical

58

Page 59: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology A Major Application - Practical

59

Regional, Weekly Syndromic Surveillance

Page 60: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology An Alternative Approach

60

Text Mining of Worldwide Newswires, Web Sites

and Various Offline Reports

Page 61: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

Monitoring and analyzing

queries from Internet search

engines or peoples' status

updates on microblogs for

syndromic surveillance to

predict disease outbreaks

61

Page 62: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

62

Page 63: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

63

Page 64: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

64

Standard Linear Prediction Model

$y_t = b_0 + b_1\,y_{t-n} + b_2\,x_t + b_3\,s_t + e$

• $y_t$: dependent variable at time t (standard publicly available measure)

• $y_{t-n}$: dependent variable at time t - n (standard publicly available measure)

• $x_t$: traditional, publicly available explanatory variable

• $s_t$: aggregate search index or social media frequency count
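A model of this shape can be fit by ordinary least squares. The sketch below assumes the layout y_t = b0 + b1·y_(t-n) + b2·x_t + b3·s_t + e, where x is the traditional explanatory variable and s is the search index or social media count. The symbols, the function names and the choice of solving the normal equations in pure Python are all illustrative and not taken from the slides.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)


def lagged_design(y, x, s, lag=1):
    """Build predictor rows [y[t-lag], x[t], s[t]] and targets y[t] for t >= lag."""
    rows = [[y[t - lag], x[t], s[t]] for t in range(lag, len(y))]
    return rows, y[lag:]


if __name__ == "__main__":
    # Synthetic, noise-free series generated from known coefficients.
    x = [float(t % 4) for t in range(30)]
    s = [float((t * t) % 7) for t in range(30)]
    y = [1.0]
    for t in range(1, 30):
        y.append(0.5 + 0.3 * y[t - 1] + 2.0 * x[t] + 1.5 * s[t])
    rows, targets = lagged_design(y, x, s)
    print([round(b, 4) for b in fit_ols(rows, targets)])
```

The question such studies ask is whether adding s_t improves the fit over a baseline that uses only the lagged measure and the traditional variable.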

Page 65: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

“Detecting Influenza Epidemics Using Search Engine Query Data” (Ginsberg et al.), 2/19/09

• Aggregated historical logs of search queries from 2003-2008, computing weekly time series

• logit(P) = b0 + b1 * logit(Q) + e

– P – percentage of physician visits for influenza-like illness (ILI)

– Q – ILI-related query fraction, based on the 45 highest-scoring influenza queries

• r is between .80 and .96 across 9 regions
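The logit-on-logit model is a simple one-predictor regression after transforming both fractions. The sketch below fits it by closed-form least squares on made-up numbers; the data, function names and coefficient values are illustrative only and are not taken from the paper.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map log-odds back to a proportion."""
    return 1 / (1 + math.exp(-z))


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_ili_model(query_fractions, visit_fractions):
    """Fit logit(P) = b0 + b1 * logit(Q)."""
    return fit_line([logit(q) for q in query_fractions],
                    [logit(p) for p in visit_fractions])


def predict_ili(b0, b1, query_fraction):
    """Predicted ILI visit fraction for a given query fraction."""
    return inv_logit(b0 + b1 * logit(query_fraction))


if __name__ == "__main__":
    # Made-up weekly values generated from known coefficients.
    q = [0.001, 0.002, 0.004, 0.008]
    p = [inv_logit(-1.0 + 0.8 * logit(v)) for v in q]
    b0, b1 = fit_ili_model(q, p)
    print(round(b0, 4), round(b1, 4), round(predict_ili(b0, b1, 0.005), 4))
```

The logit transform keeps predictions inside (0, 1), which a plain linear fit on percentages would not guarantee.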

65

Page 66: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

66

http://www.google.org/flutrends/about/how.html

Page 67: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Aggregate Search Data

67

Page 68: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology A Similar Application

68

http://www.google.org/denguetrends/

Page 69: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

69

?

Page 70: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

70

Page 71: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

“Nowcasting Events from the Social Web with Statistical Learning,” Lampos and Cristianini, ACM IS&T, 9/11

• Text analysis of 50M tweets for 3 regions of the UK from 6/09-4/10 (303 days)

• HPA weekly reports of GP consultations with ILI diagnosis correlated with number of “hybrid grams”

• Average “r” of .911

71

Page 72: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology A Major Application – Text Analysis

72

Diagram (text analysis pipeline):

50M Tweets (3 Region UK, 6/09-4/10)

→ Corpus

→ Corpus Refinement: Tokens → Lower Case → Stop Words → Stems

→ Feature Selection: 1-Grams, 2-Grams, Hybrid Grams

→ N-Gram Freqs
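
The corpus refinement and n-gram counting steps named on this slide can be sketched roughly as follows. The stop-word list, the suffix-stripping naive_stem function (a crude stand-in for a real stemmer such as Porter's), and the sample tweets are all assumptions for illustration, not the pipeline used in the study.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "i", "have", "the", "is", "of", "to", "my"}


def naive_stem(token):
    """Strip a few common suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Lower-case, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def ngram_counts(tweets, n):
    """Count n-grams of stemmed tokens across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        stems = preprocess(tweet)
        for i in range(len(stems) - n + 1):
            counts[" ".join(stems[i:i + n])] += 1
    return counts


tweets = [
    "I have the flu and a sore throat",
    "Coughing and sneezing, the flu is back",
]
unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The resulting counts per time window are the kind of n-gram frequencies that would then feed feature selection.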

Page 73: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

73

Discarded when n<50

BoLasso - Bootstrap LASSO (least absolute shrinkage and selection operator)
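
A minimal sketch of the bootstrap LASSO idea: fit a LASSO on many bootstrap resamples and keep only the features that are selected consistently. The hand-rolled coordinate-descent solver, the consensus parameter, the penalty alpha, and the synthetic data are all assumptions for illustration; this is not the implementation used by Lampos and Cristianini.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate-descent LASSO for (1/2n)*||y - Xb||^2 + alpha*||b||_1 (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's own contribution added back.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j / n
            # Soft-thresholding sets small coefficients exactly to zero.
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return b


def bolasso_support(X, y, alpha, n_boot=20, consensus=1.0, seed=0):
    """Indices of features with nonzero LASSO weight in at least
    `consensus` (a fraction) of the bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        hits += lasso_cd(X[idx], y[idx], alpha) != 0
    return [j for j in range(p) if hits[j] / n_boot >= consensus]


# Synthetic data: only features 0 and 3 drive y (hypothetical example).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

selected = bolasso_support(X, y, alpha=0.1)
print(selected)
```

With consensus=1.0 a feature must survive every resample (a strict intersection); lowering it gives a softer vote. In the tweet setting, the columns of X would be n-gram frequencies and y the official ILI rate.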

Page 74: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

74

Page 75: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Infodemiology Utilizing Tweets

75

Page 76: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Nowcasting

76

Now + Forecasting: Predicting the present by analyzing large volumes of data that can be used to "forecast" current events for which official analysis has not been released

Page 77: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Nowcasting Weather Envy

77

Within the next 6 hours …

Page 78: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Sample Studies with Search

78

Authors: Song, Pan, Ng
Date (Month-Year): Apr-10
Dependent Variables: Weekly hotel bookings in Charleston, SC
Explanatory Variables: Indexed search volumes from Google Trends/Insights, Jan 2008-Aug 2009
Model: Log of room nights regressed on log of search volumes for Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
Results: Tested various statistical models; all gave reasonable forecasts. The best-fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.

Authors: Kholodilin, Podstawski, Siliverstovs
Date (Month-Year): Apr-10
Dependent Variables: Year-on-year growth rate of monthly US real private consumption (URPC), from the ALFRED db of the Fed Rsrv of St. Louis
Explanatory Variables: 220 Google Trends/Insights search terms related to private consumption, reduced to 10 principal components, for monthly periods from Jan 2005 to Dec 2009
Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short-term and long-term interest rates and S&P 500); Query (combinations of principal components of query terms)
Results: Query term principal components outperform standard Sentiment and Financial indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.

Authors: Choi, Varian
Date (Month-Year): Apr-09
Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (visitor arrivals in Hong Kong)
Explanatory Variables: Google Trends/Insights query indices for categories and subcategories related to retail sales (general and specific) and related to travel
Model: Google Trends indices for query subcategories related to (log values of) overall monthly retail trade (NAICS categories), automotive sales, home sales and travel
Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trends variables tend to outperform models that exclude these variables. In some cases the gains are small, in others substantial.

Authors: McLaren, Shanbhogue
Date (Month-Year): Q2-11
Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
Explanatory Variables: Google Trends/Insights query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing
Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confidence as explanatory variables; for housing price growth, linear AR model with query term and the Home Builders Federation (HBF) and Royal Institution of Chartered Surveyors (RICS) price growth balances as explanatory variables
Results: For unemployment forecasts, claimant count was strongest, followed by the query term. For housing prices, the query term was much stronger than the HBF and RICS data.

Page 79: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Sample Studies with Social Media

79

Authors: Asur, Huberman
Date (Month-Year): Mar-10
Dependent Variables: Box-office revenues for (24) movies
Explanatory Variables: Promotion tweets-retweets for a particular movie, tweet rates for a particular movie per hour, ratio of positive to negative sentiments for the movie
Model: Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange (HSX) prices, and 2nd weekend revenues by tweet rates and the sentiment ratio
Results: Promotional tweets are weakly correlated with 1st weekend revenues. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues, and sentiments improve the forecasts slightly.

Authors: Gruhl, Guha, Kumar, Novak, Tomkins
Date (Month-Year): Aug-05
Dependent Variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
Model: Cross correlation of time series for sales rank and mentions
Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts a future spike in sales rank quite well.

Authors: Sadikov, Parameswaran, Venetis
Date (Month-Year): Aug-09
Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
Explanatory Variables: Basic features that count movie references in blogs; features that count movie references taking into account the ranking and indegree of the blogs where they appear; features that consider only references made within a time window before or after a movie release date; features that consider positive sentiment; and combinations of these. References are based on the spinn3r.com blog data set, 11/07-11/08.
Model: Linear regression for weekly rankings and sales data by blog references and sentiment
Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales, but weak with sentiment. Strongest relationships with timing of references in weeks after release.
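
The cross-correlation approach listed for the blog-mentions study can be sketched as below: correlate one series against lagged copies of the other and look for the lag with the strongest relationship. The function names and the toy series are assumptions for illustration. The example uses a generic sales signal where higher means more sales, whereas an actual sales rank runs the other way (lower is better).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lagged_correlations(mentions, sales, max_lag):
    """Correlate mentions[t] with sales[t + lag] for lag = 0..max_lag."""
    return {
        lag: pearson(mentions[: len(mentions) - lag], sales[lag:])
        for lag in range(max_lag + 1)
    }


def best_lead(mentions, sales, max_lag):
    """Lag (in periods) at which mentions best anticipate sales."""
    corr = lagged_correlations(mentions, sales, max_lag)
    return max(corr, key=corr.get)


# Toy daily series: the sales signal repeats the mention spikes 2 days later.
mentions = [1, 1, 5, 9, 4, 1, 1, 2, 8, 3, 1, 1]
sales = [1, 1] + mentions[:-2]

corr = lagged_correlations(mentions, sales, max_lag=4)
lead = best_lead(mentions, sales, max_lag=4)
print(lead, round(corr[lead], 3))
```

A peak at a positive lag suggests that mentions lead the sales signal, which is the pattern the study reports for spikes.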

Page 80: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Any Guesses?

80

Page 81: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Idiom, a Sculpture of 10s of 1000s of Books

81

Page 82: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: It comes in many Shapes but not Sizes

82

Omphalos

Gravity Mixer

Book Cell

Matej Krén

Page 83: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics

83

Culture + Genomics: Application of high-throughput data collection and analysis to the study of human culture.

Page 84: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics

84

“Quantitative Analysis of Culture Using Millions of Digitized Books,” Science, 12/16/10.

Page 85: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0

85

http://www.youtube.com/watch?v=61qn7S9NCOs

Page 86: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0

• The tone of real-time consciousness reflected in the media can be used to forecast broad social behavior.

• Combined three massive news archives totaling more than 100 million articles worldwide to explore the global consciousness of the news media.

• Employs a large shared-memory supercomputer (University of Tennessee SGI Altix supercomputer Nautilus with 1024 processors and 4 TB of memory).

• Using the tone and location of the reports, (claims to have) predicted the outcome of the Arab Spring and the location of Bin Laden within a radius of 125 miles.

86

“Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space,” Kalev Leetaru, 9/11

Page 87: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0 Based on Carbon Capture Report

87

Page 88: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0 Based on Carbon Capture Report

88

Page 89: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0 Features of Stories or Tweets

• Tone/Positivity/Negativity. Ratio of + to - tone (-100 to 100)

• Polarity. Emotional charge (0 to 100)

• Activity. Intensity of "active language" (0 to 100)

• Personalization. Degree to which the writer attempts to bring the reader into the fold (0 to 100)

• Questions/Exclamations. Tweet tone indicators of non-word items

• Geocoding. Location of story content
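
To make a few of these features concrete, here is a toy lexicon-based scorer. The word lists and the scoring formulas (tone as the balance of positive and negative words scaled to -100..100, polarity as the share of emotionally charged words scaled to 0..100) are assumptions chosen for illustration; the slides do not give the actual definitions used in the Culturomics 2.0 work.

```python
import re

# Tiny hypothetical sentiment lexicons; real ones contain thousands of words.
POSITIVE = {"good", "great", "peace", "hope", "win"}
NEGATIVE = {"bad", "war", "fear", "riot", "loss"}


def tone_features(text):
    """Toy tone, polarity, and punctuation features for a story or tweet."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    charged = pos + neg
    # Assumed formula: balance of positive vs. negative words, -100..100.
    tone = 100.0 * (pos - neg) / charged if charged else 0.0
    # Assumed formula: share of words carrying any emotional charge, 0..100.
    polarity = 100.0 * charged / len(words) if words else 0.0
    return {
        "tone": tone,
        "polarity": polarity,
        "questions": text.count("?"),
        "exclamations": text.count("!"),
    }


mixed = tone_features("Fear of war, but hope for peace!")
upbeat = tone_features("Great win!")
print(mixed)
print(upbeat)
```

Activity, personalization, and geocoding would need further resources (verb lexicons, pronoun counts, a gazetteer) and are left out of this sketch.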

89

Page 90: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0 Features of Stories or Tweets

90

Diagram (processing pipeline):

100M Articles from the New York Times (1945-2005), Summary of World Broadcasts (1979-2010), and Google News articles (2006-2011)

→ Nautilus Supercomputer: Sentiment Mining, Geocoding, Entity Extraction

→ Feature Scores, Geocoding

→ 2.4 Petabyte Network with over 10M entities

Page 91: Mining and analyzing social media hicss 45 tutorial – part 1

Prediction: Culturomics 2.0 Predicting Unrest

91