
Identifying Relevant Social Media Content: Leveraging Information Diversity and User Cognition

Munmun De Choudhury¹, Scott Counts² & Mary Czerwinski²

¹Rutgers, The State University of New Jersey

²Microsoft Research, Redmond

6/8/11 2

Modern Social Interactional Modes

Facebook, Slashdot, Engadget, Flickr, LiveJournal, Digg, YouTube, Blogger, MetaFilter, Reddit, MySpace, Orkut, Twitter

140 characters can cause revolutions

During the elections in Iran

And during the earthquake in Haiti

However the social web is changing at a fast rate

And what exactly is changing?

New people appear

New ties are formed

New interactional data appears too!

{pink, story, design}

{visualization, Environment}

We are attracted to social media, in part due to large-scale datasets

Is there something more fundamental happening here than just scale?

This talk is about selecting content that matters

“Information overload” problem – Get me the right content!


How do we identify the most “relevant” or “best” items on a topic, from millions and even billions of units of social media content?

Let’s contrast this with a familiar example

Discrete, regular and fixed sampling lattice

• Shannon-Nyquist sampling theorem: “If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.”
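As a small worked illustration of the spacing quoted in the theorem, the sketch below computes the largest admissible sample spacing 1/(2B) for a band-limited signal. The function name and the 4 kHz example value are assumptions made for illustration; they do not come from the talk.

```python
def nyquist_interval(bandwidth_hz: float) -> float:
    """Largest sample spacing, in seconds, allowed by the theorem: 1 / (2B)."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    return 1.0 / (2.0 * bandwidth_hz)


# Hypothetical example: a signal with no frequencies above 4 kHz.
interval = nyquist_interval(4000.0)
samples_per_second = 1.0 / interval
```

The spacing is fixed once B is known, which is the property the next slides contrast with web activity.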

Time to sample each pixel is constant

Note that web activity has no notion of bandwidth!

Interfaces / tools                                       #Responses
Twitter website                                          50
Twitter clients, such as Tweetdeck, Twitterrific etc.    25
Search engines, such as Bing Social                      19
Third-party apps, such as Twitter plugin for Google      9

Uni-dimensional information presentation; but social media information is diverse.

Characteristics of social media – high dimensionality

Geography, Authority, Conversational nature, Thematic category

Information Diversity

[Simon 1971, Zaichkowsky 1985, Jost 2006]

Also, social media content selection needs to benefit from mechanisms of human cognition

“Goodness of a set” – using measures of human information processing

Engagement, Memory encoding, Interestingness, Informativeness

How do we select content that matches a certain degree of information diversity?

Dimensional Importance

• Survey-based feedback on the importance of different dimensions, referred to as “concentration parameters”.
– Participants (11 ‘active’ Twitter users) were asked to rate each of the tweet dimensions on a scale of 1 through 7, where 1 implied “not important at all” and 7 meant “highly important”.
– The survey also allowed them to identify other dimensions that they might consider significant.
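The slides do not say how the 1 to 7 ratings are turned into per-dimension weights. As a minimal sketch, assuming the weight of a dimension is its mean rating normalized so that all weights sum to 1, the computation could look as follows. The function name, the dimension labels, and the ratings are hypothetical.

```python
from statistics import mean


def concentration_parameters(ratings):
    """Map {dimension: [ratings on a 1..7 scale]} to weights that sum to 1.

    Assumption: weight = mean rating / sum of mean ratings.
    """
    means = {dim: mean(values) for dim, values in ratings.items()}
    total = sum(means.values())
    return {dim: m / total for dim, m in means.items()}


# Hypothetical ratings from three participants (not data from the study).
survey = {
    "geography": [2, 3, 4],
    "authority": [6, 7, 5],
    "thematic category": [6, 6, 6],
}
weights = concentration_parameters(survey)
```

Dimensions that participants rate as more important receive proportionally larger weights in the tweet vectors.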


Social Media Content Selection

• Every tweet t_i is represented as a vector of its dimensions and their corresponding weights.
• We propose iterative clustering for tweet set generation, based on an entropy distortion minimization technique.
– The sets are constructed given a sampling ratio ρ and a diversity parameter value ω.
– The (sub-)optimal set to be constructed is represented as Ψ_S*(ρ, ω).
• Start with a random tweet as a seed.
• Iteratively keep adding tweets t_i from Ψ_S such that the distortion (in terms of L1-norm) of the entropy of the set, say Ψ_S(i, ω), on addition of t_i is least with respect to the specified diversity measure ω.
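The greedy procedure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the set entropy is assumed to be the Shannon entropy of the summed dimension weights, normalized to [0, 1]; the target set size is assumed to be ceil(ρ · N); and for a scalar entropy the L1 distortion reduces to an absolute difference. All function names and the toy vectors are hypothetical.

```python
import math
import random


def normalized_entropy(vectors):
    """Shannon entropy of the summed dimension weights, scaled to [0, 1]."""
    dims = len(vectors[0])
    totals = [sum(v[d] for v in vectors) for d in range(dims)]
    mass = sum(totals)
    if mass == 0 or dims < 2:
        return 0.0
    h = -sum((x / mass) * math.log(x / mass) for x in totals if x > 0)
    return h / math.log(dims)


def select_tweets(tweets, rho, omega, seed=None):
    """Greedily pick indices of tweets whose set entropy stays closest to omega."""
    if not tweets:
        return []
    rng = random.Random(seed)
    target = max(1, math.ceil(rho * len(tweets)))
    remaining = list(range(len(tweets)))
    # Start with a random tweet as the seed.
    chosen = [remaining.pop(rng.randrange(len(remaining)))]
    while len(chosen) < target and remaining:
        current = [tweets[i] for i in chosen]
        # Add the tweet that minimizes |entropy(set + tweet) - omega|.
        best = min(
            remaining,
            key=lambda i: abs(normalized_entropy(current + [tweets[i]]) - omega),
        )
        remaining.remove(best)
        chosen.append(best)
    return chosen


# Toy tweet vectors over three hypothetical dimensions.
toy = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
chosen = select_tweets(toy, rho=0.5, omega=0.1, seed=0)
```

A low ω steers the selection toward tweets that share the seed's dominant dimension, while a high ω favors tweets that spread weight across dimensions.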


How does this method compare to state‐of‐the‐art techniques?


Twitter Firehose, June 2010: 1.4 billion tweets in total

Quantitative evaluation framework

• We defined a set of baseline techniques using simplified versions of our proposed algorithm:
– Random set (B1)
– Randomly sampled diversity level (B2)
– Equal weighting of tweet dimensions (B3)
– Two further methods were used: “most recent” tweets (MR) and “most tweeted URL” (MTU), meaning the tweets corresponding to URLs that were highly shared in the network


Subjective Evaluation


Cognitive metrics

• Explicit Measures. Explicit measures consisted of three 7-point Likert scale ratings made after reading each tweet set:
– “interestingness”
– “informativeness”
• Implicit Measures.
– Cognitive Engagement [Czerwinski 2001] – ideally, if the information presented in a tweet sample is very engaging, the participant would underestimate the time taken to go through it.
– Recognition Memory for tweets already shown – related to encoding in long-term memory [Sperling 1973, Smith 1979].


User Study…

Hypothesis I. Tweet sets generated by the proposed method will be better than those from baseline methods.


Performance Evaluation


Performance Evaluation (Contd.)


Interactions   Interestingness   Informativeness   Cognitive engagement   Recognition memory
B1 × PM        0.002             0.009             0.007                  0.097
B2 × PM        0.027             0.117             0.011                  0.105
B3 × PM        0.241             0.351             0.138                  0.411
MR × PM        0.0003            <0.0001           0.003                  0.005
MTU × PM       0.061             0.171             0.004                  0.214

Testing for statistical significance: one-tailed paired t-test; significance level p < 0.1. PM denotes the proposed method.
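A one-tailed paired t-test of this kind can be sketched with the standard library alone. The ratings below are hypothetical, not the study's data, and the function names are illustrative. The p-value is approximated by integrating the Student t density with Simpson's rule, which stands in here for a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def one_tailed_p(t, df, steps=2000):
    """Approximate P(T >= t) using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    area = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area *= h / 3
    return 0.5 - area if t >= 0 else 0.5 + area


# Hypothetical 7-point ratings from six participants for two methods.
proposed = [6, 5, 7, 6, 5, 6]
baseline = [4, 5, 5, 4, 4, 5]
t_stat, df = paired_t_statistic(proposed, baseline)
p = one_tailed_p(t_stat, df)
significant = p < 0.1
```

The test is one-tailed because the hypothesis is directional: the proposed method is expected to score higher than each baseline.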

Hypothesis II: Participants will perceive the diversity of sets generated by our method more accurately than that of sets generated by the baselines.


Diversity Perception


           B1             B2             B3             Proposed Method
           d’    Error    d’    Error    d’    Error    d’    Error
ω = 0.1    2.8   20.6%    2.2   11.1%    2.1   8.8%     1.1   7.8%
ω = 0.6    1.7   47.5%    2.9   28.1%    3.3   20.8%    5.4   13.6%
ω = 0.9    5.1   20.6%    5.5   14.6%    6.1   9.5%     6.8   7.3%

Perceived diversity is more accurate for highly heterogeneous and highly homogeneous tweet samples. Diversity perception is better for our proposed method.
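The slides do not state how d’ was computed. As a minimal sketch, assuming the standard signal-detection definition d’ = Z(hit rate) − Z(false-alarm rate), where Z is the inverse of the standard normal cumulative distribution, it can be calculated as below. The function name and the example rates are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate).

    Both rates must lie strictly between 0 and 1; inv_cdf raises an
    error at exactly 0 or 1.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical rates: 90% hits and 20% false alarms.
sensitivity = d_prime(0.90, 0.20)
```

A larger d’ means that participants separate the presented diversity level from the alternatives more reliably; a value of 0 corresponds to chance.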

Hypothesis III: Participants’ responses will be affected by the level of diversity in the various tweet sets shown.


Impact of Diversity on User Response


Participant ratings on different cognitive aspects of information consumption seem to be higher for highly homogeneous and highly heterogeneous information samples.

Conclusions

• Content selection methodologies for large social spaces that incorporate cognitive metrics of content consumption can enable the design of better content exploration interfaces.
– Information diversity is key
– Users appear to cognitively encode information better when presented with samples of high or low diversity


Open Questions


Are there empirical bounds on what degrees of diversity in a sample best suit content consumption?


Does the information space seem to exhibit entropy signatures?


If so, can these entropy signatures guide the content selection methodology more adequately?


Questions?

For details: [email protected]
Web: http://www.public.asu.edu/~mdechoud/
Twitter: @munmun10

Appendix

Qualitative evaluation

@Paramedic_Fla Some oil spill events from Monday, June 7, 2010 http://bit.ly/cRwfXn

@miamiauto Some oil spill events from Monday, June 7, 2010: A summary of events on Monday, June 7, Day 48 of the Gulf of Mexi... http://bit.ly/9HNG9Z

@franklanguage RT @DAYLEE F@CK that! Broken pipe is not NATURAL! RT @RayBeckermanFreedomWorks CEO, Calls Oil Spill Natural Disaster http://bit.ly/coUY4l

@Teasdallqrb Public offers 'helpful' ideas on containing BP oil spill - NEWS.com.au

@_paigenesss RT @TEDchris: A Gulf oil spill picture I will never forget. http://twitpic.com/1toz8a

@LeiaOfAlderaan Citizen Speaks The Truth ON BP Gulf Oil Spill--the Govt, BP Are Doing Nothing, There Are No Leaders Here http://bit.ly/BP-Gulf-Oil-Spill

@Faustinagwlxo WOOW! NO WAY! so brutal! http://ilil.me/h MTV Movie Summer Jam WWDC Oil Spill Xtina Another Cinderella Story

@minxdeluxe RT @OliBarrett: Visualizing the BP Oil Spill http://www.ifitwasmyhome.com/

[Twitter search-alike] Most recent tweets
[Bing-alike] Most tweeted URL-containing tweets

@JosephAGallant Erin Brockovich to meet with fishermen who say oil spill dispersant used by BP made them sick. http://huff.to/aGVWIl #tcot #BP #oilspill

@dixie_patriot Oil spill cap catching about 10,000 barrels a day|LONDON ? BP's oil spill cap, designed to stop a huge leak from .. http://oohja.com/xeWhD

@MoCuad My heart breaks all over again, every time I'm reminded of the oil spill.

@NFGNL Looking for Liability in BP's Gulf Oil Spill: White Collar Watch examines the potential criminal and civil liab.. http://nyti.ms/9lUMaT

@jameelee How You Can Volunteer to Clean Up the Gulf of Mexico Oil Spill http://ow.ly/1V3cu

@conchkid Gulf;Oil Spill Many Federal Judges Have Links To Oil Industry http://bit.ly/9v45UT

@NewsOnGreen BP Oil Spill: Containment Cap To Be Replaced Next Month http://dlvr.it/1WDZ8

@TrinitySaveNeo Citizen Speaks The Truth ON BP Gulf Oil Spill--the Govt, BP Are Doing Nothing, There Are No Leaders Here http://bit.ly/BP-Gulf-Oil-Spill

Proposed Method (user-weighted; ω=0.1; ordered)
Proposed Method (user-weighted; ω=0.6; ordered)
