40
Studying Blogspace Studying Blogspace Ravi Kumar Ravi Kumar IBM Almaden Research Center IBM Almaden Research Center [email protected] [email protected]

Studying Blogspace Ravi Kumar IBM Almaden Research Center [email protected]

  • View
    226

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Studying BlogspaceStudying Blogspace

Ravi KumarRavi Kumar

IBM Almaden Research CenterIBM Almaden Research Center

[email protected]@almaden.ibm.com

Page 2: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

EtymologyEtymologyFrom the OED new ed. (draft entry, Mar 2003) …

blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually.

web¢log n. 2. A frequently updated web site consisting of personal observations, excerpts from other sources, etc., typically run by a single person, and usually with hyperlinks to other sites; an online journal or diary.

From WWW 2003 (Kumar, Novak, Raghavan, Tomkins) …blog¢space n. The collection of weblogs; =

blogosphere, blogsphere, blogistan, …

Page 3: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Blogs 101Blogs 101

• Characteristics– Pages with reverse chronological sequences of dated

entries– Usually contain a persistent sidebar containing profile

(and other blogs read by the author – “blogroll”)– Usually maintained and published by one of the

common variants of public-domain blog software• From Slashdot, 1999

“… a new, personal, and determinedly non-hostile evolution of the electric community. They are also the freshest example of how people use the Net to make their own, radically different new media”

Page 4: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Look and feelLook and feel

• Quirky• Highly personal• Consumed by a small number of regular repeat

visitors• Often updated multiple times each day• Highly interwoven into a network of small but

active micro-communities

• Eg: LiveJournal, Xanga, DeadJournal, Blogger, Memepool, …

Page 5: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

The blog eraThe blog era

• Blogs began in 1996, but exploded in popularity in 1999– Proliferation of authoring tools

• Newsweek 2002 estimates ~500K • LiveJournal 2005 estimates ~3.5M • Annual Blogathon for charity

– Bloggers update their Blogs every 30m for 24h– Sponsors pay …

• Impact of blogs– “Miserable failure” on Google

Page 6: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Structural studyStructural study(Kumar, Novak, Raghavan, Tomkins, CACM 2004)(Kumar, Novak, Raghavan, Tomkins, CACM 2004)

Page 7: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Livejournal blogspaceLivejournal blogspace

• Livejournal.com: popular blog site

• 1.3M bloggers (Feb 2004)

• 3.5M bloggers (Apr 2005)

• Each blogger has a profile– Name, age, …– Geographic information (city, state, zip, …)– Friends and friend of– Interests/communities

Page 8: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Eg, LiveJournal user “bill”Eg, LiveJournal user “bill”

Page 9: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

LJ bloggers in USLJ bloggers in US

< 1K< 5K< 10K< 25K< 50K~ 100K

Page 10: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

LJ bloggers world-wideLJ bloggers world-wide

< 1K< 2K< 5K~ 25K~ 50K~ 75K

Page 11: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Who are they?Who are they?Age % Representative interests

Page 12: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Friendship graphFriendship graph• Directed• 80% mutual• Average degree ~ 14• Power law degrees• Clustering coeff. ~ 0.2• Most friendships

explained by age, location, interest

Age 1%

Location20% Interest

16%

5%

16%

22%

Page 13: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Evolutionary studyEvolutionary study (Kumar, Novak, Raghavan, Tomkins, WWW 2003)(Kumar, Novak, Raghavan, Tomkins, WWW 2003)

Page 14: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Blogs and evolutionBlogs and evolution

• Every blog contains a dated record of– Every word ever written to the blog– Every link ever added in the blog

• Blogs are an increasingly important medium, but– Few systematic studies have been performed– Such study should take an evolutionary perspective

[Brewington et al] [Bharat et al] [Fetterly et al] [Cho et al]

– Tools for understanding evolution not fully understood

Page 15: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Time graphsTime graphsv1 v2

v3

v4

time

Jan

Feb

Mar

Apr

May

Jun

Jul

Aug

v1 v2

v3 v4

Underlying graph Time graph

Page 16: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Community evolution in blogsCommunity evolution in blogs

• What are the communities within the time graph? – Community definition, extraction– Graph-based methods (trawling)

[Kumar Raghavan Rajagopalan Tomkins, WWW 99]

• How active are these communities, and over what timeframe?– Burst analysis [Kleinberg, KDD 02]

Page 17: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Community extractionCommunity extraction• Community analysis based

on graph structure• Idea: there are many

subgraphs that would never occur in a random graph – if we find such subgraphs, there must be some reason

• In blogspace, we enumerate dense subgraphs using a greedy heuristic

Page 18: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Dense subgraph enumerationDense subgraph enumeration(heuristic)(heuristic)

• Scan edges, find triangles• For each triangle, greedily grow its neighbor set• Growth is allowable based on a measure of

connectivity to the current dense subgraph• Extracted “communities” are not unique

Current size (N) 2 <=6 <=9 <=20 >20

Must connect to 2 N-1 N-2 0.7N 0.6N

Page 19: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Bursts: Static to dynamic Bursts: Static to dynamic communitiescommunities

• Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”

• Model source as multi-state• Each state has certain emission properties• Traversal between states is controlled by a

Markov model• Determine most likely underlying state

sequence over time, given observable output

Page 20: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

An exampleAn example

Time

I’ve been thinking about your idea with

the asparagus…

Uh huh I think I see…

Uh huh Yeah, that’s what I’m saying…

So then I said “Hey, let’s give

it a try”

And anyway she said

maybe, okay?

0.0051 2

0.01State 1:Output rate: very low

State 2:Output rate: very high

1 1 1 1 2 2 2

Most likely “hidden” sequence

Page 21: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Some experimentsSome experiments

• Crawled 24,109 blogs from popular sites (2003)• Extract archive links from blogs• Extract all dates on blog pages, and tag each word

and link with a date

– Simple heuristics to automatically extract time-stamps

from entries (regular expressions, training, …)

– Obtained dates for ~90% of edges

Page 22: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Experiments (contd.)Experiments (contd.)

• The time graph– 22,299 nodes, 70,472 unique edges– 0.77M multiedges (average edge multiplicity = 11)

• Consider graphs formed by prefixes from Jan 1, 1999 to some later month – generate 47 “prefix graphs” for analysis

• Enumerate communities and analyze their burstiness

Page 23: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

SCC growthSCC growth

Largest SCC as fraction of all nodes

2nd and 3rd largest SCCs as fraction of all nodes

Page 24: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Connectivity in BlogspaceConnectivity in Blogspace

Fraction of nodes participatingIn some community

Number of communities

Number of nodes participating in a community

Page 25: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Burstiness of communitiesBurstiness of communities

Number of communities in “high state” during each time period

Page 26: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Are these results fluke?Are these results fluke?

• “Randomized Blogspace”: A distribution over time graphs that look very much like the time graph of Blogspace, but remove some of the locality of the true graph

• Vertices and edges arrive at the same times, each edge has the same source, but a randomly-chosen destination

• If randomized blogspace behaves like blogspace, then community structure is a fake

Page 27: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

SCC evolutionSCC evolution

Blogspace

Randomized Blogspace

Randomized Blogspace formsan SCC much earlier

Page 28: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Community Community evolutionevolution

Blogspace

Randomized Blogspace

Blogspace has manymore communities

Page 29: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Exogenous eventsExogenous events

Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace

Number of blog pages that belong to a community

Number of blog communities

Wired magazine publishes an article on weblogs that impacts the tech community

NewsWeek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption

Page 30: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Some questions …Some questions …

• Modeling– Edge arrivals– `Interesting’ events

• Algorithms– Prediction– Information percolation – Search– o(t ¢ T(n))

• Studies– Sociological – Effect on search and ranking

Page 31: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Prediction via blogsPrediction via blogs (Gruhl, Guha, Kumar, Novak, Tomkins, 2005)(Gruhl, Guha, Kumar, Novak, Tomkins, 2005)

Page 32: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Blogs as trend indicatorsBlogs as trend indicators

• Can blogs be used to predict trends?

• Data– Amazon sales rank of some books– Blog chatter in an index

• Questions– How well do they correlate?– Can sales rank be predicted using blogs?

Page 33: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

The Lance Armstrong Performance The Lance Armstrong Performance ProgramProgram

Query: Lance ArmstrongOR Tour de France

Page 34: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Vanity FairVanity Fair

Page 35: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Cross-correlation for Lance Cross-correlation for Lance ArmstrongArmstrong

Page 36: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Simple inferencesSimple inferences

• How to formulate queries automatically– Depends on the object (book, movie, …)– Simple heuristics work well

• Predicting sales motion is hard

• Predicting spikes appears relatively easier

• More to be done …

Page 37: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Blogs and social networksBlogs and social networks (Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)(Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)

Page 38: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

Social networksSocial networks

• Blog friendship graph is a social network

• Is there a simple model to describe this network?

• Desiderata– Fit experimental observations– Exhibit “six-degrees of separation”– Theoretically tractable

Page 39: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

RBF: RBF: Rank-Based FriendshipRank-Based Friendship

• Population network model• Each person has a geographic location• d(¢, ¢) = measures geographic distance• rankA(B) = #{ C : d(A, C) < d(A, B) }• Pr[A “befriends” B] / 1/rankA(B)

– Independent of distance– Works with arbitrary population densities

• Plus local links to neighbors

Page 40: Studying Blogspace Ravi Kumar IBM Almaden Research Center ravi@almaden.ibm.com

RBF: Preliminary resultsRBF: Preliminary results

• Fits LiveJournal friendship experimental graph data (using geo data in the profile)

• Greedy routing: Is able to route messages from source to destination most of the time, just using geographic information

• Theoretical analysis: Can show that this model guarantees geographic routing to work