Talk outline
Information flow through blogs
Information flow through email
Search through email networks
Search within the enterprise
Search in an online community
Implicit Structure and Dynamics of BlogSpace
Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose
• Blog use:
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net
• Blog structure: blog-to-blog linking
• Use + Structure
– Great to track “memes” (catchy ideas)
Approaches and uses of blog analysis
• Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?
• Ranking algorithms that take advantage of transmission patterns
[Figure: popularity vs. time, with spikes labeled “Slashdot Effect” and “BoingBoing Effect”]
Tracking popularity over time
Blogdex, BlogPulse, etc. track the most popular links/phrases of the day
Different kinds of information have different popularity profiles
Products, etc.
Major-news site (editorial content) – back of the paper
[Figure: % of hits received on each day since first appearance (days 5 to 15, fraction 0 to 1), one panel per kind of content, including Slashdot postings and front-page news]
Microscale Dynamics
• What do we need to track specific info ‘epidemics’?
– Timings
– Underlying network
[Figure: blogs b1, b2, b3 ordered by time of infection, t0 to t1]
Microscale Dynamics
• Challenges
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media (email, voice)
– No links
[Figure: blogs b1, b2, b3, ..., bn ordered by time of infection (t0 to t1), with unknown (“??”) sources]
Microscale Dynamics: who is getting info from whom
• Explicit blog-to-blog links (easy)
– Via links are even better
• Implicit/inferred transfer (harder)
– Use ML algorithm for link inference problem
  • Support Vector Machine (SVM)
  • Logistic Regression
– What we can use
  • Full text
  • Blogs in common
  • Links in common
  • History of infection
Visualization
• Zoomgraph tool
– Using GraphViz (by AT&T) layouts
• Simple algorithm
– If single, explicit link exists, draw it
– Otherwise use ML algorithm
  • Pick the most likely explicit link
  • Pick the most likely possible link
• Tool lets you zoom around space, control threshold, link types, etc.
http://www-idl.hpl.hp.com/blogstuff
iRank
Find early sources of good information using inferred information paths or timing
[Figure: blogs b1, b2, b3, b4, b5, ..., bn, distinguishing the true source from a popular site]
iRank Algorithm
• Draw a weighted edge for all pairs of blogs that cite the same URL
• Higher weight for mentions closer together
• Run PageRank
• Control for ‘spam’
[Figure: timeline of infection times, t0 to t1]
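The iRank steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the edge direction (later citer points to earlier citer), the weight formula `1 / (1 + dt / time_scale)`, the `time_scale` parameter, and the damping factor are all assumptions, and the ‘spam’ control step is omitted.

```python
from collections import defaultdict

def build_citation_graph(mentions, time_scale=24.0):
    """mentions: iterable of (blog, url, time_in_hours).
    Returns {later_blog: {earlier_blog: weight}} (direction is an assumption)."""
    by_url = defaultdict(list)
    for blog, url, t in mentions:
        by_url[url].append((t, blog))
    graph = defaultdict(lambda: defaultdict(float))
    for cites in by_url.values():
        cites.sort()
        for i, (t_early, early) in enumerate(cites):
            for t_late, late in cites[i + 1:]:
                if early != late:
                    # Assumed weight: larger when the two mentions are closer in time.
                    graph[late][early] += 1.0 / (1.0 + (t_late - t_early) / time_scale)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a weighted, directed graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            out = graph.get(node, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
            else:
                for other, weight in out.items():
                    new_rank[other] += damping * rank[node] * weight / total
        rank = new_rank
    return rank

mentions = [("b1", "u", 0), ("b2", "u", 2), ("b3", "u", 5),
            ("b1", "v", 1), ("b3", "v", 3)]
ranks = pagerank(build_citation_graph(mentions))
```

In this toy input b1 always cites first, so it collects rank from b2 and b3 and ends up at the top.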
Do Bloggers Kill Kittens?
02:00 AM Friday Mar. 05, 2004 PST Wired publishes:
"Warning: Blogs Can Be Infectious.”
7:25 AM Friday Mar. 05, 2004 PST Slashdot posts:
"Bloggers' Plagiarism Scientifically Proven"
9:55 AM Friday Mar. 05, 2004 PST Metafilter announces
"A good amount of bloggers are outright thieves."
[Figure: Mike's contact network: three co-workers, mom, college friend]
Spread of disease is affected by the underlying network
[Figure: Mike's contact network: three co-workers, mom, college friend]
Spread of computer viruses is affected by the underlying network
Viruses (computer and otherwise) are shared indiscriminately (involuntarily)
Information is passed selectively from one host to another based on knowledge of the recipient’s interests
Difference between information flow and disease/virus spread
[Figure: Mike's contact network: three co-workers, mom, college friend]
Spread of information is affected by its content, potential recipients, and network topology
[Figure: average similarity at the distance (0 to 1.2) vs. distance between personal homepages (0 to 20)]
homophily: individuals with like interests associate with one another
personal homepages at Stanford
The Model:
Decay in transmission probability as a function of the distance m between potential target and originating node
$T(m) = (m+1)^{-\beta}\, T$
[Figure: nodes at distances m=0, m=1, m=2 from the originating node]
power-law implies slowest decay
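A small numeric sketch of the decay model, assuming the power-law form $T(m) = (m+1)^{-\beta} T$, where T is the transmission probability at distance 0 and $\beta$ is a decay exponent. The function name and the example values are illustrative only.

```python
def transmission_probability(base_t, m, beta):
    """Transmission probability at distance m from the originating node,
    assuming power-law decay T(m) = (m + 1) ** (-beta) * T."""
    if m < 0:
        raise ValueError("distance m must be non-negative")
    return base_t * (m + 1) ** (-beta)

# Illustrative values: T = 0.5, beta = 1, distances m = 0, 1, 2.
profile = [transmission_probability(0.5, m, 1.0) for m in range(3)]
```

With `beta = 0` the function returns T at every distance, which is the no-decay case.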
Degree distribution of all senders of email passing through the HP email server
$P(k) = C\, k^{-\alpha} e^{-k/\kappa}$, where k is the outdegree
Virus, information transmission on a scale free network
[Figure: log-log plot of frequency P(k) vs. outdegree; outdegree distribution with a power-law fit of exponent 2.0]
[Figure: critical threshold (0 to 1) vs. a parameter ranging from 1 to 4; three curves whose legend symbols were lost in extraction]
$10^6$ nodes, epidemic if 1% ($10^4$) infected
Pastor-Satorras & Vespignani (2001)
epidemics on scale free graphs
Newman (2002)
Wu et al. (2004)
40 participants (30 within HPL, 10 elsewhere in HP & other orgs)
6370 URLs and 3401 attachments cryptographically hashed
Question: How many recipients in our sample did each item reach?
caveats:
messages are deleted (still, the median number of messages > 2000)
non-uniform sample
Study of the spread of URLs and attachments
[Figure: log-log plot of number of items with so many recipients vs. number of recipients; fits $x^{-4.1}$ for email attachments and $x^{-3.6}$ for URLs; annotated outliers: “short term expense control” and ads at the bottom of Hotmail & Yahoo messages]
average = 1.1 for attachments, and 1.2 for URLs
Results
02/19/2003 15:45:33 I-1 I-2
02/19/2003 15:45:33 I-1 I-3
02/19/2003 15:45:40 E-1 I-4
02/19/2003 15:45:52 I-5 E-2
02/19/2003 15:45:55 E-3 I-6
02/19/2003 15:45:58 I-7 I-8
02/19/2003 15:46:00 E-4 I-9
02/19/2003 15:46:05 I-10 I-11
02/19/2003 15:46:10 I-12 I-13
02/19/2003 15:46:10 I-12 I-14
02/19/2003 15:46:10 I-12 I-15
02/19/2003 15:46:14 I-16 E-5
. . . . . . . .
Simulate transmission on email log
each message has a probability p of transmitting information from an infected individual to the recipient
internal node
external node
Simulation of information transmission on the actual HP Labs email graph
an individual is infected if they receive a particular piece of information
individuals remain infected for 24 hours
start by infecting one individual at random
every time an infected individual sends an email they have a probability p of infecting the recipient
track epidemic over the course of a week, most run their course in 1-2 days
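The simulation rules above can be sketched as a single pass over a time-ordered log. This is a minimal sketch under stated assumptions: timestamps are in seconds, the seed is infected at the time of the first log entry, and nobody is re-infected after recovering. The function and parameter names are hypothetical.

```python
import random

def simulate_spread(log, seed_person, p, infectious_period=24 * 3600, rng=None):
    """log: time-ordered list of (timestamp_seconds, sender, recipient).
    Returns the set of individuals who were ever infected."""
    rng = rng or random.Random()
    start_time = log[0][0] if log else 0
    infected_at = {seed_person: start_time}  # person -> time of infection
    for t, sender, recipient in log:
        t_inf = infected_at.get(sender)
        if t_inf is None or t - t_inf > infectious_period:
            continue  # sender was never infected, or is no longer infectious
        # Assumption: an individual can be infected only once.
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = t
    return set(infected_at)

toy_log = [(0, "a", "b"), (10, "b", "c"), (100000, "c", "d")]
reached = simulate_spread(toy_log, "a", p=1.0)
```

In the toy log, "c" is infected at t = 10 but only sends at t = 100000, after the 24-hour window, so "d" is never reached.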
Introduce a decay in the transmission probability based on the hierarchical distance
$p = p_0\, h^{-1.75}$
[Figure: organizational hierarchy with hops labeled distance 1 and distance 2 between nodes A and B; $h_{AB} = 5$]
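A sketch of how a decayed transmission probability of the form $p = p_0 h^{-1.75}$ might be computed from an org chart. The distance convention here (plain tree path length through the lowest common manager) is an assumption and may differ from the convention used in the slides' figure; the `parent` mapping and both function names are hypothetical.

```python
def hierarchical_distance(parent, a, b):
    """Tree path length between a and b.
    parent maps each person to their manager; the root maps to None."""
    hops_from_a = {}
    node, hops = a, 0
    while node is not None:
        hops_from_a[node] = hops
        node = parent.get(node)
        hops += 1
    node, hops = b, 0
    while node is not None:
        if node in hops_from_a:
            return hops_from_a[node] + hops  # meet at lowest common manager
        node = parent.get(node)
        hops += 1
    raise ValueError("no common ancestor")

def decayed_probability(p0, h, exponent=1.75):
    """Transmission probability p = p0 * h ** (-exponent) for distance h >= 1."""
    if h < 1:
        raise ValueError("hierarchical distance must be at least 1")
    return p0 * h ** (-exponent)

org = {"ceo": None, "m1": "ceo", "m2": "ceo", "a": "m1", "b": "m2"}
h_ab = hierarchical_distance(org, "a", "b")
p_ab = decayed_probability(1.0, h_ab)
```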
7119 potential recipients
[Figure: average size of outbreak or epidemic (0 to 2500) vs. probability of transmission $p_0$ (0 to 1); curves: outbreak w/ decay, epidemic w/ decay, outbreak w/o decay, epidemic w/o decay]
Conclusions on info flow in social groups
Information spread typically does not reach epidemic proportions
Information is passed on to individuals with matching properties
The likelihood that properties match decreases with distance from the source
Model gives a finite threshold
Results are consistent with observed URL & attachment frequencies in a sample
Simulations following real email patterns also consistent
[Figure: map with Nebraska (NE) and Massachusetts (MA) marked]
Milgram’s experiment:
Given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target.
How to search in a small world
Small world experiment at Columbia
Dodds, Muhamad, Watts, Science 301, (2003)
email experiment conducted in 2002
18 targets in 13 different countries
24,163 message chains
384 reached their targets
average path length 4.0
Why study small world phenomena?
Curiosity:
Why is the world small?
How are people able to route messages?
Social Networking as a Business:
Friendster, Orkut, MySpace
LinkedIn, Spoke, VisiblePath
Six degrees of separation - to be expected
Pool and Kochen (1978) - average person has 500-1500 acquaintances
Ignoring clustering, other redundancy …
~ $10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors
But networks are clustered:
my friends’ friends tend to be my friends
Watts & Strogatz (1998) - a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph
How to choose among hundreds of acquaintances?
Strategy:
Simple greedy algorithm - each participant chooses the correspondent who is closest to the target with respect to the given property
Models
geography: Kleinberg (2000)
hierarchical groups: Watts, Dodds, Newman (2001), Kleinberg (2001)
high degree nodes: Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003)
But how are people able to find short paths?
Kleinberg (2000)
nodes are placed on a lattice and connect to nearest neighbors
additional links placed with $f(d) \sim d(u,v)^{-r}$
if r = 2, can search in polylog ($< (\log N)^2$) time
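A minimal sketch of greedy routing on a Kleinberg-style grid. Assumptions: a two-dimensional n x n grid without wraparound, Manhattan distance, and exactly one directed long-range link per node drawn with probability proportional to $d(u,v)^{-r}$. All names are illustrative.

```python
import random

def lattice_distance(u, v):
    """Manhattan distance between two grid points."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def build_kleinberg_grid(n, r=2.0, rng=None):
    """Return {node: [neighbors]} with lattice links plus one long-range link."""
    rng = rng or random.Random()
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        x, y = u
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n]
        others = [v for v in nodes if v != u]
        weights = [lattice_distance(u, v) ** (-r) for v in others]
        # Long-range contact chosen with probability proportional to d^(-r).
        neighbors.append(rng.choices(others, weights=weights, k=1)[0])
        links[u] = neighbors
    return links

def greedy_route(links, source, target):
    """Always forward to the known contact closest to the target."""
    path = [source]
    current = source
    while current != target:
        current = min(links[current], key=lambda v: lattice_distance(v, target))
        path.append(current)
    return path

grid = build_kleinberg_grid(8, r=2.0, rng=random.Random(0))
route = greedy_route(grid, (0, 0), (7, 7))
```

Because some lattice neighbor is always one step closer to the target, each hop reduces the remaining distance by at least one. The route can therefore never be longer than the Manhattan distance, and long-range links only shorten it.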
Spatial search
“The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain”
S. Milgram, ‘The small world problem’, Psychology Today 1, 61 (1967)
Kleinberg: searching hierarchical structures
‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001
Hierarchical network models:
h is the distance between two individuals in a hierarchy with branching b
$f(h) \sim b^{-\alpha h}$
If $\alpha = 1$, can search in O(log n) steps
Group structure models:
q = size of smallest group that two individuals belong to
$f(q) \sim q^{-\alpha}$
If $\alpha = 1$, can achieve search in O(log n) steps
Identity and search in social networks
Watts, Dodds, Newman (2001)
individuals belong to hierarchically nested groups
multiple independent hierarchies coexist
$p_{ij} \sim \exp(-\alpha x)$
Identity and search in social networks
Watts, Dodds, Newman (2001)
There is an attrition rate r
Network is ‘searchable’ if a fraction q of messages reach the target
[Figure: results for N=102400, N=204800, N=409600]
[Figure: Mary, Bob, Jane; “Who could introduce me to Richard Gere?”]
High degree search
Adamic et al., Phys. Rev. E 64, 046135 (2001)
[Figure: log-log plot of cover time for half the nodes vs. size of graph; random walk (fit exponent 0.37) and degree sequence (fit exponent 0.24)]
Scaling of search time with size of graph
Sharp cutoff at k~N1/2nd degree neighbors
Use a well-defined network:
HP Labs email correspondence over 3.5 months
Edges are between individuals who sent at least 6 email messages each way
Node properties specified:
degree
geographical location
position in organizational hierarchy
Can greedy strategies work?
Testing the models on social networks (w/ Eytan Adar)
[Figure: log-log plot of frequency vs. outdegree; outdegree distribution with a power-law fit of exponent 2.0]
Degree distribution of all senders of email passing through the HP email server
Strategy 1: High degree search
Filtered network (6 messages sent each way)
[Figure: distribution p(k) of the number of email correspondents k (0 to 80), on linear and semi-log axes]
450 users
median degree = 10
mean degree = 13
average shortest path = 3
High degree search performance (poor):
median # steps = 16
mean = 40
Degree distribution no longer power-law, but Poisson
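A minimal sketch of a high-degree search strategy on an undirected graph: at each step the message goes to the highest-degree contact that has not yet seen it. The backtracking rule, the tie-breaking, and the step counting (every move, forward or back, counts as one step) are assumptions made for illustration.

```python
def high_degree_search(graph, source, target, max_steps=1000):
    """graph: {node: set(neighbors)}. Returns the number of steps taken to
    reach target, or None if it is not found within max_steps."""
    visited = {source}
    path = [source]  # stack of nodes, used for backtracking
    steps = 0
    while path and steps < max_steps:
        current = path[-1]
        if current == target:
            return steps
        if target in graph[current]:
            return steps + 1  # target is a direct contact: hand it over
        candidates = [v for v in graph[current] if v not in visited]
        if candidates:
            # Prefer the best-connected contact; ties broken by name.
            nxt = max(candidates, key=lambda v: (len(graph[v]), str(v)))
            visited.add(nxt)
            path.append(nxt)
        else:
            path.pop()  # dead end: pass the message back
        steps += 1
    return None

toy = {
    "s": {"a", "h"},
    "a": {"s"},
    "h": {"s", "x", "y", "t"},
    "x": {"h"},
    "y": {"h"},
    "t": {"h"},
}
steps_needed = high_degree_search(toy, "s", "t")
```

In the toy graph the message goes from "s" to the hub "h" (degree 4) rather than to "a" (degree 1), and the hub knows the target directly.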
[Figure: building floor plan with floors labeled 1L, 1U, 2L, 2U, 3L, 3U, 4U]
87% of the 4000 links are between individuals on the same floor
Communication across corporate geography
Cubicle distance vs. probability of being linked
[Figure: log-log plot of proportion of linked pairs vs. distance in feet; curves: measured, 1/r, and $1/r^2$ (optimum for search)]
Finding someone in a sea of cubicles
[Figure: histogram of number of pairs (0 to 16000) vs. number of steps (0 to 20)]
median = 7
mean = 12
Example of search path
[Figure: example search path through the hierarchy, with hops labeled distance 1 and distance 2]
hierarchical distance = 5
search path distance = 4
Probability of linking vs. distance in hierarchy
in the ‘searchable’ regime: $0 < \alpha < 2$ (Watts 2001)
[Figure: probability of linking (0 to 0.6) vs. hierarchical distance h (2 to 10); observed values and fit exp(-0.92*h)]
Results
[Figure: histogram of number of pairs (up to 5 x $10^4$) vs. number of steps in search (0 to 25)]
distance   search      geodesic   org   random
median     4           3          6     28
mean       5.7 (4.7)   3.1        6.1   57.4
[Figure: log-log plot of probability of linking vs. group size g; observed values, fit $g^{-0.74}$, and $g^{-1}$ (optimum for search, Kleinberg 2001)]
Group size and probability of linking
Search Conclusions
Individuals associate on different levels into groups.
Group structure facilitates decentralized search using social ties.
HP Labs as a social network is searchable but not quite optimal.
Searching using the organizational hierarchy is faster than using physical location.
A fraction of ‘important’ individuals are easily findable
Humans may be much more resourceful in executing search tasks:
making use of weak ties
using more sophisticated strategies
PeopleFinder2 – a search engine for HP people
Live Demo
If live demo fails:
Current PeopleFinder functionality
PeopleFinder2 info on a person
Extracted topics for a person
Social network
Social network visualization
Search for individuals by topic
Visualize knowledge network
Find social network paths to experts
Extract & disambiguate names from publicly available documents
Enrich information available about individuals
Search for them by topic
Identify knowledge communities from co-occurrence of names