Mining and Analyzing Online Social Graph Data Drew Conway New York University, Dept. of Politics May 13, 2010

Mining Social Graph Data


DESCRIPTION

Drew Conway, PhD student in NYU's Department of Politics, provides an introduction to mining social graph data from the Internet that focuses on the technical, substantive and ethical concerns related to this type of analysis.


Mining and Analyzing Online Social Graph Data

Drew Conway

New York University, Dept. of Politics

May 13, 2010

Agenda

Network basics
- Unit of analysis
- Data representation
- Analysis & Visualization

Thoughts on research design
- Edge contexts
- A bit on social graph data and ethics

Digging into the data
- Where to get it
- Our first scrape and build

Real-time social graph analysis
- Build the network of Twitter users in the room
- Hold our breath

Live demo setup

Last week Bruno asked how many of you use Twitter.

- 15 (54%) members said yes, which may be enough to do some interesting stuff

Please tweet the following hash-tag:

#analyticsnyc

Some example tweets:

- Many thanks to the @GiltGroupe for hosting tonight's #analyticsnyc meetup
- Hanging out with @DKALab and @brunocm at #analyticsnyc
- I cannot wait to watch @drewconway crash and burn with his #analyticsnyc live demo


Networks and the study of relationships

Network theory uses the language of graph theory:

G = {V, E}, where V is the set of nodes (vertices) and E is the set of edges

In the abstract this is very powerful: we have a general-purpose way to represent any number of relations.

- G = {Routers, Packet Traffic}: edges have non-binary values
- G = {US Airports, Commercial Routes}: edges can represent distance, cost, frequency, etc.
- G = {New York City nerds, Co-membership in Meetups}: edges are implied

While both nodes and edges are needed to have a network of any substance, it is the edge (relationship) that will always be the primary focus of our analyses.
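The "edges are implied" case can be made concrete with a small sketch. It assumes membership data arrives as a mapping from group to members; the group and member names are made up for illustration, and `comembership_edges` is a hypothetical helper, not something from the talk.

```python
from itertools import combinations

# Hypothetical membership data: Meetup group -> members (all names invented).
memberships = {
    "analytics": ["ann", "bob", "cat"],
    "python": ["bob", "cat"],
    "rstats": ["ann", "dan"],
}

def comembership_edges(groups):
    """Return {(u, v): shared_group_count} for every pair sharing a group."""
    weights = {}
    for members in groups.values():
        # Sort and de-duplicate so each undirected pair has a single key.
        for u, v in combinations(sorted(set(members)), 2):
            weights[(u, v)] = weights.get((u, v), 0) + 1
    return weights

edges = comembership_edges(memberships)
```

No edge is observed directly here: a tie between two people exists only because they appear in the same group, and the count of shared groups gives a natural non-binary edge value.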


Why focus on the edge?

Consider a very simple example of two people meeting for the first time.

If we focus on the nodes, we might consider that this meeting creates a dyad.

We know, however, that people do not exist as isolates, and this assumption ignores all of the exogenous social structure these individuals bring with them.

In reality, the meeting may reveal a large degree of shared structure.

By expressing this meeting in terms of the nodes as a function of their edges, we may gain a much richer understanding of the structural dynamics.


Representing network data: the canonical

Perhaps the most natural way to represent the relationships between N actors is with an NxN matrix, often referred to as a "sociomatrix".

       X1   X2   ...   XN
X1      0    1   ...    0
X2      1    0   ...    1
...
XN      0    1   ...    0

- Very intuitive representation
- Can support directional and weighted edges
- Application of matrix algebraic operations for analysis

As an introduction to representing relationships, a matrix provides insight into a network by itself. In practice, however, this representation has many limitations:

- Unwieldy as network sizes scale up
- Most networks are sparse: too many zeroes
- Difficult to publish and share

Fortunately, there are many other options for representing network data.
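As a minimal sketch of the matrix representation and its sparsity problem, the following builds a sociomatrix from index pairs using plain nested lists. The function names and the three-actor example are assumptions made for illustration.

```python
def sociomatrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix from (source, target) index pairs."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1
        if not directed:
            matrix[j][i] = 1  # mirror the tie for an undirected network
    return matrix

def share_of_zeros(matrix):
    """Fraction of cells equal to zero, a rough measure of sparsity."""
    cells = [cell for row in matrix for cell in row]
    return cells.count(0) / len(cells)

# Three actors with ties 0-1 and 1-2 (zero-based indices).
X = sociomatrix(3, [(0, 1), (1, 2)])
zero_share = share_of_zeros(X)
```

The matrix always stores N x N cells regardless of how many ties exist, which is why the share of zeros grows quickly as sparse networks scale up.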


Representing network data: the practical

Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data.

Edge list

A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target.

1 2
1 3
2 1
6 7
6 8
10 12
10 13

Adjacency list

Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes.

1 2 3
2 1
6 7 8
10 12 13

While these are the most universal data formats, network data representation is a bit of a cottage industry:

- Pajek (.net)
- GraphML (.gml)
- GraphViz (.dot)
- and domain-specific formats (eek!)
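The two text formats carry the same information, which a short conversion makes clear. This sketch reuses the edge list from the slide; the helper names are hypothetical, and it assumes whitespace-delimited input with one edge per line.

```python
edge_list_text = """\
1 2
1 3
2 1
6 7
6 8
10 12
10 13
"""

def edge_list_to_adjacency(text):
    """Group targets by source, keeping first-seen order."""
    adjacency = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        adjacency.setdefault(source, []).append(target)
    return adjacency

def format_adjacency(adjacency):
    """Render one 'source target target ...' line per source node."""
    return "\n".join(" ".join([s] + ts) for s, ts in adjacency.items())

adjacency_text = format_adjacency(edge_list_to_adjacency(edge_list_text))
```

Printing `adjacency_text` gives the four adjacency-list lines shown on the slide.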


A bit on tools

The number of software suites and packages available for conducting social network analysis has exploded over the past ten years.

- In general, this software can be categorized in two ways:
  - Type: many SNA tools are developed to be standalone applications, while others are language-specific packages
  - Intent: consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

            Standalone Apps                Modules & Packages
Basic       - ORA (Windows)                - libSNA (Python)
            - Analyst Notebook (Windows)   - UrlNet (Python)
            - KrakPlot (Windows)           - NodeXL (MS Excel)
Advanced    - UCINet (Windows)             - NetworkX (Python)
            - Pajek (Multi)                - JUNG (Java)
            - Network Workbench (Multi)    - igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.

What I use


Comparing two network metrics to find key actors

Often, network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:

- Degree (number of connections)
- Betweenness (number of shortest paths an actor is on)
- Closeness (relative distance to all other actors)
- Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actors' scores for eigenvector centrality (EC) versus betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any actors that fall far from the linear trend will be of note.

- An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
- Likewise, an actor with low betweenness but high EC may have unique access to central actors
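To make two of the listed metrics concrete, here is a minimal pure-Python sketch of degree and closeness on an undirected adjacency dictionary. The toy network and function names are assumptions, and closeness is taken as (n - 1) divided by the sum of shortest-path distances, which is one common normalization and only meaningful for a connected graph.

```python
from collections import deque

def degree(adj):
    """Number of neighbours per node."""
    return {node: len(neigh) for node, neigh in adj.items()}

def closeness(adj):
    """(n - 1) / sum of shortest-path distances, for a connected graph."""
    scores = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search outward from start
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        scores[start] = (len(dist) - 1) / total if total else 0.0
    return scores

# Hypothetical toy network: the path a - b - c.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
deg = degree(toy)
clo = closeness(toy)
```

In this toy path the middle actor scores highest on both metrics; in practice a library such as NetworkX or igraph would be used, and betweenness and eigenvector centrality need more machinery than fits in a short sketch.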


First, visualize the data

Visualization can be the best first step in the analytical process.

- This will give you a good feel for what is going on with your relationships
- For large networks this is often not possible
- For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
- Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout


Finding Key Actors with R

The first steps are to load the data into memory and perform some basic centrality analysis.[1]

Load the data into igraph

library(igraph)
G<-read.graph("drug_main.txt",format="edgelist")
# By default igraph inputs edgelist data as a directed graph. In this step we
# undo this and assume that all relationships are reciprocal
G<-as.undirected(G)

Store metrics in new data frame

# evcent returns lots of data associated with the EC, but we only need the
# leading eigenvector
cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
# We will use the residuals in the next step
res<-lm(eig~bet,data=cent)$residuals
cent<-transform(cent,res=res)

[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
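The residual step is ordinary least squares with one predictor. As a rough Python illustration of what `lm(eig~bet)$residuals` computes, the sketch below fits the line by hand; the five scores are invented for the example and are not the Hartford data, and `ols_residuals` is a hypothetical helper.

```python
def ols_residuals(x, y):
    """Residuals of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical betweenness and eigenvector scores for five actors.
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.1, 0.2, 0.9, 0.4, 0.5]
res = ols_residuals(bet, eig)

# The actor with the largest absolute residual is the candidate key actor.
outlier = max(range(len(res)), key=lambda i: abs(res[i]))
```

The third actor has far more eigenvector centrality than its betweenness predicts, so it stands out with the largest residual.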


Finding Key Actors with R

Plot the data

library(ggplot2)
# We use ggplot2 to make things a bit prettier
p<-ggplot(cent,aes(x=bet,y=eig,label=rownames(cent),colour=res,size=abs(res)))+xlab("Betweenness Centrality")+ylab("Eigenvector Centrality")
# We use the residuals to color and size the points of our plot, making it
# easier to spot outliers
# (newer ggplot2 releases use ggtitle() in place of opts(title=...))
p+geom_text()+opts(title="Key Actor Analysis for Hartford Drug Users")
# We use the geom_text function to plot the actors' IDs rather than points,
# so we know who is who

[Figure: Key Actor Analysis for Hartford Drug Users. Actor IDs (1 to 194) plotted with Betweenness Centrality on the x-axis (0 to 6000) and Eigenvector Centrality on the y-axis (0.0 to 1.0). Label colour shows res (-0.2 to 0.6) and label size shows abs(res) (0.1 to 0.7).]


Key Actor Plot

[Figure: network plot with the key actors labelled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155]

Network plot

# Create positions for all of the nodes w/ force directed
l<-layout.fruchterman.reingold(G,niter=500)
# Set the nodes' size relative to their residual value
V(G)$size<-abs(res)*10
# Only display the labels of key players
nodes<-as.vector(V(G)+1)
# Key players defined as having a residual value >.25
nodes[which(abs(res)<.25)]<-NA
# Save plot as PDF
pdf('actor_plot.pdf',pointsize=7)
plot(G,layout=l,vertex.label=nodes,vertex.label.dist=0.25,vertex.label.color='red',edge.width=1)
dev.off()


Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships.

- Depending on the source and context of the data, these relationships can be interpreted as many different things
- Must consider the data generation process by which the edge was created
- With respect to online social graph data, ask: how do people use this service?

Twitter

A recent study shows Twitter is used as a news aggregation service: me → interests → you

- Geography and history less important
- Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

- Offline relationships are driving "friending"
- Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

- The combining of multiple platforms into a single "network" makes analysis and interpretation difficult


Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want or realize that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet


Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: extremely private information, questionable methods

PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing localization information
- Problem: useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: while no names were published, bordering on vigilantism


Where to get social graph data

Recently there has been an explosion of resources for scraping social graph data.

Service          Data                                       API Docs
Twitter          Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
Google           All SocialGraph relationships              http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                         http://foursquare.com/developers
Hunch            "Taste graph", recommendations             http://hunch.com/developers
New York Times   Congressional votes, campaign finance      http://developer.nytimes.com/docs

There is clearly no shortage of data.

- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups


Building the social network among LiveJournal users

Using a "seed" user, we will build out a network.

- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
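The snowball idea can be sketched independently of any particular API. In the version below, `neighbors` is a hypothetical callable standing in for a lookup against a social graph service, and k is assumed to count rounds of expansion, so k = 1 collects only the seed's own ties; the friend lists are invented.

```python
def snowball(neighbors, seed, k):
    """Collect nodes and directed edges found within k rounds from seed."""
    seen = {seed}
    frontier = [seed]
    edges = set()
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for friend in neighbors(node):
                edges.add((node, friend))
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)  # expand these next round
        frontier = next_frontier
    return seen, edges

# Hypothetical friend lists; a real run would query the service instead.
toy_friends = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": ["d"], "d": []}

def lookup(node):
    return toy_friends.get(node, [])

nodes_k1, edges_k1 = snowball(lookup, "seed", 1)
nodes_k2, edges_k2 = snowball(lookup, "seed", 2)
```

Each extra round queries every newly discovered node, so the number of lookups grows with the size of the frontier; that is one reason to keep k low.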


The code: part 1

Loading the libraries and setting things up

# Python 2 code
from cjson import *
from urllib import *
from networkx import *
from time import *
from scipy import array,unique

# get_sg and create_egonet must be defined before this block runs
if __name__ == "__main__":
    seed_url="http://imichaeldotorg.livejournal.com"
    sg=get_sg(seed_url)
    net,newnodes=create_egonet(sg)
    info(net)

Name: ['http://imichaeldotorg.livejournal.com']
Type: DiGraph
Number of nodes: 5
Number of edges: 5
Average in degree: 1.0
Average out degree: 1.0

Get the JSON from SocialGraph

def get_sg(seed_url):
    # edo/edi request outbound and inbound edges for the seed URL
    sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
    try:
        furl=urlopen(sgapi_url)
        fr=furl.read()
        furl.close()
        return fr
    except IOError:
        print "Could not connect to website"
        print sgapi_url
        return


Build egonet and snowball

Creating the egonet

def create_egonet(s):
    try:
        raw=decode(s)
        G=DiGraph()
        pendants=[]
        n=raw['nodes']
        nk=n.keys()
        G.name=str(nk)
        # Connect every pair of URLs returned for the seed itself
        for a in range(0,len(nk)):
            for b in range(0,len(nk)):
                if a!=b:
                    G.add_edge(nk[a],nk[b])
        for k in nk:
            ego=n[k]
            # Outbound ties: nodes the ego references
            ego_out=ego['nodes_referenced']
            for o in ego_out:
                G.add_edge(k,o)
                pendants.append(o)
            # Inbound ties: nodes that reference the ego
            ego_in=ego['nodes_referenced_by']
            for i in ego_in:
                G.add_edge(i,k)
                pendants.append(i)
        pendants=array(pendants,dtype=str)
        pendants.flatten()
        pendants=unique(pendants)
        return G,pendants
    except DecodeError:
        pass  # handler body not shown in the source
    except KeyError:
        pass  # handler body not shown in the source

Rolling the snowball

def snowball_round(G,seeds,myspace=False):
    t0=time()
    if myspace:
        seeds=get_myspace_url(seeds)
    sb_data=[]
    for s in range(0,len(seeds)):
        s_sg=get_sg(seeds[s])
        new_ego,pen=create_egonet(s_sg)
        for p in pen:
            sb_data.append(p)
        # Merge each new egonet into the growing snowball network
        if s<1:
            sb_net=compose(G,new_ego)
        else:
            sb_net=compose(new_ego,sb_net)
        del new_ego
        if s==round(len(seeds)*0.2):
            sb_net.name='20% complete'
            sb_net.info()
            print 'AT '+strftime('%m%d%Y %H%M%S', gmtime())
            print ''
        # More time keeping, probably a MUCH better way to do this
    sb_data=array(sb_data)
    sb_data.flatten()
    sb_data=unique(sb_data)
    sb_net.info()
    return sb_net,sb_data


Build the whole network

Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3     4938    8659           3.5   3.6 × 10^-4

- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
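The summary columns can be approximated from the node and edge counts alone. The sketch below assumes density is the directed-graph ratio m / (n(n - 1)) and mean degree is 2m / n; these formulas are an assumption about how the table was computed, though they do reproduce the seed row exactly.

```python
def directed_density(nodes, edges):
    """Share of possible directed ties that are present: m / (n * (n - 1))."""
    return edges / (nodes * (nodes - 1))

def mean_degree(nodes, edges):
    """Average total degree: every edge touches two endpoints."""
    return 2 * edges / nodes

# (nodes, edges) for each step, taken from the table.
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}
summary = {
    name: (mean_degree(n, m), directed_density(n, m))
    for name, (n, m) in steps.items()
}
```

Density falls by roughly three orders of magnitude between the seed and k = 3 while mean degree barely moves, which is the usual signature of a sparse network growing outward.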


The full network

To get a feeling for the size of the full network


Live demo

Live demonstration time


I Seed imichaeldotorglivejournalcomI k = 3

I Note the low value of k

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

The code part 1

Loading the libraries and setting things up

from cjson import

from urllib import

from networkx import

from time import

from scipy import arrayunique

if __name__ == __main__

seed_url=lsquolsquohttpimichaeldotorglivejournalcom

sg=get_sg(seed_url)

netnewnodes=create_egonet(sg)

info(net)

Name [lsquohttpimichaeldotorglivejournalcomrsquo]

Type DiGraph

Number of nodes 5

Number of edges 5

Average in degree 10

Average out degree 10

Get the JSON from SocialGraph

def get_sg(seed_url)

sgapi_url=httpsocialgraphapisgooglecomlookupq=+seed_url+ampedo=1ampedi=1ampfme=1amppretty=0

try

furl=urlopen(sgapi_url)

fr=furlread()

furlclose()

return fr

except IOError

print Could not connect to website

print sgapi_url

return

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Build egonet and snowball

Creating the egonet

def create_egonet(s)

try

raw=decode(s)

G=DiGraph()

pendants=[]

n=raw[rsquonodesrsquo]

nk=nkeys()

Gname=str(nk)

pendants=[]

for a in range(0len(nk))

for b in range(0len(nk))

if a=b

Gadd_edge(nk[a]nk[b])

for k in nk

ego=n[k]

ego_out=ego[rsquonodes_referencedrsquo]

for o in ego_out

Gadd_edge(ko)

pendantsappend(o)

ego_in=ego[rsquonodes_referenced_byrsquo]

for i in ego_in

Gadd_edge(ik)

pendantsappend(i)

pendants=array(pendantsdtype=str)

pendantsflatten()

pendants=unique(pendants)

return Gpendants

except DecodeError

except KeyError

Rolling the snowball

def snowball_round(Gseedsmyspace=False)

t0=time()

if myspace

seeds=get_myspace_url(seeds)

sb_data=[]

for s in range(0len(seeds))

s_sg=get_sg(seeds[s])

new_egopen=create_egonet(s_sg)

for p in pen

sb_dataappend(p)

if slt1

sb_net=compose(Gnew_ego)

else

sb_net=compose(new_egosb_net)

del new_ego

if s==round(len(seeds)02)

sb_netname=rsquo20 completersquo

sb_netinfo()

print rsquoAT rsquo+strftime(rsquomdY HMSrsquo gmtime())

print rsquorsquo

More time keeping probably a MUCH better way to do this

sb_data=array(sb_data)

sb_dataflatten()

sb_data=unique(sb_data)

sb_netinfo()

return sb_netsb_data

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Build the whole network

Step Nodes Edges Mean Degree Density

Seed 5 5 20 025k = 2 75 115 30 002

k = 3 4938 8659 35 36(10minus4)

I Our seed is abnormally isolated with only fourneighbors

I Large jump after first snowball

I Massive structural leap at k = 3

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

The full network

To get a feeling for the size of the full network

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Live demo

Live demonstration time

Drew Conway Mining and Analyzing Online Social Graph Data

  • Network basics
    • The study of relationships
    • Data representation
    • Analysis amp Visualization
      • Research design
        • Edge context
        • Network data and ethics
          • Digging into data
            • Getting data
            • First scrape and build
              • Live demonstration

Live demo setup

Last week Bruno asked how many of you use Twitter.

• 15 (54%) members said yes, which may be enough to do some interesting stuff

Please tweet the following hash-tag:

#analyticsnyc

Some example tweets:

• Many thanks to the @GiltGroupe for hosting tonight's #analyticsnyc meetup
• Hanging out with @DKALab and @brunocm at #analyticsnyc
• I cannot wait to watch @drewconway crash and burn with his #analyticsnyc live demo


Networks and the study of relationships

Network theory uses the language of graph theory:

G = {V, E}

where V is the set of nodes (vertices) and E is the set of edges.

In the abstract this is very powerful: we have a general-purpose way to represent any number of relations.

• G = {Routers, Packet Traffic}: edges have non-binary values
• G = {US Airports, Commercial Routes}: edges can represent distance, cost, frequency, etc.
• G = {New York City nerds, Co-membership in Meetups}: edges are implied

While both nodes and edges are needed to have a network of any substance, it is the edge (relationship) that will always be the primary focus of our analyses.
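As a minimal sketch of the G = {V, E} idea (not code from the talk), the airport example can be written in plain Python as a set of nodes plus a dictionary of weighted edges. The airport codes and distances are illustrative placeholders, not real data, and `neighbors` is a hypothetical helper.

```python
# Nodes (V): a hypothetical handful of US airports.
V = {"JFK", "ORD", "LAX"}

# Edges (E): commercial routes, each carrying a non-binary value.
# The distances are made-up illustrative numbers, not real data.
E = {
    ("JFK", "ORD"): 740,
    ("ORD", "LAX"): 1745,
    ("JFK", "LAX"): 2475,
}

G = (V, E)


def neighbors(graph, node):
    """Return the nodes that share an (undirected) edge with `node`."""
    _, edges = graph
    found = set()
    for (u, v) in edges:
        if u == node:
            found.add(v)
        elif v == node:
            found.add(u)
    return found


print(sorted(neighbors(G, "ORD")))
```

Keeping the weights on the edges rather than on the nodes mirrors the point above: the relationship carries the information.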


Why focus on the edge?

Consider a very simple example of two people meeting for the first time.

If we focus on the nodes, we might consider that this meeting creates a dyad.

We know, however, that people do not exist as isolates, and this assumption ignores all of the exogenous social structure these individuals bring with them.

In reality, the meeting may reveal a large degree of shared structure.

By expressing this meeting in terms of the nodes as a function of their edges, we may gain a much richer understanding of the structural dynamics.
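One way to make the contrast concrete is to compare the bare dyad with the ties each person already holds. The sketch below assumes a hypothetical `friends` mapping and invented names; it only illustrates the idea of shared structure.

```python
# Hypothetical friendship data: each person maps to the set of people they know.
friends = {
    "ann": {"carl", "dina", "eve"},
    "bob": {"dina", "eve", "frank"},
}


def shared_structure(adjacency, a, b):
    """Return the acquaintances that a and b have in common."""
    return adjacency.get(a, set()) & adjacency.get(b, set())


# Node view: the meeting only adds a single dyad.
dyad = ("ann", "bob")

# Edge view: the same meeting exposes the ties both people bring with them.
common = shared_structure(friends, *dyad)
print(sorted(common))
```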


Representing network data: the canonical

Perhaps the most natural way to represent the relationships between N actors is with an N×N matrix, often referred to as a "sociomatrix".

          X1   X2   ...   XN
    X1     0    1   ...    0
    X2     1    0   ...    1
    ...
    XN     0    1   ...    0

• Very intuitive representation
• Can support directional and weighted edges
• Application of matrix algebraic operations for analysis

As an introduction to representing relationships, a matrix provides insight into a network by itself. In practice, however, this representation has many limitations:

• Unwieldy as network sizes scale up
• Most networks are sparse: too many zeroes
• Difficult to publish and share

Fortunately, there are many other options for representing network data.
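The sparsity problem is easy to see in a small sketch. The helpers below (`sociomatrix` and `zero_fraction`, both hypothetical names) build a 0/1 matrix from directed ties and report how much of it is empty; the five-actor network is invented for illustration.

```python
def sociomatrix(n, edges):
    """Build an n x n 0/1 matrix from directed (source, target) pairs, 0-indexed."""
    matrix = [[0] * n for _ in range(n)]
    for source, target in edges:
        matrix[source][target] = 1
    return matrix


def zero_fraction(matrix):
    """Share of cells equal to zero; high values indicate a sparse network."""
    cells = [cell for row in matrix for cell in row]
    return cells.count(0) / len(cells)


# A hypothetical 5-actor network with 4 directed ties.
edges = [(0, 1), (1, 0), (1, 2), (3, 4)]
X = sociomatrix(5, edges)
for row in X:
    print(row)
print(zero_fraction(X))
```

Even in this tiny case most cells are zero, and the matrix grows with N squared while the number of ties usually does not.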


Representing network data: the practical

Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data.

Edge list

A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target.

    1 2
    1 3
    2 1
    6 7
    6 8
    10 12
    10 13

Adjacency list

Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes.

    1 2 3
    2 1
    6 7 8
    10 12 13

While these are the most universal data formats, network data representation is a bit of a cottage industry:

• Pajek (.net)
• GraphML (.gml)
• GraphViz (.dot)
• and domain-specific formats (eek)
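The two text formats carry the same information, which a short conversion makes clear. This is a sketch using only the standard library; `edge_list_to_adjacency` is a hypothetical helper name, and the input is the edge list shown on the slide.

```python
from collections import OrderedDict

edge_list_text = """1 2
1 3
2 1
6 7
6 8
10 12
10 13"""


def edge_list_to_adjacency(text):
    """Group whitespace-delimited 'source target' lines by source node."""
    adjacency = OrderedDict()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        adjacency.setdefault(source, []).append(target)
    return adjacency


adjacency = edge_list_to_adjacency(edge_list_text)
adjacency_lines = [" ".join([src] + targets) for src, targets in adjacency.items()]
print("\n".join(adjacency_lines))
```

The printed lines match the adjacency list on the slide, one source per line followed by its adjacent nodes.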


A bit on tools

The number of software suites and packages available for conducting social network analysis has exploded over the past ten years.

• In general, this software can be categorized in two ways:
  • Type: many SNA tools are developed to be standalone applications, while others are language-specific packages
  • Intent: consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

                Standalone Apps               Modules & Packages
    Basic       ORA (Windows)                 libSNA (Python)
                Analyst Notebook (Windows)    UrlNet (Python)
                KrakPlot (Windows)            NodeXL (MS Excel)
    Advanced    UCINet (Windows)              NetworkX (Python)
                Pajek (Multi)                 JUNG (Java)
                Network Workbench (Multi)     igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.


Comparing two network metrics to find key actors

Often network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:

• Degree (number of connections)
• Betweenness (number of shortest paths an actor is on)
• Closeness (relative distance to all other actors)
• Eigenvector centrality (leading eigenvector of the sociomatrix)

One method for using these metrics to identify key actors is to plot actors' scores for eigenvector centrality (EC) versus betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any actors who fall far from that line will be of note.

• An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
• Likewise, an actor with low betweenness but high EC may have unique access to central actors
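To show what two of these metrics compute, here is a small standard-library sketch of degree and eigenvector centrality. It is not the talk's code: the four-node network is invented, and the power iteration adds each node's own score at every step (iterating on A + I), an assumption made here so the iteration also settles on bipartite graphs.

```python
import math

# A small hypothetical undirected network: a triangle (a, b, c) plus a pendant d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def degree(adj):
    """Degree centrality as a raw count of connections."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Approximate the leading eigenvector by power iteration on A + I."""
    scores = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores, then normalize.
        updated = {n: scores[n] + sum(scores[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(x * x for x in updated.values()))
        updated = {n: x / norm for n, x in updated.items()}
        change = sum(abs(updated[n] - scores[n]) for n in adj)
        scores = updated
        if change < tol:
            break
    return scores


deg = degree(adjacency)
eig = eigenvector_centrality(adjacency)
for node in sorted(adjacency):
    print(node, deg[node], round(eig[node], 3))
```

In this toy network node c scores highest on both metrics and the pendant d lowest, so nothing would stand out as an outlier; the interesting cases are networks where the two rankings disagree.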


First, visualize the data

Visualization can be the best first step in the analytical process.

• This will give you a good feel for what is going on with your relationships
• For large networks this is often not possible
• For this example, we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
• Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout


Finding Key Actors with R

The first steps are to load the data into memory and perform some basic centrality analysis.[1]

Load the data into igraph

    library(igraph)
    G <- read.graph("drug_main.txt", format="edgelist")
    G <- as.undirected(G)
    # By default igraph inputs edgelist data as a directed graph. In this step
    # we undo this and assume that all relationships are reciprocal.

Store metrics in new data frame

    cent <- data.frame(bet=betweenness(G), eig=evcent(G)$vector)
    # evcent returns lots of data associated with the EC, but we only need the
    # leading eigenvector
    res <- lm(eig~bet, data=cent)$residuals
    cent <- transform(cent, res=res)
    # We will use the residuals in the next step

[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
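The residual step is an ordinary least-squares fit of eigenvector centrality on betweenness. The following Python sketch shows the same arithmetic without R; the scores are invented (not the Hartford data), and `ols_residuals` and the 0.25 cut-off are illustrative assumptions.

```python
def ols_residuals(x, y):
    """Residuals from a least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical scores for five actors (not the Hartford data).
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.10, 0.20, 0.90, 0.40, 0.50]

res = ols_residuals(bet, eig)
threshold = 0.25  # illustrative cut-off for flagging outliers
key_actors = [i for i, r in enumerate(res) if abs(r) > threshold]
print([round(r, 3) for r in res])
print(key_actors)
```

Only the third actor sits far from the fitted line here, which is the kind of point the key actor plot is meant to surface.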


Finding Key Actors with R

Plot the data

    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p <- ggplot(cent, aes(x=bet, y=eig, label=rownames(cent), colour=res,
                          size=abs(res))) +
         xlab("Betweenness Centrality") + ylab("Eigenvector Centrality")
    # We use the residuals to color and shape the points of our plot,
    # making it easier to spot outliers
    p + geom_text() + opts(title="Key Actor Analysis for Hartford Drug Users")
    # We use the geom_text function to plot the actors' IDs rather than points
    # so we know who is who
    # (opts() is the ggplot2 API of the time; later versions use ggtitle())

[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs (1 to 194) plotted with Betweenness Centrality on the x-axis (0 to 6000) and Eigenvector Centrality on the y-axis (0.0 to 1.0); label colour shows the residual (res) and label size shows abs(res).]


Key Actor Plot

[Figure: network plot of the Hartford drug user network with only the key actors labelled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155.]

Network plot

    # Create positions for all of the nodes w/ force directed
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    nodes <- as.vector(V(G)+1)
    # Key players defined as having a residual value > .25
    nodes[which(abs(res)<.25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes, vertex.label.dist=0.25,
         vertex.label.color='red', edge.width=1)
    dev.off()


Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships.

• Depending on the source and context of the data, these relationships can be interpreted as many different things
• Must consider the data generation process by which the edge was created
• With respect to online social graph data, ask: how do people use this service?

Twitter

Recent study shows Twitter is used as a news aggregation service: me → interests → you

• Geography and history less important
• Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

• Offline relationships are driving "friending"
• Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

• The combining of multiple platforms into a single "network" makes analysis and interpretation difficult


Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

• Social networking services are moving previously private data to the public
• Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
• Analysts are caught in the middle. There is no IRB on the Internet


Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends

• Two undergraduate students start a project, 'Gaydar'
• Claim they can predict which men are homosexual simply based on their friend structure
• Problem: Extremely private information, questionable methods

Pleaserobme.com

• Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
• Used as a proof of concept to show the danger of publicizing localization information
• Problem: Useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose

• Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
• I was personally involved in this project
• Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
• Problem: While no names were published, bordering on vigilantism


Where to get social graph data

Recently there has been an explosion of resources for scraping social graph data.

    Service          Data                                       API Docs
    Twitter          Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
    Facebook         Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
    Google           All SocialGraph relationships              http://code.google.com/apis/socialgraph
    foursquare       Friends, Check-ins                         http://foursquare.com/developers
    Hunch            "Taste graph", recommendations             http://hunch.com/developers
    New York Times   Congressional votes, campaign finance      http://developer.nytimes.com/docs

There is clearly no shortage of data.

• Each service provides different relational context
• Data formats are generally JSON, Atom, XML, or some combination
• For a more extensive list of API resources, see the HackNY wiki of local startups


Building the social network among LiveJournal users

Using a "seed" user, we will build out a network.

• In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
• Through a process called "k-snowball searching": seed → friend → ··· → friend_k
• Seed: imichaeldotorg.livejournal.com
• k = 3
• Note the low value of k
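The snowball idea can be sketched independently of any API as a breadth-first expansion that stops after k rounds. This is an illustration rather than the talk's implementation: `snowball`, `get_neighbors`, and the toy graph are all hypothetical stand-ins for a real friend lookup.

```python
def snowball(seed, get_neighbors, k):
    """Collect nodes and edges reachable within k rounds of lookups from seed.

    `get_neighbors` is a caller-supplied function (hypothetical here) that
    returns the nodes a given node links to.
    """
    visited = {seed}
    frontier = [seed]
    edges = set()
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for neighbor in get_neighbors(node):
                edges.add((node, neighbor))
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier  # only newly found nodes are expanded next
    return visited, edges


# Toy, in-memory stand-in for a real friend lookup.
toy_graph = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}

nodes, edges = snowball("seed", lambda n: toy_graph.get(n, []), k=2)
print(sorted(nodes))
```

Because every round expands all newly found nodes, the node count can grow very quickly with k, which is why a low value is worth noting.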


The code, part 1

Loading the libraries and setting things up

    # Python 2 syntax, as presented in the talk
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)
        net, newnodes = create_egonet(sg)
        info(net)

Output:

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Build the lookup query for the seed URL
        sgapi_url = "http://socialgraph.apis.google.com/lookup?q=" + seed_url + "&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
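The URL that get_sg assembles by string concatenation could also be built with percent-encoding, which avoids problems when the seed contains reserved characters. Below is a minimal Python 3 sketch, assuming the endpoint and flags as reconstructed from the slide; `build_lookup_url` is a hypothetical helper, and nothing is fetched because the lookup service may no longer be reachable.

```python
from urllib.parse import urlencode

# Base endpoint as reconstructed from get_sg; the service may no longer be reachable.
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Return the lookup URL for a seed, with the query string percent-encoded."""
    params = [
        ("q", seed_url),
        ("edo", 1),     # flags copied from the slide's query string
        ("edi", 1),
        ("fme", 1),
        ("pretty", 0),
    ]
    return SGAPI_LOOKUP + "?" + urlencode(params)


url = build_lookup_url("http://imichaeldotorg.livejournal.com")
print(url)
```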


Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            pendants = []
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            # Connect every key returned for the seed to every other key
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound ties: nodes the ego references
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound ties: nodes that reference the ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            # De-duplicate the pendants; they become the next round's seeds
            pendants = array(pendants, dtype=str)
            pendants.flatten()
            pendants = unique(pendants)
            return G, pendants
        except DecodeError:
            pass  # handler body not shown on the original slide
        except KeyError:
            pass  # handler body not shown on the original slide

Rolling the snowball

    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            # Progress report at roughly 20% of the seeds
            if s == round(len(seeds)*0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        sb_data = array(sb_data)
        sb_data.flatten()
        sb_data = unique(sb_data)
        sb_net.info()
        return sb_net, sb_data
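For readers without cjson, NetworkX, or SciPy, the core of create_egonet can be approximated with the standard library alone. The sketch below is not the author's code: `build_egonet` is a hypothetical name, the payload is hand-written to match only the fields the slide's code reads ("nodes", "nodes_referenced", "nodes_referenced_by"), and it skips the step that links the seed's own keys to one another.

```python
import json

# Hypothetical payload shaped like the fields create_egonet reads.
# The URLs are placeholders, not real SocialGraph output.
sample = json.dumps({
    "nodes": {
        "http://ego.example.com": {
            "nodes_referenced": {"http://friend1.example.com": {}},
            "nodes_referenced_by": {"http://friend2.example.com": {}},
        }
    }
})


def build_egonet(payload):
    """Return (edges, pendants) for one response, using only the stdlib."""
    raw = json.loads(payload)
    edges = set()
    pendants = set()
    for ego, record in raw.get("nodes", {}).items():
        for target in record.get("nodes_referenced", {}):
            edges.add((ego, target))       # ego links out to target
            pendants.add(target)
        for source in record.get("nodes_referenced_by", {}):
            edges.add((source, ego))       # source links in to ego
            pendants.add(source)
    return edges, sorted(pendants)


edges, pendants = build_egonet(sample)
print(sorted(edges))
print(pendants)
```

Using sets for edges and pendants gives the same de-duplication that the slide's code gets from `unique`.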


Build the whole network

    Step     Nodes   Edges   Mean Degree   Density
    Seed         5       5           2.0   0.25
    k = 2       75     115           3.0   0.02
    k = 3     4938    8659           3.5   3.6 × 10^-4

• Our seed is abnormally isolated, with only four neighbors
• Large jump after first snowball
• Massive structural leap at k = 3
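The summary columns can be recomputed from the node and edge counts. The formulas below are an assumption about how the table was produced (mean degree as 2E/N and directed density as E/(N(N-1))); they reproduce the seed and k = 3 rows, while the k = 2 mean degree comes out slightly above the rounded value shown.

```python
def mean_degree(nodes, edges):
    """Average number of edge endpoints per node: 2E / N."""
    return 2.0 * edges / nodes


def directed_density(nodes, edges):
    """Share of possible directed ties that are present: E / (N * (N - 1))."""
    return edges / float(nodes * (nodes - 1))


# Node and edge counts taken from the table above.
steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]
for label, n, e in steps:
    print(label, round(mean_degree(n, e), 2), directed_density(n, e))
```

The falling density alongside a rising mean degree is the usual signature of a sparse network: nodes are added far faster than ties between them.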


The full network

To get a feeling for the size of the full network


Live demonstration time


  • Network basics
    • The study of relationships
    • Data representation
    • Analysis & Visualization
  • Research design
    • Edge context
    • Network data and ethics
  • Digging into data
    • Getting data
    • First scrape and build
  • Live demonstration
Page 4: Mining Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Networks and the study of relationships

Network theory use the language of graph theory

G = V Euarr

In the abstract this is very powerful we have a general purpose way torepresent any number of relations

I G=RoutersPacket TrafficEdges have non-binary values

I G=US Airports Commercial RoutesEdges can represent distance cost frequency etc

I G=New York City nerds Co-membership in MeetupsEdges are implied

While both nodes and edges are needed to have a network of any substance itis the edge (relationship) that will always be the primary focus of our analyses

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Why focus on the edge

Consider a very simple example of two people meeting for the first time

If we focus on the nodes we mightconsider this meeting creates adyad

We know however that people donot exist as isolates and thisassumption is ignoring all ofexogenous social structure theseindividuals brings with them

In reality the meeting may reveal a largedegree of shared structure

By expressing this meeting in terms of thenodes as a function of their edges we maygain a much richer understanding of thestructural dynamics

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Representing network data the canonical

Perhaps the most natural way to represent the relationships between N actorsis with an NxN matrix often referred to as a ldquosociomatrixrdquo

X1 X2 XN

X1 0 1 0X2 1 0 1

XN 0 1 0

I Very intuitive representation

I Can support directional and weighted edges

I Application of matrix algebraic operation foranalysis

As an introduction to representing relationship a matrix provides insight to anetwork by itself In practice however this representation has many limitations

I Unwieldy as network sizes scale up

I Most networks are sparse too many zeroes

I Difficult to publish and share

Fortunately there are many other options for representing network data

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Representing network data the practical

Remember it is all about the edges therefore more efficient representationswill be limited to edge data

Edge list

A text file with two columns (usuallydelimited by a space or tab) where firstcolumn in source and second is target

1 2

1 3

2 1

6 7

6 8

10 12

10 13

Adjacency list

Also a text file however here the firstcolumn is the source and allsubsequent entries are ldquoadjacentrdquonodes

1 2 3

2 1

6 7 8

10 12 13

While these are the most universal data formats network data representation isa bit of a cottage industry

I Pajek (net)I GraphML (gml)I GraphViz (dot)I and domain specific formats (eek)

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

A bit on tools

The number of software suites and packages available for conducting socialnetwork analysis has exploded over the past ten years

I In general this software can be categorized in two waysI Type - many SNA tools are developed to be standalone applications while

others are language specific packagesI Intent - consumers and producer of SNA come from a wide range of

technical expertise andor need therefore there exist simple tools for datacollection and basic analysis as well as complex suites for advanced research

Standalone Apps Modules amp Packages

Basic- ORA (Windows) - libSNA (Python)- Analyst Notebook (Windows) - UrlNet (Python)- KrakPlot (Windows) - NodeXL (MS Excel)

Advanced- UCINet (Windows) - NetworkX (Python)- Pajek (Multi) - JUNG (Java)- Network Workbench (Multi) - igraph (Python R amp Ruby)

Many of the above tools have visualization components but several tools aredesigned specifically for visualization Graphviz NetDraw Tom Sawyer GephietcWhat I use

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Comparing two network metrics to find key actors

Often network analysis is used to identify key actors within a social group Toidentify these actors various centrality metrics can be computed based on anetworkrsquos structure

I Degree (number of connections)

I Betweenness (number of shortest paths an actor is on)

I Closeness (relative distance to all other actors)

I Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actorsrsquoscores for Eigenvector centrality versus Betweenness Theoretically thesemetrics should be approximately linear therefore any non-linear outliers will beof note

I An actor with very high betweenness but low EC may be a criticalgatekeeper to a central actor

I Likewise an actor with low betweenness but high EC may have uniqueaccess to central actors

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

First visualize the data

Visualization can be the best first step in the analytical process

I This will give you a goodfeel for what is going onwith your relationships

I For large networks this isoften not possible

I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges

I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

The first steps are to load the data into memory and perform some basiccentrality analysis1

Load the data into igraph

library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal

Store metrics in new data frame

centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step

1Weeks et al (2002) httpdxdoiorg101023A1015457400897

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

Plot the data

library(ggplot2)

We use ggplot2 to make things a

bit prettier

plt-ggplot(centaes(x=bety=eig

label=rownames(cent)colour=res

size=abs(res)))+xlab(Betweenness

Centrality)+ylab(Eigenvector

Centrality)

We use the residuals to color and

shape the points of our plot

making it easier to spot outliers

p+geom_text()+opts(title=Key Actor

Analysis for Hartford Drug Users)

We use the geom_text function to plot

the actorsrsquo IDrsquos rather than points

so we know who is who

[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs (1 to 194) plotted as text; x-axis: Betweenness Centrality (0 to 6000); y-axis: Eigenvector Centrality (0.0 to 1.0); text colour mapped to res, text size mapped to abs(res).]


Key Actor Plot

[Figure: network plot of the Hartford drug users network; labelled key actors: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155.]

Network plot

    # Create positions for all of the nodes w/ force directed
    l<-layout.fruchterman.reingold(G,niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size<-abs(res)*10
    # Only display the labels of key players
    # (vertex IDs were zero-based in igraph at the time, hence the +1)
    nodes<-as.vector(V(G)+1)
    # Key players defined as having a residual value >.25
    nodes[which(abs(res)<.25)]<-NA
    # Save plot as PDF
    pdf('actor_plot.pdf',pointsize=7)
    plot(G,layout=l,vertex.label=nodes,vertex.label.dist=0.25,
         vertex.label.color='red',edge.width=1)
    dev.off()


Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships.

• Depending on the source and context of the data, these relationships can be interpreted as many different things

• Must consider the data generation process by which the edge was created

• With respect to online social graph data, ask: how do people use this service?

Twitter

A recent study shows Twitter is used as a news aggregation service: me → interests → you

• Geography and history less important

• Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

• Offline relationships are driving "friending"

• Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

• The combining of multiple platforms into a single "network" makes analysis and interpretation difficult


Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."

Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."

"There's a sucker born every minute."

P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

• Social networking services are moving previously private data to the public

• Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles

• Analysts are caught in the middle. There is no IRB on the Internet


Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends

• Two undergraduate students start a project, 'Gaydar'
• Claim they can predict which men are homosexual simply based on their friend structure
• Problem: extremely private information, questionable methods

Pleaserobme.com

• Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
• Used as a proof of concept to show the danger of publicizing location information
• Problem: useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose

• Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
• I was personally involved in this project
• Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
• Problem: while no names were published, bordering on vigilantism


Where to get social graph data

Recently there has been an explosion of resources for scraping social graph data.

Service      Data                                       API Docs
Twitter      Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
Facebook     Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
Google       All SocialGraph relationships              http://code.google.com/apis/socialgraph
Foursquare   Friends, Check-ins                         http://foursquare.com/developers
Hunch        "Taste graph" recommendations              http://hunch.com/developers
NY Times     Congressional votes, campaign finance      http://developer.nytimes.com/docs

There is clearly no shortage of data.

• Each service provides different relational context

• Data formats are generally JSON, Atom, XML, or some combination

• For a more extensive list of API resources, see the HackNY wiki of local startups


Building the social network among LiveJournal users

Using a "seed" user we will build out a network

• In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data

• Through a process called "k-snowball searching": seed → friend → ··· → friend_k

• Seed: imichaeldotorg.livejournal.com
• k = 3

• Note the low value of k
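The snowball idea can be sketched on an in-memory friend list before any scraping is involved. This is a minimal illustration, not the code used for the LiveJournal network: the function name `k_snowball` and the toy `friends` dictionary are hypothetical, and k is assumed here to count hops from the seed.

```python
from collections import deque

def k_snowball(neighbors, seed, k):
    """Return every node within k hops of seed, mapped to its hop distance."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand past the k-th wave
        for friend in neighbors.get(node, ()):
            if friend not in distance:
                distance[friend] = distance[node] + 1
                queue.append(friend)
    return distance

# Hypothetical toy friend lists
friends = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "e": ["f"],
}
waves = k_snowball(friends, "seed", 3)
```

Node "f" is four hops away and is therefore left out. Each extra wave can multiply the number of lookups, which is one reason to keep k low.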


The code, part 1

Loading the libraries and setting things up

    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array,unique

    if __name__ == "__main__":
        seed_url="http://imichaeldotorg.livejournal.com"
        sg=get_sg(seed_url)
        net,newnodes=create_egonet(sg)
        info(net)
        # Name: ['http://imichaeldotorg.livejournal.com']
        # Type: DiGraph
        # Number of nodes: 5
        # Number of edges: 5
        # Average in degree: 1.0
        # Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl=urlopen(sgapi_url)
            fr=furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
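The lookup URL in get_sg is built by plain string concatenation, so a seed URL containing characters such as `&` or `?` would corrupt the query string. A hedged Python 3 sketch of the same request URL with proper escaping follows. The endpoint and parameter names are copied from get_sg; the helper name `build_lookup_url` is hypothetical. The sketch only constructs the URL and makes no network call, since the service may no longer be reachable.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are copied from get_sg
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"

def build_lookup_url(seed_url):
    """Build the lookup URL with the query string safely escaped."""
    params = {"q": seed_url, "edo": 1, "edi": 1, "fme": 1, "pretty": 0}
    return SGAPI_LOOKUP + "?" + urlencode(params)

url = build_lookup_url("http://imichaeldotorg.livejournal.com")
```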


Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw=decode(s)
            G=DiGraph()
            pendants=[]
            n=raw['nodes']
            nk=n.keys()
            G.name=str(nk)
            # Link every top-level node returned for the seed to every other one
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            for k in nk:
                ego=n[k]
                # Outbound ties: ego -> referenced node
                ego_out=ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                # Inbound ties: referencing node -> ego
                ego_in=ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            pendants=array(pendants,dtype=str)
            pendants.flatten()
            pendants=unique(pendants)
            return G,pendants
        except DecodeError:
            # (handler body not shown on the slide)
        except KeyError:
            # (handler body not shown on the slide)
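To make the parsing step concrete without NetworkX or cjson, here is a minimal Python 3 sketch that turns a SocialGraph-style JSON string into a set of directed edges and a list of pendant nodes. The keys `nodes`, `nodes_referenced`, and `nodes_referenced_by` come from create_egonet. The shape of the sample payload (dictionaries keyed by URL), the `example` URLs, and the function name `egonet_edges` are assumptions for illustration. Unlike create_egonet, this sketch does not link the top-level nodes to each other.

```python
import json

def egonet_edges(payload):
    """Return (edges, pendants) from a SocialGraph-style JSON string."""
    nodes = json.loads(payload)["nodes"]
    edges = set()
    pendants = set()
    for ego, info in nodes.items():
        for target in info.get("nodes_referenced", {}):
            edges.add((ego, target))   # ego points at target
            pendants.add(target)
        for source in info.get("nodes_referenced_by", {}):
            edges.add((source, ego))   # source points at ego
            pendants.add(source)
    return edges, sorted(pendants)

# Hypothetical response with made-up URLs
sample = json.dumps({
    "nodes": {
        "http://ego.example": {
            "nodes_referenced": {"http://a.example": {}, "http://b.example": {}},
            "nodes_referenced_by": {"http://a.example": {}},
        }
    }
})
edges, pendants = egonet_edges(sample)
```

The pendants are the nodes that would be looked up in the next snowball round.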

Rolling the snowball

    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            seeds=get_myspace_url(seeds)
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print 'AT: '+strftime('%m:%d:%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        sb_data=array(sb_data)
        sb_data.flatten()
        sb_data=unique(sb_data)
        sb_net.info()
        return sb_net,sb_data


Build the whole network

Step     Nodes   Edges   Mean Degree   Density
Seed     5       5       2.0           0.25
k = 2    75      115     3.0           0.02
k = 3    4938    8659    3.5           3.6 × 10^-4

• Our seed is abnormally isolated, with only four neighbors

• Large jump after first snowball

• Massive structural leap at k = 3
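The summary columns can be recomputed from the node and edge counts alone. The sketch below assumes mean degree is total degree (in plus out) per node and density is the share of possible directed edges. These formulas are an assumption about how the table was produced: they reproduce the seed row exactly and the other rows approximately. The function names are hypothetical.

```python
def mean_degree(n_nodes, n_edges):
    """Average total degree: each edge adds one in-degree and one out-degree."""
    return 2.0 * n_edges / n_nodes

def directed_density(n_nodes, n_edges):
    """Share of the n*(n-1) possible directed edges that are present."""
    return n_edges / float(n_nodes * (n_nodes - 1))

# Node and edge counts taken from the table
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}
summary = {
    name: (mean_degree(n, e), directed_density(n, e))
    for name, (n, e) in steps.items()
}
```

Density falls by roughly three orders of magnitude between the seed and k = 3, even though mean degree barely moves: the network grows far faster in nodes than in ties per node.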


The full network

To get a feeling for the size of the full network


Live demonstration time


  • Network basics
    • The study of relationships
    • Data representation
    • Analysis & Visualization
  • Research design
    • Edge context
    • Network data and ethics
  • Digging into data
    • Getting data
    • First scrape and build
  • Live demonstration

Why focus on the edge?

Consider a very simple example of two people meeting for the first time.

If we focus on the nodes, we might consider that this meeting creates a dyad.

We know, however, that people do not exist as isolates, and this assumption ignores all of the exogenous social structure these individuals bring with them.

In reality, the meeting may reveal a large degree of shared structure.

By expressing this meeting in terms of the nodes as a function of their edges, we may gain a much richer understanding of the structural dynamics.


Representing network data: the canonical

Perhaps the most natural way to represent the relationships between N actors is with an N×N matrix, often referred to as a "sociomatrix".

          X1   X2   …   XN
    X1     0    1   …    0
    X2     1    0   …    1
    …
    XN     0    1   …    0

• Very intuitive representation

• Can support directional and weighted edges

• Application of matrix algebraic operations for analysis

As an introduction to representing relationships, a matrix provides insight into a network by itself. In practice, however, this representation has many limitations:

• Unwieldy as network sizes scale up

• Most networks are sparse: too many zeroes

• Difficult to publish and share

Fortunately, there are many other options for representing network data.
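A small sketch makes the sparsity point visible. It builds a sociomatrix as a list of lists from directed ties and counts the zero cells. The function name `sociomatrix` and the three-actor example are hypothetical, chosen to match the 0/1 pattern shown for X1, X2 and XN.

```python
def sociomatrix(n, edges):
    """Build an n x n 0/1 matrix; edges hold 1-based (source, target) pairs."""
    matrix = [[0] * n for _ in range(n)]
    for source, target in edges:
        matrix[source - 1][target - 1] = 1
    return matrix

# Hypothetical three-actor network: 1 <-> 2 <-> 3
ties = [(1, 2), (2, 1), (2, 3), (3, 2)]
X = sociomatrix(3, ties)

# Cells that carry no tie; this count grows quickly with network size
zeros = sum(row.count(0) for row in X)
```

Even in this tiny case more than half of the nine cells are zero, and storage grows with N² regardless of how many ties exist.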


Representing network data: the practical

Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data.

Edge list

A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target.

    1 2
    1 3
    2 1
    6 7
    6 8
    10 12
    10 13

Adjacency list

Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes.

    1 2 3
    2 1
    6 7 8
    10 12 13

While these are the most universal data formats, network data representation is a bit of a cottage industry:

• Pajek (.net)
• GraphML (.gml)
• GraphViz (.dot)
• and domain specific formats (eek)
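The two text formats carry the same information, so converting between them is mechanical. Here is a minimal sketch, assuming whitespace-delimited input held in a string; the function name `edge_list_to_adjacency` is hypothetical, and the sample data is the edge list shown on this slide.

```python
from collections import OrderedDict

def edge_list_to_adjacency(text):
    """Group targets by source, keeping sources in first-seen order."""
    adjacency = OrderedDict()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split()
        adjacency.setdefault(source, []).append(target)
    return adjacency

edge_list = """1 2
1 3
2 1
6 7
6 8
10 12
10 13"""

adjacency = edge_list_to_adjacency(edge_list)
# One line per source, followed by its adjacent nodes
adjacency_lines = [" ".join([s] + ts) for s, ts in adjacency.items()]
```

The resulting lines match the adjacency list shown for the same data.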


A bit on tools

The number of software suites and packages available for conducting social network analysis has exploded over the past ten years.

• In general, this software can be categorized in two ways:
    • Type - many SNA tools are developed to be standalone applications, while others are language-specific packages
    • Intent - consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

            Standalone Apps               Modules & Packages
Basic       ORA (Windows)                 libSNA (Python)
            Analyst Notebook (Windows)    UrlNet (Python)
            KrakPlot (Windows)            NodeXL (MS Excel)
Advanced    UCINet (Windows)              NetworkX (Python)
            Pajek (Multi)                 JUNG (Java)
            Network Workbench (Multi)     igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.

What I use


Comparing two network metrics to find key actors

Often, network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:

• Degree (number of connections)

• Betweenness (number of shortest paths an actor is on)

• Closeness (relative distance to all other actors)

• Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actors' scores for Eigenvector centrality versus Betweenness. Theoretically, the relationship between these metrics should be approximately linear; therefore, any actors that fall far from that line will be of note.

• An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor

• Likewise, an actor with low betweenness but high EC may have unique access to central actors
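To show what two of these metrics measure, here is a minimal pure-Python sketch of degree and betweenness on a toy undirected graph. Betweenness is computed with Brandes' algorithm, which is one standard approach and not necessarily what any particular package uses internally. The graph and all names below are hypothetical. Eigenvector centrality is left out to keep the sketch short.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph {node: [neighbors]}."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint
    return {v: b / 2.0 for v, b in score.items()}

# Hypothetical graph: a triangle (1, 2, 3) with a tail 3 - 4 - 5
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
degree = {v: len(ns) for v, ns in graph.items()}
bet = betweenness(graph)
```

Actor 4 has the same degree as actors 1 and 2 but sits on every path to actor 5, so its betweenness is much higher. That gap between metrics is what the comparison plot is designed to expose.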


First, visualize the data

Visualization can be the best first step in the analytical process.

• This will give you a good feel for what is going on with your relationships

• For large networks this is often not possible

• For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges.

• Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout.

Page 6: Mining Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Representing network data the canonical

Perhaps the most natural way to represent the relationships between N actorsis with an NxN matrix often referred to as a ldquosociomatrixrdquo

X1 X2 XN

X1 0 1 0X2 1 0 1

XN 0 1 0

I Very intuitive representation

I Can support directional and weighted edges

I Application of matrix algebraic operation foranalysis

As an introduction to representing relationship a matrix provides insight to anetwork by itself In practice however this representation has many limitations

I Unwieldy as network sizes scale up

I Most networks are sparse too many zeroes

I Difficult to publish and share

Fortunately there are many other options for representing network data

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Representing network data the practical

Remember it is all about the edges therefore more efficient representationswill be limited to edge data

Edge list

A text file with two columns (usuallydelimited by a space or tab) where firstcolumn in source and second is target

1 2

1 3

2 1

6 7

6 8

10 12

10 13

Adjacency list

Also a text file however here the firstcolumn is the source and allsubsequent entries are ldquoadjacentrdquonodes

1 2 3

2 1

6 7 8

10 12 13

While these are the most universal data formats network data representation isa bit of a cottage industry

I Pajek (net)I GraphML (gml)I GraphViz (dot)I and domain specific formats (eek)

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

A bit on tools

The number of software suites and packages available for conducting socialnetwork analysis has exploded over the past ten years

I In general this software can be categorized in two waysI Type - many SNA tools are developed to be standalone applications while

others are language specific packagesI Intent - consumers and producer of SNA come from a wide range of

technical expertise andor need therefore there exist simple tools for datacollection and basic analysis as well as complex suites for advanced research

Standalone Apps Modules amp Packages

Basic- ORA (Windows) - libSNA (Python)- Analyst Notebook (Windows) - UrlNet (Python)- KrakPlot (Windows) - NodeXL (MS Excel)

Advanced- UCINet (Windows) - NetworkX (Python)- Pajek (Multi) - JUNG (Java)- Network Workbench (Multi) - igraph (Python R amp Ruby)

Many of the above tools have visualization components but several tools aredesigned specifically for visualization Graphviz NetDraw Tom Sawyer GephietcWhat I use

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Comparing two network metrics to find key actors

Often network analysis is used to identify key actors within a social group Toidentify these actors various centrality metrics can be computed based on anetworkrsquos structure

I Degree (number of connections)

I Betweenness (number of shortest paths an actor is on)

I Closeness (relative distance to all other actors)

I Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actorsrsquoscores for Eigenvector centrality versus Betweenness Theoretically thesemetrics should be approximately linear therefore any non-linear outliers will beof note

I An actor with very high betweenness but low EC may be a criticalgatekeeper to a central actor

I Likewise an actor with low betweenness but high EC may have uniqueaccess to central actors

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

First visualize the data

Visualization can be the best first step in the analytical process

I This will give you a goodfeel for what is going onwith your relationships

I For large networks this isoften not possible

I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges

I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

The first steps are to load the data into memory and perform some basic centrality analysis. [1]

Load the data into igraph

    library(igraph)
    G <- read.graph("drug_main.txt", format="edgelist")
    # By default igraph inputs edgelist data as a directed graph. In this step
    # we undo this and assume that all relationships are reciprocal.
    G <- as.undirected(G)

Store metrics in new data frame

    # evcent returns lots of data associated with the EC, but we only need
    # the leading eigenvector
    cent <- data.frame(bet=betweenness(G), eig=evcent(G)$vector)
    res <- lm(eig~bet, data=cent)$residuals
    # We will use the residuals in the next step
    cent <- transform(cent, res=res)

[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
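The residual step can be sketched without any statistics library. The snippet below is an illustration only: it fits the same kind of least-squares line that lm(eig~bet) fits and returns the residuals. The function name ols_residuals and the five scores are hypothetical and are not taken from the Hartford data.

```python
def ols_residuals(x, y):
    """Residuals of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical scores for five actors (not from the Hartford data).
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.10, 0.20, 0.90, 0.40, 0.50]

res = ols_residuals(bet, eig)
# The actor farthest from the fitted line is the candidate key actor.
outlier = max(range(len(res)), key=lambda i: abs(res[i]))
print(outlier, [round(r, 3) for r in res])
```

A large positive residual means more eigenvector centrality than betweenness alone would predict; a large negative residual means the reverse.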

Finding Key Actors with R

Plot the data

    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p <- ggplot(cent, aes(x=bet, y=eig, label=rownames(cent), colour=res,
                          size=abs(res))) +
         xlab("Betweenness Centrality") + ylab("Eigenvector Centrality")
    # We use the residuals to color and shape the points of our plot,
    # making it easier to spot outliers
    # We use the geom_text function to plot the actors' IDs rather than points,
    # so we know who is who
    # (opts() is the ggplot2 call of the time; later releases use labs(title=...))
    p + geom_text() + opts(title="Key Actor Analysis for Hartford Drug Users")

[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs (1 to 194) plotted by Betweenness Centrality (x-axis, 0 to 6000) against Eigenvector Centrality (y-axis, 0.0 to 1.0); color encodes res and text size encodes abs(res).]

Key Actor Plot

[Figure: network plot of the Hartford network with only the key actors labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155.]

Network plot

    # Create positions for all of the nodes w/ force directed
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    nodes <- as.vector(V(G)+1)
    # Key players defined as having a residual value >.25
    nodes[which(abs(res)<.25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes, vertex.label.dist=0.25,
         vertex.label.color='red', edge.width=1)
    dev.off()

Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships.

- Depending on the source and context of the data, these relationships can be interpreted as many different things
- Must consider the data generation process by which the edge was created
- With respect to online social graph data, ask: how do people use this service?

Twitter

A recent study shows Twitter is used as a news aggregation service: me → interests → you

- Geography and history less important
- Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

- Offline relationships are driving "friending"
- Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

- The combining of multiple platforms into a single "network" makes analysis and interpretation difficult

Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."

Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."

"There's a sucker born every minute."

P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet

Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: Extremely private information, questionable methods

PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing location information
- Problem: Useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: While no names were published, bordering on vigilantism

Where to get social graph data

Recently there has been an explosion of resources for scraping social graph data.

Service          Data                                         API Docs
Twitter          Following(ers), @-replies, date/time/geo     http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time               http://developers.facebook.com/docs/api
Google           All SocialGraph relationships                http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                           http://foursquare.com/developers
Hunch            "Taste graph" recommendations                http://hunch.com/developers
New York Times   Congressional votes, campaign finance        http://developer.nytimes.com/docs

There is clearly no shortage of data.

- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups

Building the social network among LiveJournal users

Using a "seed" user, we will build out a network.

- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
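The snowball idea can be sketched independently of any particular API. The following is a minimal illustration, not the code used in the talk: the snowball function, the toy dictionary, and its friend lists are all hypothetical stand-ins for live lookups, and k here counts rounds of expansion, which is an assumption about how the steps are numbered.

```python
def snowball(seed, neighbors, k):
    """Collect the nodes and directed edges reached within k rounds of
    expansion from seed. `neighbors` is any callable that returns the
    contacts of a node (for example, a wrapper around an API lookup)."""
    seen = {seed}
    frontier = [seed]
    edges = set()
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for friend in neighbors(node):
                edges.add((node, friend))
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        # Only newly discovered nodes are expanded in the next round.
        frontier = next_frontier
    return seen, edges

# Hypothetical friend lists standing in for live API lookups.
toy = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}

nodes, edges = snowball("seed", lambda n: toy.get(n, []), k=2)
print(sorted(nodes))
print(len(edges))
```

Because each round expands every node found in the previous round, the network can grow very quickly with k, which is why a low value is used.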

The code, part 1

Loading the libraries and setting things up

    # Python 2 code: urllib.urlopen and the print statement are Python 2 only
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)
        net, newnodes = create_egonet(sg)
        info(net)

Output:

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Build the lookup query for the seed URL
        sgapi_url = "http://socialgraph.apis.google.com/lookup?q=" + seed_url + "&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return

Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            pendants = []
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            # Link every pair of distinct keys returned for the seed
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound ties: ego -> referenced node
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound ties: referencing node -> ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            pendants = array(pendants, dtype=str)
            pendants.flatten()
            pendants = unique(pendants)
            return G, pendants
        except DecodeError:
            # (handler body not shown on the slide)
        except KeyError:
            # (handler body not shown on the slide)

Rolling the snowball

    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            if s == round(len(seeds)*0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                # Date separators were lost in extraction; '/' and ':' are assumed
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        sb_data = array(sb_data)
        sb_data.flatten()
        sb_data = unique(sb_data)
        sb_net.info()
        return sb_net, sb_data
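For readers without access to the original libraries, the egonet step can be approximated with the standard library alone. This is a sketch under stated assumptions: the payload below is a hypothetical response whose field names simply mirror the keys used in create_egonet (nodes, nodes_referenced, nodes_referenced_by), the example.org URLs are invented, and egonet_edges is an illustrative name. Unlike the slide code, it does not link the seed's multiple keys to one another.

```python
import json

# Hypothetical response; field names follow the keys used by create_egonet.
payload = json.dumps({
    "nodes": {
        "http://example.org/alice": {
            "nodes_referenced": {"http://example.org/bob": {}},
            "nodes_referenced_by": {"http://example.org/carol": {}},
        }
    }
})

def egonet_edges(text):
    """Return (edges, pendants): the directed edges around each ego, plus
    the sorted, de-duplicated neighbours to use as seeds for the next round.
    Malformed or empty input yields an empty result."""
    try:
        nodes = json.loads(text)["nodes"]
    except (ValueError, KeyError, TypeError):
        return set(), []
    edges, pendants = set(), set()
    for ego, info in nodes.items():
        # Outbound ties: ego -> referenced node
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))
            pendants.add(out)
        # Inbound ties: referencing node -> ego
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))
            pendants.add(inc)
    return edges, sorted(pendants)

edges, pendants = egonet_edges(payload)
print(sorted(edges))
print(pendants)
```

Returning an empty result on bad input lets a snowball loop skip an unreachable seed rather than stop.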

Build the whole network

Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3     4938    8659           3.5   3.6(10^-4)

- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
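The summary columns can be checked with a few lines of arithmetic. The formulas below are assumptions rather than something stated in the slides: mean degree is taken as 2E/N and density as E/(N(N-1)), the usual directed-graph definition. They reproduce the table to within rounding (the k = 2 mean degree comes out near 3.07 against the slide's 3.0).

```python
def mean_degree(n_nodes, n_edges):
    """Average number of edge endpoints per node: 2E / N."""
    return 2.0 * n_edges / n_nodes

def directed_density(n_nodes, n_edges):
    """Share of possible directed ties that are present: E / (N(N-1))."""
    return n_edges / (n_nodes * (n_nodes - 1))

# Node and edge counts as reported in the table.
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}

for name, (n, m) in steps.items():
    print(name, round(mean_degree(n, m), 2), "%.2g" % directed_density(n, m))
```

Density falls by orders of magnitude while mean degree barely moves, which is typical when a snowball adds many sparsely connected nodes at the edge of the network.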

The full network

To get a feeling for the size of the full network

Live demo

Live demonstration time

Representing network data: the practical

Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data.

Edge list

A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target.

    1 2
    1 3
    2 1
    6 7
    6 8
    10 12
    10 13

Adjacency list

Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes.

    1 2 3
    2 1
    6 7 8
    10 12 13

While these are the most universal data formats, network data representation is a bit of a cottage industry:

- Pajek (.net)
- GraphML (.graphml)
- GraphViz (.dot)
- and domain-specific formats (eek)
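As a quick illustration of how the edge list and adjacency list relate, the sketch below parses the two example files shown above and confirms that they describe the same set of directed edges. The function names are hypothetical; the libraries mentioned in this talk provide their own readers for both formats.

```python
def parse_edge_list(text):
    """One 'source target' pair per line, split on spaces or tabs."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            edges.append((parts[0], parts[1]))
    return edges

def parse_adjacency_list(text):
    """First entry on each line is the source; the rest are its targets."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if parts:
            source, targets = parts[0], parts[1:]
            edges.extend((source, t) for t in targets)
    return edges

# The two example files from the slide, as strings.
edge_text = "1 2\n1 3\n2 1\n6 7\n6 8\n10 12\n10 13"
adj_text = "1 2 3\n2 1\n6 7 8\n10 12 13"

same = parse_edge_list(edge_text) == parse_adjacency_list(adj_text)
print(same)
```

The adjacency list is simply the edge list grouped by source, so it is more compact when actors have many ties.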

A bit on tools

The number of software suites and packages available for conducting social network analysis has exploded over the past ten years.

- In general, this software can be categorized in two ways:
  - Type: many SNA tools are developed to be standalone applications, while others are language-specific packages
  - Intent: consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

            Standalone Apps               Modules & Packages
Basic       ORA (Windows)                 libSNA (Python)
            Analyst Notebook (Windows)    UrlNet (Python)
            KrakPlot (Windows)            NodeXL (MS Excel)
Advanced    UCINet (Windows)              NetworkX (Python)
            Pajek (Multi)                 JUNG (Java)
            Network Workbench (Multi)     igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.

Page 8: Mining Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

A bit on tools

The number of software suites and packages available for conducting socialnetwork analysis has exploded over the past ten years

I In general this software can be categorized in two waysI Type - many SNA tools are developed to be standalone applications while

others are language specific packagesI Intent - consumers and producer of SNA come from a wide range of

technical expertise andor need therefore there exist simple tools for datacollection and basic analysis as well as complex suites for advanced research

Standalone Apps Modules amp Packages

Basic- ORA (Windows) - libSNA (Python)- Analyst Notebook (Windows) - UrlNet (Python)- KrakPlot (Windows) - NodeXL (MS Excel)

Advanced- UCINet (Windows) - NetworkX (Python)- Pajek (Multi) - JUNG (Java)- Network Workbench (Multi) - igraph (Python R amp Ruby)

Many of the above tools have visualization components but several tools aredesigned specifically for visualization Graphviz NetDraw Tom Sawyer GephietcWhat I use

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Comparing two network metrics to find key actors

Often network analysis is used to identify key actors within a social group Toidentify these actors various centrality metrics can be computed based on anetworkrsquos structure

I Degree (number of connections)

I Betweenness (number of shortest paths an actor is on)

I Closeness (relative distance to all other actors)

I Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actorsrsquoscores for Eigenvector centrality versus Betweenness Theoretically thesemetrics should be approximately linear therefore any non-linear outliers will beof note

I An actor with very high betweenness but low EC may be a criticalgatekeeper to a central actor

I Likewise an actor with low betweenness but high EC may have uniqueaccess to central actors

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

First visualize the data

Visualization can be the best first step in the analytical process

I This will give you a goodfeel for what is going onwith your relationships

I For large networks this isoften not possible

I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges

I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

The first steps are to load the data into memory and perform some basiccentrality analysis1

Load the data into igraph

library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal

Store metrics in new data frame

centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step

1Weeks et al (2002) httpdxdoiorg101023A1015457400897

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

Plot the data

library(ggplot2)

We use ggplot2 to make things a

bit prettier

plt-ggplot(centaes(x=bety=eig

label=rownames(cent)colour=res

size=abs(res)))+xlab(Betweenness

Centrality)+ylab(Eigenvector

Centrality)

We use the residuals to color and

shape the points of our plot

making it easier to spot outliers

p+geom_text()+opts(title=Key Actor

Analysis for Hartford Drug Users)

We use the geom_text function to plot

the actorsrsquo IDrsquos rather than points

so we know who is who

[Figure: Key Actor Analysis for Hartford Drug Users. Actor IDs 1 to 194 plotted as text labels, with Betweenness Centrality on the x-axis (0 to 6000) and Eigenvector Centrality on the y-axis (0.0 to 1.0). Legends: res (-0.2 to 0.6) for colour, abs(res) (0.1 to 0.7) for size.]


Key Actor Plot

[Figure: network plot of the Hartford drug users in which only the key actors are labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155.]

Network plot

    # Create positions for all of the nodes w/ a force-directed layout
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    # (igraph indexed vertices from 0 at the time, hence the +1)
    nodes <- as.vector(V(G)+1)
    # Key players are defined as having a residual value > .25
    nodes[which(abs(res)<.25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes, vertex.label.dist=0.25,
         vertex.label.color='red', edge.width=1)
    dev.off()
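The labeling rule above (show an ID only when the absolute residual reaches the cut-off) can be sketched in plain Python. The residuals here are invented for illustration, and `key_player_labels` is a hypothetical helper, not part of igraph or of the original code.

```python
# Hypothetical residuals keyed by actor ID (not the Hartford data).
res = {1: 0.05, 2: -0.31, 3: 0.12, 4: 0.48, 5: -0.02}
THRESHOLD = 0.25  # same cut-off as on the slide


def key_player_labels(residuals, threshold):
    """Map each actor to its ID as a label, or None when it is not a key player."""
    return {
        actor: (actor if abs(r) >= threshold else None)
        for actor, r in residuals.items()
    }


labels = key_player_labels(res, THRESHOLD)
# Node sizes scale with the absolute residual, as in V(G)$size above.
sizes = {actor: abs(r) * 10 for actor, r in res.items()}
print(labels)
```

Using the sign-free residual for both the label rule and the size means over- and under-performers on eigenvector centrality are highlighted alike.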


Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships

• Depending on the source and context of the data, these relationships can be interpreted as many different things
• Must consider the data generation process by which the edge was created
• With respect to online social graph data, ask: how do people use this service?

Twitter

Recent study shows Twitter is used as a news aggregation service: me → interests → you

• Geography and history less important
• Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

• Offline relationships are driving "friending"
• Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

• The combining of multiple platforms into a single "network" makes analysis and interpretation difficult


Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

• Social networking services are moving previously private data to the public
• Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
• Analysts are caught in the middle. There is no IRB on the Internet


Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends
• Two undergraduate students start a project, 'Gaydar'
• Claim they can predict which men are homosexual simply based on their friend structure
• Problem: Extremely private information, questionable methods

Pleaserobme.com
• Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
• Used as a proof of concept to show the danger of publicizing location information
• Problem: Useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose
• Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
• I was personally involved in this project
• Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
• Problem: While no names were published, bordering on vigilantism


Where to get social graph data

Recently there has been an explosion of resources for scraping social graph data

Service      Data                                        API Docs
Twitter      Following(ers), @-replies, date/time/geo    http://apiwiki.twitter.com
Facebook     Friends, Wall Posts, date/time              http://developers.facebook.com/docs/api
Google       All SocialGraph relationships               http://code.google.com/apis/socialgraph
Foursquare   Friends, Check-ins                          http://foursquare.com/developers
Hunch        "Taste graph" recommendations               http://hunch.com/developers
NYTimes      Congressional votes, campaign finance       http://developer.nytimes.com/docs

There is clearly no shortage of data

• Each service provides different relational context
• Data formats are generally JSON, Atom, XML, or some combination
• For a more extensive list of API resources, see the HackNY wiki of local startups


Building the social network among LiveJournal users

Using a "seed" user, we will build out a network

• In Python, use NetworkX, cjson and other standard scientific libraries to parse the SocialGraph data
• Through a process called "k-snowball searching": seed → friend → ··· → friend_k
• Seed: imichaeldotorg.livejournal.com
• k = 3
• Note the low value of k
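The snowball idea can be sketched on a toy network before any scraping is involved. This is a minimal illustration under stated assumptions: the adjacency mapping `toy` is invented, `snowball` is a hypothetical helper, and a "round" here means looking up the contacts of every node found in the previous round (whether the talk counts the seed's own ego network as the first round is my assumption).

```python
def snowball(neighbors, seed, k):
    """Return the nodes found after k rounds of snowball expansion from seed."""
    seen = {seed}
    frontier = {seed}
    for _ in range(k):
        # Look up the contacts of every node discovered in the previous round.
        found = set()
        for node in frontier:
            found.update(neighbors.get(node, ()))
        frontier = found - seen  # only expand nodes we have not seen yet
        seen |= frontier
        if not frontier:
            break
    return seen


# Hypothetical contact lists; in the talk these come from an API lookup.
toy = {
    "seed": ["a", "b"],
    "a": ["seed", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}
for k in (1, 2, 3):
    print(k, sorted(snowball(toy, "seed", k)))
```

Even in this tiny example the set grows with every round, which is why a low k is usually enough to produce a large network.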


The code, part 1

Loading the libraries and setting things up

    # Python 2, as on the 2010 slides
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)
        net, newnodes = create_egonet(sg)
        info(net)

Output:

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        sgapi_url = "http://socialgraph.apis.google.com/lookup?q=" + seed_url + "&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
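As a small aside, the query string can be assembled with the standard library rather than by string concatenation, which also percent-encodes the seed URL. This sketch is Python 3 and makes no network request (as far as I know the SocialGraph service is no longer available); `build_lookup_url` is a hypothetical helper, while the endpoint and the `edo`, `edi`, `fme` and `pretty` flags are copied from the function above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint and flags as they appear in the slide's get_sg function.
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Build the lookup URL for one seed, percent-encoding the seed itself."""
    params = {"q": seed_url, "edo": 1, "edi": 1, "fme": 1, "pretty": 0}
    return SGAPI_LOOKUP + "?" + urlencode(params)


url = build_lookup_url("http://imichaeldotorg.livejournal.com")
print(url)
# Round-trip check: the query string decodes back to the original seed.
print(parse_qs(urlsplit(url).query)["q"][0])
```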


Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            pendants = []
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            pendants = []
            # Link every pair of top-level nodes returned for the seed
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound ties: ego -> referenced node
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound ties: referencing node -> ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            pendants = array(pendants, dtype=str)
            pendants.flatten()
            pendants = unique(pendants)
            return G, pendants
        except DecodeError:
            # (handler body not shown on the slide)
        except KeyError:
            # (handler body not shown on the slide)
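To make the shape of the data concrete, here is a minimal Python 3 sketch that turns one lookup response into directed edges using only the standard `json` module. The payload is hypothetical: it mirrors only the fields the function above reads (`nodes`, `nodes_referenced`, `nodes_referenced_by`), and the URLs are invented. Unlike the slide's code, `ego_edges` (a hypothetical helper) does not link the top-level nodes to each other.

```python
import json

# Hypothetical payload shaped like the fields the slide's code reads.
payload = json.dumps({
    "nodes": {
        "http://ego.example.com": {
            "nodes_referenced": {"http://friend-a.example.com": {}},
            "nodes_referenced_by": {"http://fan-b.example.com": {}},
        }
    }
})


def ego_edges(raw_json):
    """Return (directed edges, pendant nodes) for one lookup response."""
    nodes = json.loads(raw_json).get("nodes", {})
    edges, pendants = set(), set()
    for ego, info in nodes.items():
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))  # ego points at out
            pendants.add(out)
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))  # inc points at ego
            pendants.add(inc)
    return edges, pendants


edges, pendants = ego_edges(payload)
print(sorted(edges))
print(sorted(pendants))
```

The pendants are the newly discovered nodes, which become the seeds for the next snowball round.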

Rolling the snowball

    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            if s == round(len(seeds)*0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        sb_data = array(sb_data)
        sb_data.flatten()
        sb_data = unique(sb_data)
        sb_net.info()
        return sb_net, sb_data


Build the whole network

Step    Nodes   Edges   Mean Degree   Density
Seed    5       5       2.0           0.25
k = 2   75      115     3.0           0.02
k = 3   4938    8659    3.5           3.6 × 10^-4

• Our seed is abnormally isolated, with only four neighbors
• Large jump after first snowball
• Massive structural leap at k = 3
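The two summary columns can be recomputed from the node and edge counts alone. The formulas below are my assumption about how the table was produced (total degree averaged over nodes, and density as the share of possible directed edges); I chose them because they reproduce the seed row exactly.

```python
def mean_degree(n_nodes, n_edges):
    """Average total degree: each edge contributes to two nodes' degrees."""
    return 2 * n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Share of the n*(n-1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))


for step, n, m in [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]:
    print(step, round(mean_degree(n, m), 1), f"{directed_density(n, m):.1e}")
```

Under these formulas the seed and k = 3 rows match the table; the k = 2 mean degree comes out near 3.07, slightly above the 3.0 shown on the slide. Density falls by orders of magnitude because possible edges grow with the square of the node count while observed edges grow roughly linearly.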


The full network

To get a feeling for the size of the full network


Live demo

Live demonstration time


  • Network basics
    • The study of relationships
    • Data representation
    • Analysis & Visualization
  • Research design
    • Edge context
    • Network data and ethics
  • Digging into data
    • Getting data
    • First scrape and build
  • Live demonstration
Page 9: Mining Social Graph Data


Comparing two network metrics to find key actors

Often network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:

• Degree (number of connections)
• Betweenness (number of shortest paths an actor is on)
• Closeness (relative distance to all other actors)
• Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actors' scores for Eigenvector centrality versus Betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any actors that fall far from that linear trend will be of note.

• An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
• Likewise, an actor with low betweenness but high EC may have unique access to central actors
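Two of the metrics listed above are simple enough to sketch in plain Python. The toy network below is invented, and the closeness definition used (number of other nodes divided by the sum of shortest-path distances, on a connected undirected graph) is one common convention that I am assuming here, not necessarily the one a given library uses by default.

```python
from collections import deque

# Hypothetical undirected toy network as an adjacency mapping.
graph = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}


def degree(g, node):
    """Number of connections the node has."""
    return len(g[node])


def closeness(g, node):
    """(n - 1) divided by the sum of shortest-path distances from node.

    Assumes the graph is connected, so every other node is reachable.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search
        current = queue.popleft()
        for nxt in g[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return (len(g) - 1) / sum(dist.values())


for node in sorted(graph):
    print(node, degree(graph, node), round(closeness(graph, node), 2))
```

Betweenness and eigenvector centrality need more machinery (all-pairs shortest paths and an eigenvector computation), which is why the talk leans on igraph for them.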


First visualize the data

Visualization can be the best first step in the analytical process

• This will give you a good feel for what is going on with your relationships
• For large networks this is often not possible
• For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
• Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout

Page 10: Mining Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

First visualize the data

Visualization can be the best first step in the analytical process

I This will give you a goodfeel for what is going onwith your relationships

I For large networks this isoften not possible

I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges

I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

The first steps are to load the data into memory and perform some basiccentrality analysis1

Load the data into igraph

library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal

Store metrics in new data frame

centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step

1Weeks et al (2002) httpdxdoiorg101023A1015457400897

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Finding Key Actors with R

Plot the data

library(ggplot2)

We use ggplot2 to make things a

bit prettier

plt-ggplot(centaes(x=bety=eig

label=rownames(cent)colour=res

size=abs(res)))+xlab(Betweenness

Centrality)+ylab(Eigenvector

Centrality)

We use the residuals to color and

shape the points of our plot

making it easier to spot outliers

p+geom_text()+opts(title=Key Actor

Analysis for Hartford Drug Users)

We use the geom_text function to plot

the actorsrsquo IDrsquos rather than points

so we know who is who

Key Actor Analysis for Hartford Drug Users

Betweenness Centrality

Eig

enve

ctor

C

entr

ality

1 2 3 45

6

7 8

9

1011

12

13 1415

16

17

1819

20

21

22

23

24

2526

27

28

29303132

33

34 35

36

37 3839 40

41

42

43

44

4546

47

48

49

50

51

52

53

5455

56

57

58

59 6061 62

63

64

65

66

67

6869

7071 72 7374

75

76

77

78 798081 8283

8485

86

87

88

89

90

91

92

93

94

9596 9798 99

100

101

102103

104

105

106107108109

110

111

112113 114 115116 117118119120121

122 123124

125

126

127128129

130

131132133134135

136

137138

139

140

141

142143 144145146

147

148149

150151152153 154

155

156 157158 159 160161

162

163164

165

166167

168

169

170

171

172

173

174

175176

177

178179180

181

182

183

184

185

186187188189 190191 19219319400

02

04

06

08

10

0 1000 2000 3000 4000 5000 6000

res

minus02

0

02

04

06

abs(res)

01

02

03

04

05

06

07

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Key Actor Plot

20

284447

50

53

58

79

102

141155 Network plot

Create positions for all of

the nodes w force directed

llt-layoutfruchtermanreingold(G

niter=500)

Set the nodesrsquo size relative to

their residual value

V(G)$sizelt-abs(res)10

Only display the labels of key

players

nodeslt-asvector(V(G)+1)

Key players defined as have a

residual value gt25

nodes[which(abs(res)lt25)]lt-NA

Save plot as PDF

pdf(lsquoactor_plotpdfrsquopointsize=7)

plot(Glayout=lvertexlabel=nodes

vertexlabeldist=025

vertexlabelcolor=lsquoredrsquoedgewidth=1)

devoff()

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way tounderstand relationships

I Depending on the source and context of the data these relationships can be interpreted as many differentthings

I Must consider the data generation process by which the edge was created

I With respect to online social graph data ask how do people use this service

Twitter

Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you

I Geography andhistory less important

I Networks may clusteraround communitiesof interest

Facebook

Ties here are driven bypersonal contactme rarr offline rarr you

I Offline relationshipsare drivingldquofriendingrdquo

I Considerable amountof meta-data alreadyin FB what does thisadd

Google SocialGraph

Connecting all of Googlersquossocial sites togetherme rarr anything rarr you

I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Ethics and network data

Why should we be discusses ethics at a data analytics seminar

ldquoPeople have really gotten comfortablenot only sharing more information anddifferent kinds but more openly andwith more people That social norm isjust something that has evolved overtimerdquo

Mark Zuckerburg CEO Facebook

ldquoMoney is a terrible master but anexcellent servantrdquo

ldquoTherersquos a sucker born every minuterdquo

PT Barnum

In isolation the data we provide online about our relationships and preferencesare fairly innocuous Their summation however can illuminate aspects ofour lives that can be exploited for any number of ways

I Social networking services are moving previously private data to the public

I Simply because data is available does not mean that individuals want orrealize that it can be used to make inferences about who they are or theirlifestyles

I Analysts are caught in the middle There is no IRB on the Internet

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friendsI Two undergraduate students start a project lsquoGaydarrsquoI Claim they can predict which men are homosexual simply based on their

friend structureI Problem Extremely private information questionable methods

PleaserobmecomI Combined information from fousquarecom and Twitter to publicize when

people were clearly not at homeI Used as a proof of concept to show the danger of publicizing localization

informationI Problem Useful for raising awareness hurtful to anyone who was actually

robbed

Project Grey GooseI Large group of former and current DoDIC analysts collaborated to study

the identity of hackersI I was personally involved in this projectI Using public web forum data built up user profiles and social networks to

attempt to identify hackers affiliated with Russian governmentI Problem While no names were published bordering on vigilantism

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Where to get social graph data

Recently there has been an explosion of resources for scraping social graph

Service Data API Docs

Following(ers) -replies datetimegeo httpapiwikitwittercom

Friends Wall Posts datetime httpdevelopersfacebookcomdocsapi

All SocialGraph relationships httpcodegooglecomapissocialgraph

Friends Check-ins httpfoursquarecomdevelopers

ldquoTaste graphrdquo recommendations httphunchcomdevelopers

Congressional votes campaign finance httpdevelopernytimescomdocs

There is clearly no shortage of data

I Each service provides different relational context

I Data formats are generally JSON Atom XML or some combination

I For a more extensive list of API resources see HackNY wiki of localstartups

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Building the social network among LiveJournal users

Using a ldquoseedrdquo user we will build out anetwork

I In Python use NetworkX cjsonand a other standard scientificlibraries parse the SocialGraphdata

I Through a process calledldquok-snowball searchingrdquoseed rarr friend rarr middot middot middot rarr friendk

I Seed imichaeldotorglivejournalcomI k = 3

I Note the low value of k

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

The code, part 1

Loading the libraries and setting things up

    # Python 2, as presented in the talk
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array,unique

    if __name__ == "__main__":
        seed_url="http://imichaeldotorg.livejournal.com"
        sg=get_sg(seed_url)               # raw JSON for the seed
        net,newnodes=create_egonet(sg)    # ego network plus the nodes it touches
        info(net)

Output of info(net):

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Build the lookup URL for this node
        sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl=urlopen(sgapi_url)
            fr=furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
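For readers on Python 3, the URL construction inside `get_sg` can be sketched with the standard library alone. This is an assumption-laden sketch rather than the talk's code: `build_lookup_url` is a hypothetical helper, the endpoint and flags are copied from the slide, and, unlike the slide, the seed URL is percent-encoded. To my knowledge Google has since retired the SocialGraph API, so the sketch only builds the URL and never opens it.

```python
from urllib.parse import urlencode

# Endpoint as it appears on the slide; the service may no longer respond.
SG_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Build the lookup URL with the same flags the slide's get_sg uses."""
    params = [
        ("q", seed_url),  # node to look up
        ("edo", 1),       # flags copied verbatim from the slide
        ("edi", 1),
        ("fme", 1),
        ("pretty", 0),
    ]
    return SG_LOOKUP + "?" + urlencode(params)


url = build_lookup_url("http://imichaeldotorg.livejournal.com")
print(url)
```

Encoding the query value avoids ambiguity when a seed URL itself contains `&` or `?`.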

Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw=decode(s)
            G=DiGraph()
            n=raw['nodes']
            nk=n.keys()
            G.name=str(nk)
            pendants=[]
            # Link every pair of top-level nodes returned for this lookup
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            for k in nk:
                ego=n[k]
                # Outgoing ties: ego -> o
                ego_out=ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                # Incoming ties: i -> ego
                ego_in=ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            # De-duplicate the nodes to visit in the next round
            pendants=array(pendants,dtype=str)
            pendants=unique(pendants)
            return G,pendants
        except DecodeError:
            pass    # handler body not shown on the slide
        except KeyError:
            pass    # handler body not shown on the slide
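The parsing step in `create_egonet` can be tried offline with the standard `json` module. The sketch below is hypothetical: the `sample` payload is hand-made to mimic only the three keys the slide's code reads (`nodes`, `nodes_referenced`, `nodes_referenced_by`), and `egonet_edges` is an illustrative helper that returns plain sets instead of a NetworkX graph. It also leaves out the slide's step that links the top-level nodes to each other.

```python
import json

# Hypothetical payload shaped like the keys create_egonet reads.
sample = json.dumps({
    "nodes": {
        "http://ego.example.com": {
            "nodes_referenced": {"http://friend-a.example.com": {}},
            "nodes_referenced_by": {"http://fan-b.example.com": {}},
        }
    }
})


def egonet_edges(payload):
    """Return (edges, pendants) for one lookup response."""
    raw = json.loads(payload)
    edges = set()
    pendants = set()
    for ego, info in raw["nodes"].items():
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))   # ego points at `out`
            pendants.add(out)
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))   # `inc` points at ego
            pendants.add(inc)
    return edges, pendants


edges, pendants = egonet_edges(sample)
print(sorted(edges))
print(sorted(pendants))
```

The pendants are the nodes that would seed the next snowball round.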

Rolling the snowball

    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            seeds=get_myspace_url(seeds)    # helper not shown on the slide
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new ego network into the running network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            # Progress report one fifth of the way through the seeds
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print 'AT: '+strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this...
        # De-duplicate the nodes found in this round
        sb_data=array(sb_data)
        sb_data=unique(sb_data)
        sb_net.info()
        return sb_net,sb_data
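The merge step in `snowball_round` relies on NetworkX's `compose`, which returns the union of the nodes and edges of two graphs. A rough picture of that union, assuming graphs stored as plain dictionaries of successor sets, is sketched below; `compose_adjacency` and the two toy ego networks are hypothetical and only illustrate the idea.

```python
def compose_adjacency(g, h):
    """Union of two directed graphs stored as {node: set(successors)}."""
    merged = {}
    for graph in (g, h):
        for node, succs in graph.items():
            merged.setdefault(node, set()).update(succs)
            for s in succs:
                merged.setdefault(s, set())  # make sure targets are nodes too
    return merged


# Two hypothetical ego networks that share some nodes.
ego_a = {"seed": {"a", "b"}}
ego_b = {"a": {"c"}, "b": {"seed"}}

net = compose_adjacency(ego_a, ego_b)
n_nodes = len(net)
n_edges = sum(len(s) for s in net.values())
print(n_nodes, n_edges)
```

Shared nodes are counted once, which is why the merged network grows more slowly than the sum of its ego networks.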

Build the whole network

Step    Nodes   Edges   Mean Degree   Density
Seed    5       5       2.0           0.25
k = 2   75      115     3.0           0.02
k = 3   4938    8659    3.5           3.6 × 10^-4

• Our seed is abnormally isolated, with only four neighbors

• Large jump after first snowball

• Massive structural leap at k = 3
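The summary columns in the table can be approximately recomputed from the node and edge counts. The slide does not state its formulas, so the sketch below assumes mean degree is 2E/N and density is the directed-graph density E/(N(N-1)); both helper names are hypothetical.

```python
def mean_degree(nodes, edges):
    """Average total degree, counting each edge at both of its endpoints."""
    return 2.0 * edges / nodes


def directed_density(nodes, edges):
    """Share of the N*(N-1) possible directed edges that are present."""
    return edges / (nodes * (nodes - 1))


# Node and edge counts taken from the table above.
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}
for name, (n, m) in steps.items():
    print(name, round(mean_degree(n, m), 2), "%.2g" % directed_density(n, m))
```

Under these assumptions the seed row is reproduced exactly and the other rows come out close to the reported values, with density falling by orders of magnitude as the network grows.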

The full network

To get a feeling for the size of the full network

Live demonstration time

  • Network basics
    • The study of relationships
    • Data representation
    • Analysis & Visualization
  • Research design
    • Edge context
    • Network data and ethics
  • Digging into data
    • Getting data
    • First scrape and build
  • Live demonstration
Finding Key Actors with R

The first steps are to load the data into memory and perform some basic centrality analysis [1]

Load the data into igraph

    library(igraph)
    G<-read.graph("drug_main.txt",format="edgelist")
    G<-as.undirected(G)
    # By default, igraph inputs edgelist data as a directed graph. In this step
    # we undo this and assume that all relationships are reciprocal.

Store metrics in new data frame

    cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
    # evcent returns lots of data associated with the EC, but we only need the
    # leading eigenvector
    res<-lm(eig~bet,data=cent)$residuals
    cent<-transform(cent,res=res)
    # We will use the residuals in the next step

[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
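The residual step, regressing eigenvector centrality on betweenness and keeping what the line fails to explain, can be worked through by hand in a few lines of Python. The numbers below are made up for illustration (they are not the Hartford data), and `ols_residuals` is a hypothetical helper that fits an ordinary least-squares line in the way `lm(eig~bet)` does.

```python
def ols_residuals(x, y):
    """Residuals of the least-squares line y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical centrality scores for five actors.
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.1, 0.2, 0.9, 0.4, 0.5]

res = ols_residuals(bet, eig)
for actor, r in enumerate(res, start=1):
    print(actor, round(r, 3))
```

An actor with a large positive residual has more eigenvector centrality than its betweenness would predict, and a large negative residual means the reverse; those are the outliers the key actor analysis looks for.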

Finding Key Actors with R

Plot the data

    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p<-ggplot(cent,aes(x=bet,y=eig,label=rownames(cent),colour=res,
        size=abs(res)))+xlab("Betweenness Centrality")+ylab("Eigenvector Centrality")
    # We use the residuals to color and shape the points of our plot,
    # making it easier to spot outliers
    # opts() is the ggplot2 API of the time; later releases set the title with ggtitle()
    p+geom_text()+opts(title="Key Actor Analysis for Hartford Drug Users")
    # We use the geom_text function to plot the actors' IDs rather than points,
    # so we know who is who

[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs (1 to 194) plotted by Betweenness Centrality (x-axis, 0 to 6000) against Eigenvector Centrality (y-axis, 0.0 to 1.0), colored by res and sized by abs(res).]

Key Actor Plot

[Figure: network plot of the Hartford drug users network, with only the key actors' IDs labeled.]

Network plot

    # Create positions for all of the nodes w/ force directed
    l<-layout.fruchterman.reingold(G,niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size<-abs(res)*10
    # Only display the labels of key players
    nodes<-as.vector(V(G)+1)
    # Key players defined as having a residual value >.25
    nodes[which(abs(res)<.25)]<-NA
    # Save plot as PDF
    pdf('actor_plot.pdf',pointsize=7)
    plot(G,layout=l,vertex.label=nodes,
        vertex.label.dist=0.25,
        vertex.label.color='red',edge.width=1)
    dev.off()
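The labeling rule in the R code, which blanks the label of every node whose absolute residual is below 0.25, is easy to check on a handful of values. The Python sketch below uses made-up residuals and a hypothetical `key_actor_labels` helper; `None` plays the role of R's `NA`.

```python
def key_actor_labels(res, threshold=0.25):
    """Return 1-based IDs for key actors and None for everyone else."""
    return [i if abs(r) >= threshold else None
            for i, r in enumerate(res, start=1)]


# Hypothetical residuals for five actors.
res = [-0.12, 0.31, 0.02, -0.27, 0.10]
labels = key_actor_labels(res)
print(labels)
```

Both signs count: an actor can be a key player because its residual is strongly positive or strongly negative.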

Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships

• Depending on the source and context of the data, these relationships can be interpreted as many different things

• Must consider the data generation process by which the edge was created

• With respect to online social graph data, ask: how do people use this service?

Twitter

Recent study shows Twitter is used as a news aggregation service: me → interests → you

• Geography and history less important

• Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

• Offline relationships are driving "friending"

• Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google's social sites together: me → anything → you

• The combining of multiple platforms into a single "network" makes analysis and interpretation difficult

Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."

Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."

"There's a sucker born every minute."

P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

• Social networking services are moving previously private data to the public

• Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles

• Analysts are caught in the middle. There is no IRB on the Internet

Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends

• Two undergraduate students start a project, 'Gaydar'

• Claim they can predict which men are homosexual simply based on their friend structure

• Problem: Extremely private information, questionable methods

Pleaserobme.com

• Combined information from foursquare.com and Twitter to publicize when people were clearly not at home

• Used as a proof of concept to show the danger of publicizing localization information

• Problem: Useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose

• Large group of former and current DoD/IC analysts collaborated to study the identity of hackers

• I was personally involved in this project

• Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government

• Problem: While no names were published, bordering on vigilantism

Network basicsResearch design

Digging into dataLive demonstration

The study of relationshipsData representationAnalysis amp Visualization

Key Actor Plot

20

284447

50

53

58

79

102

141155 Network plot

Create positions for all of

the nodes w force directed

llt-layoutfruchtermanreingold(G

niter=500)

Set the nodesrsquo size relative to

their residual value

V(G)$sizelt-abs(res)10

Only display the labels of key

players

nodeslt-asvector(V(G)+1)

Key players defined as have a

residual value gt25

nodes[which(abs(res)lt25)]lt-NA

Save plot as PDF

pdf(lsquoactor_plotpdfrsquopointsize=7)

plot(Glayout=lvertexlabel=nodes

vertexlabeldist=025

vertexlabelcolor=lsquoredrsquoedgewidth=1)

devoff()

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way tounderstand relationships

I Depending on the source and context of the data these relationships can be interpreted as many differentthings

I Must consider the data generation process by which the edge was created

I With respect to online social graph data ask how do people use this service

Twitter

Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you

I Geography andhistory less important

I Networks may clusteraround communitiesof interest

Facebook

Ties here are driven bypersonal contactme rarr offline rarr you

I Offline relationshipsare drivingldquofriendingrdquo

I Considerable amountof meta-data alreadyin FB what does thisadd

Google SocialGraph

Connecting all of Googlersquossocial sites togetherme rarr anything rarr you

I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Ethics and network data

Why should we be discusses ethics at a data analytics seminar

ldquoPeople have really gotten comfortablenot only sharing more information anddifferent kinds but more openly andwith more people That social norm isjust something that has evolved overtimerdquo

Mark Zuckerburg CEO Facebook

ldquoMoney is a terrible master but anexcellent servantrdquo

ldquoTherersquos a sucker born every minuterdquo

PT Barnum

In isolation the data we provide online about our relationships and preferencesare fairly innocuous Their summation however can illuminate aspects ofour lives that can be exploited for any number of ways

I Social networking services are moving previously private data to the public

I Simply because data is available does not mean that individuals want orrealize that it can be used to make inferences about who they are or theirlifestyles

I Analysts are caught in the middle There is no IRB on the Internet

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Edge contextNetwork data and ethics

Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friendsI Two undergraduate students start a project lsquoGaydarrsquoI Claim they can predict which men are homosexual simply based on their

friend structureI Problem Extremely private information questionable methods

PleaserobmecomI Combined information from fousquarecom and Twitter to publicize when

people were clearly not at homeI Used as a proof of concept to show the danger of publicizing localization

informationI Problem Useful for raising awareness hurtful to anyone who was actually

robbed

Project Grey GooseI Large group of former and current DoDIC analysts collaborated to study

the identity of hackersI I was personally involved in this projectI Using public web forum data built up user profiles and social networks to

attempt to identify hackers affiliated with Russian governmentI Problem While no names were published bordering on vigilantism

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Where to get social graph data

Recently there has been an explosion of resources for scraping social graph

Service Data API Docs

Following(ers) -replies datetimegeo httpapiwikitwittercom

Friends Wall Posts datetime httpdevelopersfacebookcomdocsapi

All SocialGraph relationships httpcodegooglecomapissocialgraph

Friends Check-ins httpfoursquarecomdevelopers

ldquoTaste graphrdquo recommendations httphunchcomdevelopers

Congressional votes campaign finance httpdevelopernytimescomdocs

There is clearly no shortage of data

I Each service provides different relational context

I Data formats are generally JSON Atom XML or some combination

I For a more extensive list of API resources see HackNY wiki of localstartups

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

Building the social network among LiveJournal users

Using a ldquoseedrdquo user we will build out anetwork

I In Python use NetworkX cjsonand a other standard scientificlibraries parse the SocialGraphdata

I Through a process calledldquok-snowball searchingrdquoseed rarr friend rarr middot middot middot rarr friendk

I Seed imichaeldotorglivejournalcomI k = 3

I Note the low value of k

Drew Conway Mining and Analyzing Online Social Graph Data

Network basicsResearch design

Digging into dataLive demonstration

Getting dataFirst scrape and build

The code part 1

Loading the libraries and setting things up

from cjson import

from urllib import

from networkx import

from time import

from scipy import arrayunique

if __name__ == __main__

seed_url=lsquolsquohttpimichaeldotorglivejournalcom

sg=get_sg(seed_url)

netnewnodes=create_egonet(sg)

info(net)

Name [lsquohttpimichaeldotorglivejournalcomrsquo]

Type DiGraph

Number of nodes 5

Number of edges 5

Average in degree 10

Average out degree 10

Get the JSON from SocialGraph

def get_sg(seed_url)

sgapi_url=httpsocialgraphapisgooglecomlookupq=+seed_url+ampedo=1ampedi=1ampfme=1amppretty=0

try

furl=urlopen(sgapi_url)

fr=furlread()

furlclose()

return fr

except IOError

print Could not connect to website

print sgapi_url

return


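Build egonet and snowball

Creating the egonet:

    def create_egonet(s):
        try:
            raw=decode(s)                   # cjson: JSON string -> dict
            G=DiGraph()
            pendants=[]                     # neighbors to visit in the next round
            n=raw['nodes']
            nk=n.keys()
            G.name=str(nk)
            # Connect every pair of top-level nodes returned by the lookup
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            for k in nk:
                ego=n[k]
                ego_out=ego['nodes_referenced']       # ego -> other
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                ego_in=ego['nodes_referenced_by']     # other -> ego
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            pendants=array(pendants,dtype=str)
            pendants=unique(pendants)       # drop repeated neighbors
            return G,pendants
        except DecodeError:
            raise   # handler body is not shown on the slide; re-raising is an assumption
        except KeyError:
            raise   # handler body is not shown on the slide; re-raising is an assumption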
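To make the structure that `create_egonet` walks more concrete, here is a small Python 3 sketch that parses a hand-written response with the standard `json` module. The sample document is invented for illustration: only the keys `nodes`, `nodes_referenced`, and `nodes_referenced_by` come from the code above, and the empty attribute dictionaries are placeholders rather than a claim about what the service returned. The sketch also omits the step that links the top-level nodes to each other.

```python
import json

# Hand-written sample; the URLs and everything beyond the three keys are made up.
sample = json.dumps({
    "nodes": {
        "http://ego.example/": {
            "nodes_referenced": {"http://alice.example/": {}, "http://bob.example/": {}},
            "nodes_referenced_by": {"http://carol.example/": {}},
        }
    }
})


def ego_edges(raw_json):
    """Return (edges, pendants) for a lookup response.

    edges: set of directed (source, target) pairs
    pendants: sorted list of neighbours to visit in the next snowball round
    """
    nodes = json.loads(raw_json)["nodes"]
    edges, pendants = set(), set()
    for ego, info in nodes.items():
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))        # ego links to `out`
            pendants.add(out)
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))        # `inc` links to ego
            pendants.add(inc)
    return edges, sorted(pendants)


edges, pendants = ego_edges(sample)
print(len(edges), pendants)
```

Keeping the direction of each pair explicit is what lets the result be loaded into a directed graph, where "links to" and "is linked from" are different relationships.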

Rolling the snowball:

    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            seeds=get_myspace_url(seeds)    # helper not shown on the slide
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new ego network into the growing snowball network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print 'AT '+strftime('%m/%d/%Y %H:%M:%S', gmtime())   # date/time separators assumed
                print ''
            # More time keeping, probably a MUCH better way to do this
        sb_data=array(sb_data)
        sb_data=unique(sb_data)             # seeds for the next round
        sb_net.info()
        return sb_net,sb_data
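The comment about time keeping invites a simpler approach. One possible sketch in Python 3 wraps the seed list in a generator that reports each time another fixed share of the work is done. The name `with_progress`, the `report` callback, and the 20% default step are illustrative assumptions, not part of the original script.

```python
import time


def with_progress(items, step=0.2, report=print):
    """Yield items unchanged, calling report() each time another `step`
    fraction of the sequence has been processed."""
    items = list(items)
    total = len(items)
    start = time.time()
    next_mark = step
    for done, item in enumerate(items, start=1):
        yield item
        # Runs once the caller has finished with `item`; report every
        # threshold that this item crossed (small tolerance for float sums).
        while done / total >= next_mark - 1e-9:
            elapsed = time.time() - start
            report("%d%% complete after %.1f s" % (round(next_mark * 100), elapsed))
            next_mark += step


messages = []
for seed in with_progress(["a", "b", "c", "d", "e"], report=messages.append):
    pass  # a real round would fetch and merge the seed's ego network here
print(len(messages))
```

With this wrapper the loop body stays free of bookkeeping, and the reporting interval becomes a single parameter instead of one hand-written block per threshold.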


Build the whole network

Step     Nodes    Edges    Mean Degree    Density
Seed         5        5            2.0    0.25
k = 2       75      115            3.0    0.02
k = 3     4938     8659            3.5    3.6 × 10^-4

• Our seed is abnormally isolated, with only four neighbors
• Large jump after first snowball
• Massive structural leap at k = 3
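The last two columns can be recomputed from the node and edge counts. The sketch below (Python 3) assumes that mean degree counts both in- and out-edges, 2E/N, and that density is the directed-graph density E / (N(N-1)). These definitions are inferred rather than stated on the slide: they come close to the tabulated values, although the k = 2 mean degree computes to about 3.07 rather than 3.0.

```python
def mean_degree(n_nodes, n_edges):
    """Average total (in + out) degree: each edge touches two nodes."""
    return 2.0 * n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Share of the N * (N - 1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Node and edge counts taken from the table above.
steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]
for label, n, e in steps:
    print("%-6s mean degree %.2f  density %.2g"
          % (label, mean_degree(n, e), directed_density(n, e)))
```

Mean degree barely moves between rounds while density falls by orders of magnitude, since the number of possible edges grows with the square of the node count.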


The full network

To get a feeling for the size of the full network


Live demo

Live demonstration time

Page 14: Mining Social Graph Data


Not all edges are created equally

I have spent a lot of time tonight describing network analysis as a way to understand relationships.

• Depending on the source and context of the data, these relationships can be interpreted as many different things
• We must consider the data generation process by which the edge was created
• With respect to online social graph data, ask: how do people use this service?

Twitter

A recent study shows Twitter is used as a news aggregation service: me → interests → you

• Geography and history are less important
• Networks may cluster around communities of interest

Facebook

Ties here are driven by personal contact: me → offline → you

• Offline relationships are driving “friending”
• Considerable amount of meta-data already in FB; what does this add?

Google SocialGraph

Connecting all of Google’s social sites together: me → anything → you

• The combining of multiple platforms into a single “network” makes analysis and interpretation difficult


Ethics and network data

Why should we be discussing ethics at a data analytics seminar?

“People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time.”
    Mark Zuckerberg, CEO, Facebook

“Money is a terrible master but an excellent servant.”
“There’s a sucker born every minute.”
    P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.

• Social networking services are moving previously private data to the public
• Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
• Analysts are caught in the middle: there is no IRB on the Internet


Examples of ethically questionable network analyses

MIT study predicts sexual orientation from Facebook friends
• Two undergraduate students start a project, ‘Gaydar’
• Claim they can predict which men are homosexual simply based on their friend structure
• Problem: extremely private information, questionable methods

PleaseRobMe.com
• Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
• Used as a proof of concept to show the danger of publicizing location information
• Problem: useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose
• Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
• I was personally involved in this project
• Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
• Problem: while no names were published, bordering on vigilantism

Page 19: Mining Social Graph Data

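The code, part 1

Loading the libraries and setting things up

    # Python 2 code, as presented on the slide
    from cjson import *       # provides decode() and DecodeError
    from urllib import *      # provides urlopen()
    from networkx import *    # provides DiGraph, compose() and info()
    from time import *        # provides time(), strftime() and gmtime()
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)                # raw JSON for the seed
        net, newnodes = create_egonet(sg)    # ego network plus its pendant nodes
        info(net)

Output:

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0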
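The two averages in this output follow from simple counting: every directed edge adds one to some node's out-degree and one to some node's in-degree, so both averages equal edges / nodes (here 5 / 5 = 1.0). A minimal sketch in Python 3 using only the standard library; `average_degrees` and the toy edge list are hypothetical illustrations, not the seed's real data.

```python
from collections import Counter

def average_degrees(edges):
    """Return (avg_in_degree, avg_out_degree) for directed (source, target) edges."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    n = len(set(out_deg) | set(in_deg))   # every node that touches an edge
    return sum(in_deg.values()) / n, sum(out_deg.values()) / n

# Hypothetical 5-node, 5-edge directed graph (made up for illustration)
toy_edges = [("ego", "a"), ("ego", "b"), ("c", "ego"), ("d", "ego"), ("a", "ego")]
avg_in, avg_out = average_degrees(toy_edges)
print(avg_in, avg_out)
```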

Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Ask for outbound edges (edo), inbound edges (edi) and followed "me" links (fme)
        sgapi_url = ("http://socialgraph.apis.google.com/lookup?q=" + seed_url +
                     "&edo=1&edi=1&fme=1&pretty=0")
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
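The slide concatenates the seed URL straight into the query string, so a seed containing `&` or `?` would corrupt the request. A hedged sketch of the same URL construction with percent-encoding, written for Python 3; `build_lookup_url` is a hypothetical helper, the endpoint and parameters are copied from the slide, and no network request is made.

```python
from urllib.parse import urlencode

# Endpoint as shown on the slide
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"

def build_lookup_url(seed_url, base=SGAPI_LOOKUP):
    """Build the lookup URL with the query string percent-encoded."""
    params = [("q", seed_url), ("edo", 1), ("edi", 1), ("fme", 1), ("pretty", 0)]
    return base + "?" + urlencode(params)

print(build_lookup_url("http://imichaeldotorg.livejournal.com"))
```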

Page 20: Mining Social Graph Data

Build egonet and snowball

Creating the egonet

    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            pendants = []
            # Link every pair of top-level URLs returned for the seed
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound edges: pages the ego links to
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound edges: pages that link to the ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            # De-duplicate the pendants; they seed the next snowball round
            pendants = unique(array(pendants, dtype=str))
            return G, pendants
        except DecodeError:
            raise    # handler body is not shown on the slide
        except KeyError:
            raise    # handler body is not shown on the slide
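The edge-building logic can be sketched without cjson, networkx or scipy. The payload shape assumed here (a "nodes" object whose values hold "nodes_referenced" and "nodes_referenced_by") is inferred from the keys the slide code reads; the sample payload and the name `egonet_edges` are invented for illustration. Unlike the slide, this sketch does not link the top-level URLs to each other.

```python
import json

def egonet_edges(payload):
    """Return (edges, pendants) from a Social Graph style JSON string."""
    nodes = json.loads(payload)["nodes"]
    edges, pendants = set(), set()
    for k, ego in nodes.items():
        for o in ego.get("nodes_referenced", {}):      # ego -> other
            edges.add((k, o))
            pendants.add(o)
        for i in ego.get("nodes_referenced_by", {}):   # other -> ego
            edges.add((i, k))
            pendants.add(i)
    return edges, sorted(pendants)

# Made-up payload for illustration only
sample = json.dumps({"nodes": {"http://ego.example/": {
    "nodes_referenced": {"http://a.example/": {}, "http://b.example/": {}},
    "nodes_referenced_by": {"http://a.example/": {}, "http://c.example/": {}},
}}})

edges, pendants = egonet_edges(sample)
print(len(edges), pendants)
```

Using sets makes the de-duplication implicit, which is the role `unique` plays in the slide code.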

Rolling the snowball

    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            # Collect this ego's pendants as candidate seeds for the next round
            for p in pen:
                sb_data.append(p)
            # Merge the new egonet into the growing network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            # Progress report once 20% of the seeds are done
            if s == round(len(seeds) * 0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping, probably a MUCH better way to do this
        # De-duplicate the pendants gathered across all seeds
        sb_data = unique(array(sb_data))
        sb_net.info()
        return sb_net, sb_data
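Each round feeds the pendants found in the previous round back in as seeds. A minimal sketch of that idea in Python 3, with an in-memory dictionary standing in for the API lookups; `snowball`, `neighbors` and the toy graph are hypothetical names and data. As a design choice that differs from the slide, nodes already visited are not looked up again.

```python
def snowball(neighbors, seed, rounds):
    """Return the set of nodes reached from `seed` after `rounds` snowball rounds.

    `neighbors` maps a node to an iterable of the nodes it is linked to.
    """
    seen = {seed}
    frontier = {seed}
    for _ in range(rounds):
        found = set()
        for node in frontier:
            found.update(neighbors.get(node, ()))
        frontier = found - seen    # only unseen nodes seed the next round
        seen |= frontier
    return seen

# Made-up adjacency data for illustration
toy = {"s": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}
for k in (1, 2, 3):
    print(k, sorted(snowball(toy, "s", k)))
```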

Page 21: Mining Social Graph Data

Build the whole network

Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3     4938    8659           3.5   3.6 × 10^-4

I Our seed is abnormally isolated, with only four neighbors

I Large jump after first snowball

I Massive structural leap at k = 3
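The last two columns can be checked from the node and edge counts alone. The formulas below are inferred from the numbers rather than stated on the slide: mean degree as 2m / n and directed density as m / (n(n - 1)). They reproduce the table closely, though the k = 2 mean degree computes to about 3.07 against the 3.0 shown. Function names are hypothetical.

```python
def mean_degree(n_nodes, n_edges):
    """Average total degree: each edge touches two endpoints."""
    return 2.0 * n_edges / n_nodes

def directed_density(n_nodes, n_edges):
    """Share of the n(n - 1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))

# (step, nodes, edges) as reported in the table
for step, n, m in [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]:
    print(step, round(mean_degree(n, m), 2), directed_density(n, m))
```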

Page 22: Mining Social Graph Data

The full network

To get a feeling for the size of the full network

Page 23: Mining Social Graph Data

Live demo

Live demonstration time
