Building Rich Social Network Data: Schema to aid designing, collecting and evaluating social network data. Eamonn O’Loughlin ([email protected]), University College Dublin; Diane Payne, University College Dublin

Page 1: Building Rich Social Network Data

Building Rich Social Network Data

Schema to aid designing, collecting and evaluating social network data

Eamonn O’Loughlin, [email protected]

University College Dublin

Diane Payne,

University College Dublin

Page 2: Building Rich Social Network Data

Why Social Networks

Social interactions and social networks are an enduring component of our everyday lives.

Social Networks are (among other things):

• Basis upon which information and behaviours diffuse through a population

• Cornerstone for trade and cooperation

• Key component in determining the languages we speak, goals we aspire to, and values we hold

Page 3: Building Rich Social Network Data

Background & Motivation

Eamonn O’Loughlin: Early-stage PhD researcher in the Dynamics Lab, with an interest in predictive modelling of social behaviour using social position and structure. Also interested in large network visualisations and policy design that recognises / leverages network effects.

Motivation: Use Social Network Analysis techniques to uncover patterns and relationships between network structure/activity and micro-level network outcomes (individual actions or decisions).

[Figure: research timeline. Motivation + Intuition + Problem + Hypothesis, then Data Strategy, Design & Collection, then Hypothesis Evaluation, then Conclusion]

Page 4: Building Rich Social Network Data

Background & Motivation

Intended Audience: Researchers who are / will be creating a social network dataset

(1) Precautionary: Nobody wants to realise that they didn’t consider some easy-to-collect yet suddenly vital-for-analysis feature after their data has already been collected

(2) Not Straightforward: Social Network data & data design is complex – compared to traditional multi-dimensional data there are many different assumptions that must be made and (as we will see) quite a few trade-offs

[Figure: research timeline. Motivation + Intuition + Problem + Hypothesis, then Data Strategy, Design & Collection, then Hypothesis Evaluation, then Conclusion]

Today’s focus

(not covered today: domain of analysis & specific domain challenges)

Page 5: Building Rich Social Network Data

What is Social Network Data

“Social network views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals)”

Page 6: Building Rich Social Network Data

Brief (Subjective) History of Social Network Analysis

J. A. Barnes
Stephen Borgatti
S. D. Berkowitz
Ronald Burt
Kathleen Carley
Martin Everett
Katherine Faust
Linton Freeman
Mark Granovetter
David Knoke
David Krackhardt
Peter Marsden
Nicholas Mullins
Anatol Rapoport
Stanley Wasserman
Barry Wellman
Douglas R. White
Harrison White
Tom A. B. Snijders
Garry Robins
Nan Lin
Karen Cook

Page 8: Building Rich Social Network Data

Brief (Subjective) History of Social Network Analysis

[Figure: researchers shown alongside their contributions and research themes]

Researchers: J. A. Barnes, Stephen Borgatti, S. D. Berkowitz, Ronald Burt, Kathleen Carley, Martin Everett, Katherine Faust, Linton Freeman, Mark Granovetter, David Knoke, David Krackhardt, Peter Marsden, Nicholas Mullins, Anatol Rapoport, Stanley Wasserman, Barry Wellman, Douglas R. White, Harrison White, Tom A. B. Snijders, Garry Robins, Nan Lin, Karen Cook

Contributions and themes:
• Social Capital (& structural holes)
• Dynamic Network Analysis
• Social Constructs / Persistent Social Formations
• Social Network Visualisation
• Diffusion of Innovation
• Social Networks & the Internet
• Network Realism
• UCINet
• Communication, co-authorship, and colleagueship
• Inter-organisational political networks & Terrorist Networks
• ‘The Strength of Weak Ties’ (economic networks)
• Comparative Network Methods
• Multilevel Analysis & SIENA
• Social Structure & Cognition
• Social Structures
• Statistical Models for Social Networks
• Formal Organisations & Social Networks
• Consensus Analysis
• Social Networks as a Science
• ERGMs
• Network Theory of Social Capital
• Exchange & Trust

Page 9: Building Rich Social Network Data

Why is this a Problem

Impossible to collect perfect & complete data => trade-offs are being made

• Privacy Concerns
• Network Data Collection is Expensive
• Many design decisions
• Different Practitioners
• Difficult to sample network data

A social network data schema will help make this process explicit and transparent

• Rapid Sensor Tech. Advancement
• Reduced cost of data storage
• Increase in ability to analyse data

Page 10: Building Rich Social Network Data

Dimensional Data -vs- Network Data

‘Traditional’ Dimensional Data: Cross-Sectional, Panel Data, Time Series

Social Network Data: ?? No Standard Representation ??

Page 11: Building Rich Social Network Data

What is the Solution

“A schema allows us to represent in a particular way the structure and features of a particular object”

A schema is a mechanism that allows us to define the design, content, and to some extent, the semantics of a dataset.

‘Traditional’ Dimensional Data: Cross-Sectional, Panel Data, Time Series

Social Network Data: ….

Page 12: Building Rich Social Network Data

Approach Taken

TBC

Dataset Wiki: http://dl.ucd.ie

1. Searched for publicly available social network datasets (20-30 different datasets)

2. Accessed datasets & related publications. Reviewed structure and collection approach

3. Created draft schema

4. Added 110 more datasets to analysis. Refined / iterated schema design

5. Published dataset wiki / solicited input from social network analysis community (INSNA)

6. Completed schema design

Page 13: Building Rich Social Network Data

Schema Overview: Structure

Social Network Data Schema

[Schema diagram: empty grid of schema components]

Page 14: Building Rich Social Network Data

Schema Overview: Minimal Representation

Social Network Data Schema

[Schema diagram: Node Represents | Edge Represents]

Overview:
• What does a node represent (Individuals? Employees? Researchers? Firms? Organisations? Countries? Political positions?)
• What does an edge represent (Friendship? Communication? Interaction?)

Examples:
• UK MPs on Twitter. Nodes: Personal Twitter Accounts. Edges: Mentions
• Co-authorship in network science. Nodes: Academic Journal Authors. Edges: Co-Authorship
• Infectious SocioPatterns. Nodes: Visitors to Science Gallery. Edges: Face-to-face proximity
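As a minimal sketch, the two questions above can be recorded as one small record per dataset. The `MinimalSchema` type and the `describe` helper are hypothetical names introduced here for illustration; the three entries simply restate the examples on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalSchema:
    """Smallest useful description of a network dataset (hypothetical type)."""
    dataset: str
    node_represents: str
    edge_represents: str


# The three example datasets from the slide, restated as records.
EXAMPLES = [
    MinimalSchema("UK MPs on Twitter", "personal Twitter accounts", "mentions"),
    MinimalSchema("Co-authorship in network science",
                  "academic journal authors", "co-authorship"),
    MinimalSchema("Infectious SocioPatterns",
                  "visitors to Science Gallery", "face-to-face proximity"),
]


def describe(schema: MinimalSchema) -> str:
    """Render the minimal representation as one readable line."""
    return (f"{schema.dataset}: nodes = {schema.node_represents}; "
            f"edges = {schema.edge_represents}")


if __name__ == "__main__":
    for example in EXAMPLES:
        print(describe(example))
```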

Page 15: Building Rich Social Network Data

Schema Overview: Node Types

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types?]

Overview:
• Does the network contain > 1 node type?
• Bipartite networks are a particular class of complex networks whose nodes are divided into two sets X and Y, and only connections between two nodes in different sets are allowed.

Examples:
• Terrorist Network. Node Types: Terrorist, Leader, Politician, Citizen
• Primary School Cumulative Network. Node Types: Teacher, Student. Edge Type: Physical interaction between student and teacher
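The bipartite definition above can be checked mechanically once every node carries a type label. The sketch below is illustrative only: the function name `is_bipartite_by_type` and the node identifiers are assumptions, not part of any dataset named on the slide.

```python
from typing import Dict, Iterable, Tuple


def is_bipartite_by_type(node_type: Dict[str, str],
                         edges: Iterable[Tuple[str, str]]) -> bool:
    """Return True if there are exactly two node types and every edge
    joins two nodes of different types."""
    if len(set(node_type.values())) != 2:
        return False
    return all(node_type[u] != node_type[v] for u, v in edges)


# Hypothetical teacher/student network in the spirit of the school example.
node_type = {"t1": "Teacher", "s1": "Student", "s2": "Student"}
teacher_student_edges = [("t1", "s1"), ("t1", "s2")]
with_student_tie = teacher_student_edges + [("s1", "s2")]

if __name__ == "__main__":
    print(is_bipartite_by_type(node_type, teacher_student_edges))
    print(is_bipartite_by_type(node_type, with_student_tie))
```

A student-to-student tie breaks the property, which is why the schema asks about node types and bipartiteness separately.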

Page 16: Building Rich Social Network Data

Schema Overview: Edge Types

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-)]

Overview:
• Does the network contain > 1 edge type?
• Are these edges:
  • directed?
  • undirected?
  • weighted (e.g. strength / frequency)?
  • signed (e.g. positive / negative)?

Examples:
• The Policy Network of Toxic Chemicals Regulation in Germany in the 1980s. Edge Types: Shared Committee Membership, Information Exchange
• Students data sets (van de Bunt). Edge Types: Unknown, best friend, friend, friendly relation, neutral, troubled relation, item non-response, actor non-response

Page 17: Building Rich Social Network Data

Schema Overview: Edge Types

Examples:
• Enron Email Dataset. Nodes: Senior Enron Employees. Edge Types: Email Sent, Email Received. Weight: # of emails sent
• Dining-table partners in a girls' dormitory at a New York State training school. Nodes: Girls in a New York State dormitory. Edge Types: Preferred dining partner. Weight: Order of preference
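To make the directed / undirected / weighted distinction concrete, here is a small sketch that turns raw (sender, recipient) records into weighted directed edges, and then shows what is lost when direction is dropped. The function names and the toy message list are assumptions for illustration; this is not how the Enron dataset itself was processed.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def weighted_directed_edges(messages: Iterable[Edge]) -> Dict[Edge, int]:
    """Collapse (sender, recipient) records into directed edges whose
    weight is the number of messages sent in that direction."""
    return dict(Counter(messages))


def to_undirected(edges: Dict[Edge, int]) -> Dict[Edge, int]:
    """Drop direction: merge (a, b) and (b, a) and sum their weights."""
    merged: Counter = Counter()
    for (u, v), weight in edges.items():
        merged[tuple(sorted((u, v)))] += weight
    return dict(merged)


# Hypothetical message log.
messages = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
directed = weighted_directed_edges(messages)
undirected = to_undirected(directed)

if __name__ == "__main__":
    print(directed)
    print(undirected)
```

Recording the directed, weighted form keeps the option of deriving the simpler forms later; the reverse is not possible.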

Page 18: Building Rich Social Network Data

Schema Overview: Node Attributes / Communities

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-) | Node Attributes | Communities]

Overview:
• Do nodes have attributes?
• Are these attributes static (e.g. gender) or dynamic (e.g. smoking preference)?
• Do the nodes belong to some known community?

Examples:
• Lawyers data (Lazega). Node Attributes: seniority, formal status, office in which they work, gender, law school attended, individual performance measurements (hours worked, fees brought in), attitudes concerning management policy
• Irish Politicians & Organisations on Twitter. Communities: Political Affiliation (Fine Gael, Fianna Fáil, Labour, Sinn Féin, …)
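One possible way to keep static and dynamic attributes apart is sketched below: static attributes are stored once, while each dynamic attribute is a time-ordered list of observations. The `Node` type, its field names, and the integer time stamps are hypothetical choices made for this example.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    # Attributes that do not change over the study period (e.g. gender).
    static: Dict[str, Any] = field(default_factory=dict)
    # Attribute name -> observations sorted by time: [(time, value), ...].
    dynamic: Dict[str, List[Tuple[int, Any]]] = field(default_factory=dict)
    # Known community membership, if any (e.g. political affiliation).
    community: Optional[str] = None

    def value_at(self, attribute: str, time: int) -> Any:
        """Most recent observation of a dynamic attribute at or before
        `time`, or None if nothing had been observed yet."""
        observations = self.dynamic.get(attribute, [])
        times = [t for t, _ in observations]
        index = bisect_right(times, time)
        return observations[index - 1][1] if index else None


# Hypothetical node: smoking status observed at time 1 and again at time 3.
node = Node("n1", static={"gender": "F"},
            dynamic={"smoker": [(1, False), (3, True)]},
            community="Labour")

if __name__ == "__main__":
    print(node.value_at("smoker", 2))
```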

Page 19: Building Rich Social Network Data

Schema Overview: Dynamic Data

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-) | Node Attributes | Communities | Dynamic]

Overview:
• Is the Network Dataset Dynamic?
• If Dynamic, is the type of temporal data:
  • Event Driven?
  • Continuous / Realtime?
  • Periodic Snapshots?

Examples:
• Kapferer Tailor Shop: Interactions recorded at two different time points seven months apart; a strike happened in between (snapshot)
• Southern Women Network: Contains the observed attendance at 14 social events by 18 Southern women (event driven)

Page 20: Building Rich Social Network Data

Schema Overview: Dynamic Data

Examples:
• Norwegian Boards (Aug09): Board membership evolution from 1999 to 2009 (continuous or real-time)
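The three temporal types are not interchangeable: event-driven data can be aggregated into periodic snapshots, but snapshots cannot be expanded back into events. A minimal sketch of that one-way conversion follows, assuming integer time stamps and a hypothetical `(timestamp, source, target)` event format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, str, str]  # (timestamp, source, target), assumed format


def periodic_snapshots(events: Iterable[Event],
                       period: int) -> Dict[int, Set[Tuple[str, str]]]:
    """Group timestamped events into windows of `period` time units.

    Key k holds the set of edges observed in [k * period, (k + 1) * period).
    Repeated events inside one window collapse into a single edge.
    """
    snapshots: Dict[int, Set[Tuple[str, str]]] = defaultdict(set)
    for timestamp, source, target in events:
        snapshots[timestamp // period].add((source, target))
    return dict(snapshots)


# Hypothetical event log.
events = [(0, "a", "b"), (5, "a", "b"), (12, "b", "c")]
snapshots = periodic_snapshots(events, period=10)

if __name__ == "__main__":
    print(snapshots)
```

Collecting at the finest temporal resolution that is affordable therefore keeps more analysis options open.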

Page 21: Building Rich Social Network Data

Schema Overview: Parallel Data

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-) | Node Attributes | Communities | Dynamic | Parallel]

Overview:
• Does the Network come with Parallel Data?
• Is this parallel data time-series?
• What is the relationship of this parallel data to the network data?

Examples:
• Wiki-Vote. Nodes: Wikipedia Editors. Edges: Voting Behaviour. Parallel Data: Vote outcome
• MathSciNet: Co-authorship network. Nodes: Journal Article Authors. Edges: Co-authorship. Parallel Data: Detailed information about MathSciNet papers: numerical IDs of papers, authors, and categories

Page 22: Building Rich Social Network Data

Schema Overview: Parallel Data

Examples:
• Extended Epinions dataset. Nodes: Consumers on trust site Epinions.com. Edges: Trust / Distrust. Parallel Data: Details of all product reviews hosted on the Epinions website
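The question of how parallel data relates to the network often comes down to a key that links each parallel record to a node. The sketch below assumes a hypothetical record layout (dictionaries with an `author` field) and a made-up helper name; it separates records that can be linked to a known node from those that cannot.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]


def link_parallel_records(
    nodes: Set[str], records: Iterable[Record], node_key: str
) -> Tuple[Dict[str, List[Record]], List[Record]]:
    """Split parallel records into those referencing a known node
    (grouped by node) and those that reference no node in the network."""
    linked: Dict[str, List[Record]] = {}
    orphaned: List[Record] = []
    for record in records:
        node = record.get(node_key)
        if node in nodes:
            linked.setdefault(node, []).append(record)
        else:
            orphaned.append(record)
    return linked, orphaned


# Hypothetical network nodes and product reviews.
nodes = {"u1", "u2"}
reviews = [
    {"author": "u1", "product": "kettle"},
    {"author": "u1", "product": "toaster"},
    {"author": "u9", "product": "radio"},
]
linked, orphaned = link_parallel_records(nodes, reviews, node_key="author")

if __name__ == "__main__":
    print(sorted(linked), len(orphaned))
```

A large share of orphaned records is itself worth noting in the collection metadata.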

Page 23: Building Rich Social Network Data

Schema Overview: Metadata

Social Network Data Schema

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-) | Node Attributes | Communities | Dynamic | Parallel | Collection Metadata]

Overview:
• What are the network boundary conditions?
• Does the network have missing data?
• Does this missing data have a pattern?
• Was the data sampled / sub-selected from a larger dataset?
• What was the process for sampling?

Examples:
• Newcomb Fraternity: 15 weekly sociometric preference rankings from 17 men attending the University of Michigan in the fall of 1956; data from week 9 are missing.
• Enron Email Dataset (Boundary Conditions)

Page 24: Building Rich Social Network Data

Schema Overview: Metadata

Examples:
• Yahoo! Messenger User Communication Pattern: Dataset contains a small sample of the Yahoo! Messenger community's communication (IM) log at a high level for a period of 4 weeks. Specifically, this dataset only records the first communication event from one user to another on a particular day, and generates such records for a period of 28 days.
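The collection rule described above ("only the first communication event from one user to another on a particular day") can be sketched as a filter over a raw log. This is an illustration of the rule as stated, not the actual Yahoo! pipeline; the record format (seconds since the start of collection, sender, recipient) and the function name are assumptions.

```python
from typing import Iterable, List, Tuple

Record = Tuple[int, str, str]  # (seconds since start, sender, recipient)
SECONDS_PER_DAY = 86_400


def first_contact_per_day(log: Iterable[Record]) -> List[Record]:
    """Keep only the first record per (day, sender, recipient).

    Assumes `log` is already sorted by time.
    """
    seen = set()
    kept: List[Record] = []
    for timestamp, sender, recipient in log:
        key = (timestamp // SECONDS_PER_DAY, sender, recipient)
        if key not in seen:
            seen.add(key)
            kept.append((timestamp, sender, recipient))
    return kept


# Hypothetical raw log: two a->b messages on day 0, one on day 1.
raw_log = [(10, "a", "b"), (20, "a", "b"), (30, "b", "a"), (86_405, "a", "b")]
filtered = first_contact_per_day(raw_log)

if __name__ == "__main__":
    print(filtered)
```

After such a filter, message counts within a day are no longer recoverable, so any weight derived from the data is at most "days with contact". This is the kind of consequence the collection metadata is meant to make visible.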

Page 25: Building Rich Social Network Data

Social Network Data Schema (1 page overview)

A schema is a way to define the structure, content, and to some extent, the semantics of a dataset

[Schema diagram: Node Represents | Is bipartite? | Edge Represents | Multiple Node Types? | Multiple Edge Types? (=> -=> w +/-) | Node Attributes | Communities | Dynamic | Parallel | Collection Metadata]

Node Represents / Edge Represents:
• What does a node represent (Individuals? Employees? Researchers? Firms? Organisations? Countries? Political positions?)
• What does an edge represent (Friendship? Communication? Interaction?)

Multiple Node Types? / Is bipartite?:
• Does the network contain > 1 node type?
• Is the network bipartite, where ties can only exist between nodes of two different groups?

Multiple Edge Types?:
• Does the network contain > 1 edge type?
• Are these edges:
  • directed? / undirected?
  • weighted (e.g. strength / frequency) or signed (e.g. pos. / neg.)?

Node Attributes / Communities:
• Do nodes have attributes? / Are these attributes static or dynamic?
• Do the nodes belong to some known community?

Dynamic:
• Is the Network Dataset Dynamic?
• If Dynamic, is the type of temporal data:
  • Event Driven? / Continuous / Realtime? / Periodic Snapshots?

Collection Metadata:
• Boundary Conditions? Missing Data?
• Sampled from larger dataset? Sampling

Eamonn O’Loughlin, Dynamics Lab, UCD ([email protected])
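The one-page overview can be read as a checklist. The sketch below encodes it as a single record in which `None` means "not yet answered", plus a helper that lists the open questions for a dataset. All type, field, and function names are hypothetical, and the field list is one possible reading of the schema diagram rather than a definitive encoding.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class TemporalType(Enum):
    EVENT_DRIVEN = "event driven"
    CONTINUOUS = "continuous / realtime"
    PERIODIC_SNAPSHOTS = "periodic snapshots"


@dataclass
class NetworkDataSchema:
    # None means the question has not been answered for this dataset.
    node_represents: Optional[str] = None
    edge_represents: Optional[str] = None
    multiple_node_types: Optional[bool] = None
    is_bipartite: Optional[bool] = None
    multiple_edge_types: Optional[bool] = None
    directed: Optional[bool] = None
    weighted: Optional[bool] = None
    signed: Optional[bool] = None
    node_attributes: Optional[List[str]] = None
    communities: Optional[str] = None
    dynamic: Optional[bool] = None
    temporal_type: Optional[TemporalType] = None
    parallel_data: Optional[str] = None
    boundary_conditions: Optional[str] = None
    missing_data: Optional[str] = None
    sampling: Optional[str] = None


def unanswered(schema: NetworkDataSchema) -> List[str]:
    """List schema questions still open for a dataset."""
    missing = [f.name for f in fields(schema)
               if getattr(schema, f.name) is None]
    # A static network has no temporal type, so that question is not open.
    if schema.dynamic is False and "temporal_type" in missing:
        missing.remove("temporal_type")
    return missing


# Hypothetical, partially documented dataset.
partial = NetworkDataSchema(node_represents="senior employees",
                            edge_represents="email",
                            directed=True, weighted=True, dynamic=False)

if __name__ == "__main__":
    print(unanswered(partial))
```

Run against a dataset's documentation before collection or reuse, a list like this shows which design decisions are still implicit.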

Page 26: Building Rich Social Network Data

Proposed Use of Schema

Direct Observation / Survey

Page 27: Building Rich Social Network Data

Proposed Use of Schema

Retrieving Data (subset) from an existing system

Page 28: Building Rich Social Network Data

Proposed Use of Schema

Identifying / Assessing publicly available data

Page 29: Building Rich Social Network Data

Thank You

Questions?

Reach me at [email protected]