Building Rich Social Network Data
Schema to aid designing, collecting and evaluating social network data
Eamonn O’Loughlin, [email protected]
University College Dublin
Diane Payne,
University College Dublin
Why Social Networks
Social interactions and social networks are an enduring component of our everyday lives.
Social Networks are (among other things):
• Basis upon which information and behaviours diffuse through a population
• Cornerstone for trade and cooperation
• Key component in determining the languages we speak, goals we aspire to, and values we hold
Background & Motivation
Eamonn O’Loughlin: Early-stage PhD researcher in the Dynamics Lab, with an interest in predictive modelling of social behaviour using social position and structure. Also interested in large network visualisations and policy design that recognises / leverages network effects.
Motivation: Using Social Network Analysis techniques to uncover patterns and relationships between network structure/activity and micro-level network outcomes (individual actions or decisions).
[Timeline figure (axis: time): Motivation + Intuition + Problem + Hypothesis => Data Strategy / Design & Collection => Hypothesis Evaluation => Conclusion]
Background & Motivation
Intended Audience: Researchers who are / will be creating a social network dataset
(1) Precautionary: Nobody wants to realise, after their data has already been collected, that they didn’t consider some easy-to-collect yet suddenly vital-for-analysis feature
(2) Not Straightforward: Social network data & data design is complex. Compared to traditional multi-dimensional data, there are many different assumptions that must be made and (as we will see) quite a few trade-offs
[Timeline figure (axis: time): Motivation + Intuition + Problem + Hypothesis => Data Strategy / Design & Collection => Hypothesis Evaluation => Conclusion, with a marker labelled “Today’s focus”]
(not covered today: domain of analysis & specific domain challenges)
What is Social Network Data
“Social network [analysis] views social relationships in terms of network theory, consisting of nodes (representing individual actors within the network) and ties (which represent relationships between the individuals)”
Brief (Subjective) History of Social Network Analysis
Researchers: J. A. Barnes, Stephen Borgatti, S. D. Berkowitz, Ronald Burt, Kathleen Carley, Martin Everett, Katherine Faust, Linton Freeman, Mark Granovetter, David Knoke, David Krackhardt, Peter Marsden, Nicholas Mullins, Anatol Rapoport, Stanley Wasserman, Barry Wellman, Douglas R. White, Harrison White
Themes and contributions shown alongside them: Social Capital (& structural holes); Dynamic Network Analysis; Social Constructs / Persistent Social Formations; Social Network Visualisation; Diffusion of Innovation; Social Networks & the Internet; Network Realism; UCINet; Communication, co-authorship, and colleagueship; Inter-organisational political networks & Terrorist Networks; ‘The Strength of Weak Ties’ (economic networks); Comparative Network Methods; Social Structure & Cognition; Social Structures; Statistical Models for Social Networks; Formal Organisations & Social Networks; Consensus Analysis; Social Networks as a Science
• Tom A. B. Snijders: Multilevel Analysis & SIENA
• Garry Robins: ERGMs
• Nan Lin: Network Theory of Social Capital
• Karen Cook: Exchange & Trust
Why is this a Problem
Impossible to collect perfect & complete data => trade-offs are being made
• Privacy concerns
• Network data collection is expensive
• Many design decisions
• Different practitioners
• Difficult to sample network data
• Rapid sensor tech. advancement
• Reduced cost of data storage
• Increase in ability to analyse data
A social network data schema will help make this process explicit and transparent
Dimensional Data -vs- Network Data
‘Traditional’ Dimensional Data:
• Cross-Sectional
• Panel Data
• Time Series
Social Network Data:
• No standard representation?
What is the Solution
“A schema allows us to represent in a particular way the structure and features of a particular object”
A schema is a mechanism that allows us to define the design, content, and to some extent, the semantics of a dataset.
[Diagram: ‘Traditional’ dimensional data (cross-sectional, panel data, time series) shown beside social network data, with the boxes for network data still blank]
Approach Taken
TBC
Dataset Wiki: http://dl.ucd.ie
1. Searched for publicly available social network datasets (20-30 different datasets)
2. Accessed datasets & related publications. Reviewed structure and collection approach
3. Created draft schema
4. Added 110 more datasets to analysis. Refined / iterated schema design
5. Published dataset wiki / solicited input from social network analysis community (INSNA)
6. Completed schema design
Schema Overview: Structure
Social Network Data Schema
[Diagram: blank schema template, all boxes empty]
Schema Overview: Minimal Representation
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents
Overview:
• What does a node represent (individuals? employees? researchers? firms? organisations? countries? political positions?)
• What does an edge represent (friendship? communication? interaction?)
Examples:
• UK MPs on Twitter. Nodes: personal Twitter accounts. Edges: mentions.
• Co-authorship in network science. Nodes: academic journal authors. Edges: co-authorship.
• Infectious SocioPatterns. Nodes: visitors to Science Gallery. Edges: face-to-face proximity.
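The minimal representation can be written down as a small record per dataset. The sketch below is only an illustration: the class name `MinimalSchema` and its field names are hypothetical, not part of the schema as published, and the entries are the three example datasets from this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalSchema:
    """Hypothetical record holding the two minimal schema fields."""
    dataset: str
    node_represents: str
    edge_represents: str


# The three example datasets from the slide.
EXAMPLES = [
    MinimalSchema("UK MPs on Twitter", "Personal Twitter accounts", "Mentions"),
    MinimalSchema("Co-authorship in network science",
                  "Academic journal authors", "Co-authorship"),
    MinimalSchema("Infectious SocioPatterns",
                  "Visitors to Science Gallery", "Face-to-face proximity"),
]


def describe(schema: MinimalSchema) -> str:
    """One-line summary of what nodes and edges mean in a dataset."""
    return (f"{schema.dataset}: nodes = {schema.node_represents}; "
            f"edges = {schema.edge_represents}")


for s in EXAMPLES:
    print(describe(s))
```

Keeping both fields mandatory mirrors the idea that a dataset is not interpretable until these two questions are answered.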
Schema Overview: Node Types
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?
Overview:
• Does the network contain more than one node type?
• Is the network bipartite? Bipartite networks are a particular class of complex networks whose nodes are divided into two sets X and Y, and only connections between two nodes in different sets are allowed.
Examples:
• Terrorist Network. Node types: Terrorist, Leader, Politician, Citizen.
• Primary School Cumulative Network. Node types: Teacher, Student. Edge type: physical interaction between student and teacher.
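The bipartite condition above can be checked mechanically once every node carries a type label. This is a minimal sketch under that assumption; the function name `is_bipartite_by_type` and the node identifiers are made up for illustration and do not come from any of the datasets.

```python
from typing import Dict, Iterable, Tuple


def is_bipartite_by_type(node_type: Dict[str, str],
                         edges: Iterable[Tuple[str, str]]) -> bool:
    """True if there are exactly two node types and every edge
    joins two nodes of different types."""
    if len(set(node_type.values())) != 2:
        return False
    return all(node_type[u] != node_type[v] for u, v in edges)


# Hypothetical teacher/student labels in the spirit of the school example.
node_type = {"t1": "Teacher", "s1": "Student", "s2": "Student"}

teacher_student_only = [("t1", "s1"), ("t1", "s2")]
with_student_student = [("t1", "s1"), ("s1", "s2")]

print(is_bipartite_by_type(node_type, teacher_student_only))   # True
print(is_bipartite_by_type(node_type, with_student_student))   # False
```

Note that this tests the labelled partition only; it does not search for some other two-way split of the nodes that would also satisfy the definition.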
Schema Overview: Edge Types
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed)
Overview:
• Does the network contain more than one edge type?
• Are these edges:
  • directed?
  • undirected?
  • weighted (e.g. strength / frequency)?
  • signed (e.g. positive / negative)?
Examples:
• The Policy Network of Toxic Chemicals Regulation in Germany in the 1980s. Edge types: Shared Committee Membership, Information Exchange.
• Students data sets (van de Bunt). Edge types: unknown, best friend, friend, friendly relation, neutral, troubled relation, item non-response, actor non-response.
Schema Overview: Edge Types (continued)
Examples:
• Enron Email Dataset. Nodes: senior Enron employees. Edge types: Email Sent, Email Received. Weight: number of emails sent.
• Dining-table partners in a girls’ dormitory at a New York State training school. Nodes: girls in a New York State dormitory. Edge types: preferred dining partner. Weight: order of preference.
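The edge questions (type, direction, weight, sign) map naturally onto fields of an edge record. The sketch below assumes a hypothetical `Edge` class and an invented three-message log; it shows one way a weight such as "number of emails sent" could be derived by counting repeated sender/receiver pairs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record covering the edge-type questions."""
    source: str
    target: str
    edge_type: str
    directed: bool = True
    weight: float = 1.0
    sign: Optional[int] = None  # +1 / -1 for signed networks, else None


# Invented (sender, receiver) pairs, one per email.
emails = [("alice", "bob"), ("alice", "bob"), ("bob", "alice")]

# Weight each directed edge by how many emails were sent along it.
counts = Counter(emails)
edges = [
    Edge(sender, receiver, "Email Sent", directed=True, weight=float(n))
    for (sender, receiver), n in sorted(counts.items())
]

for e in edges:
    print(e.source, "->", e.target, e.edge_type, e.weight)
```

A ranked preference (as in the dining-table example) could reuse the same `weight` field, with the caveat that the weight would then be ordinal rather than a count.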
Schema Overview: Node Attributes / Communities
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed), Node Attributes, Communities
Overview:
• Do nodes have attributes?
• Are these attributes static (e.g. gender) or dynamic (e.g. smoking preference)?
• Do the nodes belong to some known community?
Examples:
• Lawyers data (Lazega). Node attributes: seniority, formal status, office in which they work, gender, law school attended, individual performance measurements (hours worked, fees brought in), attitudes concerning management policy.
• Irish Politicians & Organisations on Twitter. Communities: political affiliation (Fine Gael, Fianna Fáil, Labour, Sinn Féin, …).
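The static / dynamic distinction affects how attributes are stored: a static attribute needs one value, a dynamic one needs a value per observation time. A minimal sketch, assuming a hypothetical `Node` class, integer time stamps, and observations listed in time order; the example node and its values are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical node record with static and dynamic attributes."""
    node_id: str
    static: Dict[str, str] = field(default_factory=dict)
    # attribute name -> [(time, value), ...], assumed sorted by time
    dynamic: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)
    community: Optional[str] = None

    def value_at(self, attr: str, t: int) -> Optional[str]:
        """Latest recorded value of a dynamic attribute at or before t."""
        latest = None
        for when, value in self.dynamic.get(attr, []):
            if when <= t:
                latest = value
        return latest


# Invented example: gender is static, smoking status changes at time 3.
n = Node("n1",
         static={"gender": "F"},
         dynamic={"smoker": [(1, "no"), (3, "yes")]},
         community="Labour")

print(n.value_at("smoker", 2))  # no
print(n.value_at("smoker", 3))  # yes
print(n.value_at("smoker", 0))  # None
```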
Schema Overview: Dynamic Data
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed), Node Attributes, Communities, Dynamic
Overview:
• Is the network dataset dynamic?
• If dynamic, is the temporal data:
  • event driven?
  • continuous / real-time?
  • periodic snapshots?
Examples:
• Kapferer Tailor Shop. Interactions recorded at two different time points seven months apart; a strike happened in between (snapshot).
• Southern Women Network. Observed attendance at 14 social events by 18 Southern women (event driven).
Schema Overview: Dynamic Data (continued)
Examples:
• Norwegian Boards (Aug09). Board membership evolution from 1999 to 2009 (continuous or real-time).
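The temporal types are related: event-driven or continuous data can be aggregated into periodic snapshots, but not the other way round. The sketch below illustrates that aggregation with fixed-width windows; the function name `to_snapshots`, the integer time stamps, and the events are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[int, str, str]  # (timestamp, source, target)


def to_snapshots(events: Iterable[Event],
                 window: int) -> Dict[int, Set[Tuple[str, str]]]:
    """Group timestamped edge events into fixed-width windows.

    Window k holds every edge seen at a time t with k*window <= t < (k+1)*window.
    """
    snaps: Dict[int, Set[Tuple[str, str]]] = defaultdict(set)
    for t, u, v in events:
        snaps[t // window].add((u, v))
    return dict(snaps)


# Invented event stream.
events = [(0, "a", "b"), (3, "a", "b"), (7, "b", "c"), (12, "a", "c")]
snaps = to_snapshots(events, window=5)

for k in sorted(snaps):
    print(k, sorted(snaps[k]))
```

Repeated events inside one window collapse to a single edge here, which is exactly the kind of information loss a schema entry for the temporal type should make visible.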
Schema Overview: Parallel Data
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed), Node Attributes, Communities, Dynamic, Parallel
Overview:
• Does the network come with parallel data?
• Is this parallel data time-series?
• What is the relationship of this parallel data to the network data?
Examples:
• Wiki-Vote. Nodes: Wikipedia editors. Edges: voting behaviour. Parallel data: vote outcome.
• MathSciNet co-authorship network. Nodes: journal article authors. Edges: co-authorship. Parallel data: detailed information about MathSciNet papers (numerical IDs of papers, authors, and categories).
Schema Overview: Parallel Data (continued)
Examples:
• Extended Epinions dataset. Nodes: consumers on trust site Epinions.com. Edges: trust / distrust. Parallel data: details of all product reviews hosted on the Epinions website.
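One common relationship between parallel data and the network is a shared node identifier. The sketch below assumes that kind of link for a trust / distrust network with product reviews; the function name, record layout, and all values are invented and do not reflect the actual Epinions file format.

```python
from typing import Dict, List, Tuple

SignedEdge = Tuple[str, str, int]  # (truster, trustee, +1 trust / -1 distrust)

# Invented signed edges and review records keyed by the same user ids.
edges: List[SignedEdge] = [("u1", "u2", +1), ("u1", "u3", -1), ("u2", "u3", +1)]
reviews: List[Dict[str, object]] = [
    {"author": "u2", "product": "p9", "rating": 4},
    {"author": "u3", "product": "p9", "rating": 1},
    {"author": "u3", "product": "p4", "rating": 5},
]


def reviews_by_trusted(user: str,
                       edges: List[SignedEdge],
                       reviews: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Reviews written by anyone the given user explicitly trusts."""
    trusted = {dst for src, dst, sign in edges if src == user and sign > 0}
    return [r for r in reviews if r["author"] in trusted]


print(reviews_by_trusted("u1", edges, reviews))
print(reviews_by_trusted("u2", edges, reviews))
```

If the parallel data were keyed by edge or by time instead, the join would change accordingly, which is why the schema asks for the relationship to be stated.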
Schema Overview: Metadata
Social Network Data Schema (boxes filled in so far): Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed), Node Attributes, Communities, Dynamic, Parallel, Collection Metadata
Overview:
• What are the network boundary conditions?
• Does the network have missing data?
• Does this missing data have a pattern?
• Was the data sampled / sub-selected from a larger dataset?
• What was the process for sampling?
Examples:
• Newcomb Fraternity. 15 weekly sociometric preference rankings from 17 men attending the University of Michigan in the fall of 1956; data from week 9 are missing.
• Enron Email Dataset (boundary conditions).
Schema Overview: Metadata (continued)
Examples:
• Yahoo! Messenger User Communication Pattern. The dataset contains a small sample of the Yahoo! Messenger community's communication (IM) log at a high level for a period of 4 weeks. Specifically, it records only the first communication event from one user to another on a particular day, and generates such records for a period of 28 days.
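A sampling rule like the one described for the Yahoo! Messenger data ("only the first communication event from one user to another on a particular day") can be stated precisely in a few lines. This is a sketch of that rule as described on the slide, not the procedure actually used to build the dataset; the record layout and the log are invented, and the log is assumed to be in chronological order.

```python
from typing import Iterable, List, Set, Tuple

Message = Tuple[int, str, str]  # (day, sender, receiver)


def first_contact_per_day(log: Iterable[Message]) -> List[Message]:
    """Keep only the first message from each sender to each receiver per day."""
    seen: Set[Message] = set()
    kept: List[Message] = []
    for day, sender, receiver in log:
        key = (day, sender, receiver)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Invented log: a messages b twice on day 1, b replies, a messages b on day 2.
log = [(1, "a", "b"), (1, "a", "b"), (1, "b", "a"), (2, "a", "b")]
sampled = first_contact_per_day(log)
print(sampled)
```

Recording the rule in the collection metadata matters because it caps any per-day edge weight at one, so message frequency within a day cannot be recovered from the sampled data.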
Social Network Data Schema (1 page overview)
A schema is a way to define the structure, content, and to some extent, the semantics of a dataset
Schema boxes: Node Represents, Edge Represents, Is bipartite?, Multiple Node Types?, Multiple Edge Types? (directed, undirected, weighted, signed), Node Attributes, Communities, Dynamic, Parallel, Collection Metadata
• What does a node represent (individuals? employees? researchers? firms? organisations? countries? political positions?)
• What does an edge represent (friendship? communication? interaction?)
• Does the network contain more than one node type?
• Is the network bipartite, where ties can only exist between nodes of two different groups?
• Does the network contain more than one edge type?
• Are these edges:
  • directed? / undirected?
  • weighted (e.g. strength / frequency) or signed (e.g. pos. / neg.)?
• Do nodes have attributes? / Are these attributes static or dynamic?
• Do the nodes belong to some known community?
• Is the network dataset dynamic?
• If dynamic, is the temporal data:
  • event driven? / continuous / real-time? / periodic snapshots?
• Boundary conditions? Missing data?
• Sampled from larger dataset? Sampling process?
Eamonn O’Loughlin, Dynamics Lab, UCD ([email protected])
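The one-page overview can be used as a checklist: each box is a field, and any field left unanswered is a design decision still to be made. The sketch below is one hypothetical encoding of that idea; the class `NetworkDataSchema`, its field names, and the helper `unanswered` are assumptions, and the Enron entry fills in only what the slides state.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class NetworkDataSchema:
    """Hypothetical one-record encoding of the schema boxes.

    None means the question has not been answered yet.
    """
    node_represents: Optional[str] = None
    edge_represents: Optional[str] = None
    is_bipartite: Optional[bool] = None
    multiple_node_types: Optional[bool] = None
    multiple_edge_types: Optional[bool] = None
    directed: Optional[bool] = None
    weighted: Optional[bool] = None
    signed: Optional[bool] = None
    node_attributes: Optional[str] = None
    communities: Optional[str] = None
    dynamic: Optional[str] = None  # e.g. "event driven", "periodic snapshots"
    parallel_data: Optional[str] = None
    collection_metadata: Optional[str] = None


def unanswered(schema: NetworkDataSchema) -> List[str]:
    """Names of the schema questions that still have no answer."""
    return [f.name for f in fields(schema) if getattr(schema, f.name) is None]


# Only the details given on the Enron slide are filled in.
enron = NetworkDataSchema(
    node_represents="Senior Enron employees",
    edge_represents="Email sent / email received",
    multiple_edge_types=True,
    weighted=True,
)

print(unanswered(enron))
```

Running the checklist before collection starts is one way to act on the precautionary motivation given earlier: gaps show up while they are still cheap to close.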
Proposed Use of Schema
• Direct observation / survey
• Retrieving data (subset) from an existing system
• Identifying / assessing publicly available data