Business Data Analytics
Lecture 8
MTAT.03.319
The slides are available under a Creative Commons license. The original owner of these slides is the University of Tartu.
Network/Link Analysis
Understanding graph structures in business settings.
Last Lecture: Remember the Enron Email Corpus?
• This corpus is valuable to computer scientists and social-network theorists in ways that the e-mails’ authors and recipients never could have intended. Because it is a rich example of how real people in a real organization use e-mail—full of mundane lunch plans, boring meeting notes, embarrassing flirtations that revealed at least one extramarital affair, and the damning missives that spelled out corruption—it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.
• This research has had widespread applications: computer scientists have used the corpus to train systems that automatically prioritize certain messages in an inbox and alert users that they may have forgotten about an important message. Other researchers use the Enron corpus to develop systems that automatically organize or summarize messages. Much of today’s software for fraud detection, counterterrorism operations, and mining workplace behavioral patterns over e-mail has been somehow touched by the data set.
https://www.technologyreview.com/s/515801/the-immortal-life-of-the-enron-e-mails/
Enron Network
Source: Jana Diesner, Kathleen M. Carley. Exploration of Communication Networks from the Enron Email Corpus
[Figure: Enron communication network, Year: 2000 (left) and Year: 2001 (right)]
Year   Density   # Disconnected components
2000   0.018     96
2001   0.031     39
16 Families Relation (Florence, Italy)
Marriage relations between 16 important families in Florence during the 15th century
Period during which the Medici rose to power
Some key marriages were engineered by Cosimo de' Medici
How do we measure the importance of families?
Can this network explain how the Medici family rose to power?
Trading
Source: http://www.cepii.fr/PDF_PUB/wp/2013/wp2013-24.pdf
Mobile Network
[Figure: two call graphs. Left (nodes 1-3): lengthy calls but to fewer users. Right (nodes 1-6): short calls but to a large number of users.]
Mobile Network: How to select influential users?
[Figure: call graphs for lengthy calls to fewer users versus short calls to a large number of users, with the influential user marked.]
Social Capital
• Networks of relationships among people who live and work in a particular society, enabling that society to function effectively.
• Social capital refers to an individual’s social network and the resources embedded within the networks that can benefit the individual in terms of achieving their goals and facilitating their actions.
• It is context dependent
  • Example: a company is looking for a sales manager or for another Java expert.
Customers’ spending behavior
• Intersection between social behavior and income levels.
• Localities (cell tower areas) with diverse network interactions tend to have higher economic development.
• People with higher diversity in social contacts tend to have higher incomes.
• A second line of investigation has focused on using homophily and social closeness to predict the products of interest to individuals
• Easily available data on prospects, such as demographics and sociographic factors often have limited ability to predict future spending behavior.
• Highly social people are also likely to earn higher wages, find better jobs, and live healthier lives.
• There is growing evidence that social behavior is a fundamental human characteristic that affects multiple aspects of human life.
Source: Vivek K. Singh, Laura Freeman, Bruno Lepri, Alex (Sandy) Pentland. Classifying Spending Behavior using Socio-Mobile Data
Understanding Networks through graph theory
• Terminologies and basics
• Networks can be represented using graphs, G(N, E).
• Nodes (N): Set of entities
  • Examples:
    • Users in Facebook
    • Proteins in a protein-protein network
    • Students in a homework-homework network
• Edges (E): Set of connections.
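The G(N, E) representation above can be sketched in plain Python as a set of nodes plus a set of edges, turned into an adjacency map. The node names and the `build_adjacency` helper are hypothetical and chosen only for illustration; the graph is assumed to be undirected.

```python
def build_adjacency(nodes, edges):
    """Return {node: set of neighbors} for an undirected graph G(N, E)."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        # An undirected edge is stored in both directions.
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

# Hypothetical toy network: N is the set of entities, E the set of connections.
N = {"Ann", "Bob", "Cid", "Dee"}
E = {("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Ann"), ("Cid", "Dee")}

adjacency = build_adjacency(N, E)
for node in sorted(adjacency):
    print(node, "->", sorted(adjacency[node]))
```

An adjacency map makes neighbor lookups cheap, which is what most of the measures later in the lecture rely on.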
Categorization of Networks
Directed vs. Undirected
• Directed
  • Ex: Twitter
• Undirected
  • Ex: Facebook
Weighted vs. Unweighted
• Unweighted: All relations are equally important.
• Weighted: Some relations are more important than others.
  • Ex: Some streets are more important than others, based on traffic.
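As a small illustration of the directed and weighted cases, the sketch below stores edges as (source, target, weight) triples and derives in-degree, out-degree, and total outgoing weight. The data and the function names are made up for this example.

```python
# Directed, weighted edges as (source, target, weight) triples.
# Hypothetical data: the weight could be the number of interactions.
edges = [
    ("ann", "bob", 5),
    ("bob", "ann", 1),
    ("cid", "ann", 2),
    ("cid", "bob", 7),
]

def degrees(edge_list):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree, out_degree = {}, {}
    for source, target, _weight in edge_list:
        out_degree[source] = out_degree.get(source, 0) + 1
        in_degree[target] = in_degree.get(target, 0) + 1
    return in_degree, out_degree

def out_strength(edge_list):
    """Weighted analogue of out-degree: total weight leaving each node."""
    total = {}
    for source, _target, weight in edge_list:
        total[source] = total.get(source, 0) + weight
    return total

in_degree, out_degree = degrees(edges)
strength = out_strength(edges)
print(in_degree)   # incoming links per node
print(out_degree)  # outgoing links per node
print(strength)    # total outgoing weight per node
```

In an unweighted network every triple would carry the same weight, and in an undirected network each edge would be counted for both endpoints.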
Local Vs. Global Concepts
Local
• Degree
• Centrality measures of nodes
• Local Clustering coefficient
Global
• Degree Distribution
• Diameter
• Average Path Length
• Density
• Global Clustering Coefficient
• Communities
• Network Topology/Models
• Network robustness
Measures
• Diameter: the greatest distance between any pair of vertices
  • How stretched is the network?
• Average Path Length: find the shortest path between all pairs of nodes, add up their lengths, and divide by the total number of pairs.
  • How many hops it takes on average for a message to reach its destination.
• Density: total edges present / total edges possible in the ideal (fully connected) case (dyadic measure)
• Clique/Complete Graph: a completely connected network, where all nodes are connected to every other node. These networks are symmetric in that all nodes have in-links and out-links from all others.
Source and must read: https://en.wikipedia.org/wiki/Network_science
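The three global measures above can be computed with breadth-first search. This is a minimal sketch that assumes an unweighted, undirected, connected graph; the helper names and the four-node path used as input are illustrative only.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adjacency, start):
    """Hop distance from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def network_measures(adjacency):
    """Return (diameter, average path length, density); graph must be connected."""
    nodes = list(adjacency)
    n = len(nodes)
    all_distances = {u: bfs_distances(adjacency, u) for u in nodes}
    pair_lengths = [all_distances[u][v] for u, v in combinations(nodes, 2)]
    diameter = max(pair_lengths)
    average_path_length = sum(pair_lengths) / len(pair_lengths)
    edge_count = sum(len(nbs) for nbs in adjacency.values()) // 2
    density = edge_count / (n * (n - 1) / 2)  # edges present / edges possible
    return diameter, average_path_length, density

# Hypothetical example: the path a - b - c - d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
diameter, average_path_length, density = network_measures(path)
print(diameter, round(average_path_length, 3), density)
```

For the path a - b - c - d the six pairwise distances sum to 10, so the average path length is 10/6, the diameter is 3, and 3 of 6 possible edges are present.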
Network Structure: Models of networks
Long tail / Zipf's law / Barabási
Network Resilience
• Ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation
• A fundamental property for achieving smartness in cities
• Global Efficiency (GE) represents the ability to efficiently exchange information in the network G
• N: total nodes
• Len(sp(vi, vj)): length of the shortest path between vi and vj
• Vulnerability is the flip side of resilience
  • Limited ability to absorb and react to the strains caused by adverse events
• After the removal of node vi and all the edges incident on it
• Undirected network: neighbors of vi
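The formulas for these two quantities did not survive on the slides, so the sketch below rests on assumptions. Global efficiency is taken to be the commonly used definition, the average of 1 / Len(sp(vi, vj)) over all N(N-1) ordered pairs, with unreachable pairs contributing 0. Vulnerability of vi is assumed to be the relative drop in GE after removing vi and all edges incident on it. The lecture's exact formulas may differ, and the star graph is a made-up example.

```python
from collections import deque

def bfs_distances(adjacency, start):
    """Hop distance from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def global_efficiency(adjacency):
    """Assumed definition: mean of 1/shortest-path length over ordered pairs."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    total = 0.0
    for u in adjacency:
        for v, d in bfs_distances(adjacency, u).items():
            if v != u:
                total += 1.0 / d  # unreachable pairs never appear, so they add 0
    return total / (n * (n - 1))

def remove_node(adjacency, node):
    """Copy of the graph without node and without its incident edges."""
    return {u: nbs - {node} for u, nbs in adjacency.items() if u != node}

def vulnerability(adjacency, node):
    """Assumed definition: relative drop in global efficiency when node is removed."""
    base = global_efficiency(adjacency)
    return (base - global_efficiency(remove_node(adjacency, node))) / base

# Hypothetical star network: hub "c" connected to leaves "x", "y", "z".
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
ge = global_efficiency(star)
v_hub = vulnerability(star, "c")
v_leaf = vulnerability(star, "x")
print(ge, v_hub, v_leaf)
```

Under these assumptions, removing the hub disconnects every remaining pair, so the network loses all of its efficiency, while removing a leaf does not hurt the remaining nodes.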
Local Measure: Centrality Measures
• Degree centrality of a node in a network is the number of links (edges) incident on the node.
• Closeness centrality determines how "close" a node is to other nodes in a network by measuring the sum of the shortest distances (geodesic paths) between that node and all other nodes in the network.
• Betweenness centrality determines the relative importance of a node by measuring the amount of traffic flowing through that node to other nodes in the network.
  • This is done by measuring the fraction of shortest paths connecting all pairs of nodes that contain the node of interest.
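Degree and closeness centrality can be sketched as follows for an unweighted, undirected, connected graph. Closeness is computed here as (N - 1) divided by the sum of shortest distances, which is one common normalization and an assumption on top of the slide text. The five-node path is a made-up example.

```python
from collections import deque

def bfs_distances(adjacency, start):
    """Hop distance from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def degree_centrality(adjacency):
    """Number of edges incident on each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}

def closeness_centrality(adjacency):
    """(N - 1) / farness, where farness is the sum of distances to all other nodes."""
    n = len(adjacency)
    closeness = {}
    for node in adjacency:
        farness = sum(bfs_distances(adjacency, node).values())
        closeness[node] = (n - 1) / farness
    return closeness

# Hypothetical example: the path a - b - c - d - e.
path = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
degree = degree_centrality(path)
closeness = closeness_centrality(path)
print(degree)
print(closeness)
```

The middle node c has the same degree as b and d but the highest closeness, because its distances to the other nodes sum to 6 while those of the end node a sum to 10.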
Importance of a node
Degree: Number of neighbors (or friends or contacts etc)
Degree: 7
Degree: 1
Example: Maximum influence
Importance of a node: Closeness
Ivano's closeness is better than Jonathan's and Holly's
Closeness: inverse of farness, where farness is how far a node is from all the other nodes.
Example: fastest spreading
Importance of a node
Betweenness: On how many shortest paths does the node lie?
Holly’s betweenness is better than Ivano’s
Example: War, advertisement (which roundabout?)
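One way to count how many shortest paths pass through each node is Brandes' algorithm, sketched below for unweighted, undirected graphs. It returns unnormalized scores. The example graph is hypothetical (it is not the network from the slide figure): two triangles joined through a bridge node d.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Brandes' algorithm; unnormalized scores for an undirected, unweighted graph."""
    betweenness = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Phase 1: BFS that counts shortest paths (sigma) and records predecessors.
        order = []
        predecessors = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}
        sigma[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    sigma[neighbor] += sigma[node]
                    predecessors[neighbor].append(node)
        # Phase 2: accumulate path dependencies, farthest nodes first.
        delta = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1 + delta[node])
            if node != source:
                betweenness[node] += delta[node]
    # Each undirected pair was visited from both endpoints, so halve the totals.
    return {node: value / 2 for node, value in betweenness.items()}

# Hypothetical graph: triangle a-b-c, triangle e-f-g, bridge node d between c and e.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
scores = betweenness_centrality(graph)
print(scores)
```

In this example d has only two neighbors, fewer than c or e, yet it lies on every shortest path between the two triangles, so it has the highest betweenness.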
Importance of a node
Clustering Coefficient (Triadic measure)
The clustering coefficient of node n is $C_n = \frac{2 e_n}{k_n (k_n - 1)}$, where $k_n$ is the number of neighbors of node n and $e_n$ is the number of connected pairs among the neighbors of n.
Source: http://qasimpasta.info/data/uploads/sina-2015/calculating-clustering-coefficient.pdf
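The local clustering coefficient formula, 2e/(k(k-1)), can be applied directly to an adjacency map. This is a minimal sketch for undirected graphs; returning 0 for nodes with fewer than two neighbors is a convention assumed here, and the small graph is made up.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """2 * e / (k * (k - 1)) for one node of an undirected graph."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined case reported as 0
    # e = number of neighbor pairs that are themselves connected
    connected_pairs = sum(
        1 for u, v in combinations(neighbors, 2) if v in adjacency[u]
    )
    return 2 * connected_pairs / (k * (k - 1))

# Hypothetical graph: n is linked to a, b, c; only a and b know each other.
graph = {
    "n": {"a", "b", "c"},
    "a": {"n", "b"},
    "b": {"n", "a"},
    "c": {"n"},
}
c_n = local_clustering(graph, "n")
c_a = local_clustering(graph, "a")
c_c = local_clustering(graph, "c")
print(c_n, c_a, c_c)
```

Node n has k = 3 neighbors and e = 1 connected pair, giving 2/6; node a has both of its neighbors connected to each other, giving 1.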
Group Level
Denser
Network Level
Less dense
Finding Communities
Source: https://www.youtube.com/watch?v=k0uxnVEuuz0
Types of communities
Communities
• The nodes in a community are more densely connected with each other than with the rest of the network.
• Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.
• Examples of community detection algorithms
  • Louvain: fast, but tends to identify either only large or only small communities.
  • Walktrap: random walks tend to stay inside densely connected subparts of the network.
Additional source: https://en.wikipedia.org/wiki/Modularity_(networks)
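To make the modularity idea concrete, the sketch below scores a given partition of an undirected, unweighted graph using the standard definition: for each community, the fraction of edges inside it minus the fraction expected from its degree sum. It only evaluates a partition; it is not an implementation of Louvain or Walktrap. The graph and both partitions are made up.

```python
def modularity(adjacency, communities):
    """Q = sum over communities of (L_c / m) - (d_c / (2m))^2."""
    m = sum(len(nbs) for nbs in adjacency.values()) / 2  # total number of edges
    q = 0.0
    for community in communities:
        # L_c: edges with both endpoints inside the community (each seen twice).
        internal_edges = sum(
            1 for u in community for v in adjacency[u] if v in community
        ) / 2
        # d_c: sum of the degrees of the community's nodes.
        degree_sum = sum(len(adjacency[u]) for u in community)
        q += internal_edges / m - (degree_sum / (2 * m)) ** 2
    return q

# Hypothetical graph: two triangles joined by the single edge c - d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
q_split = modularity(graph, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(graph, [{"a", "b", "c", "d", "e", "f"}])
print(q_split, q_single)
```

Splitting along the bridge edge gives a clearly positive score, while putting every node into one community gives 0, which matches the intuition of dense connections inside modules and sparse connections between them.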
Diffusion on top of network
Spread of Economic shock
Effect of a Process
Internal Entity
• A diffusion process happening in a network and affecting internal entities.
• Example:
  • Influence (product, behavior, etc.)
External Entity
• A diffusion process happening in a network and affecting an external entity.
• Example:
  • Effect of tweets on stock prices
Marketing: Linear Threshold Model
• Basic assumptions: each node is either active or inactive.
  • Active: has adopted
  • Inactive: has not adopted
• The tendency to become active increases monotonically as more neighbors become active.
• Nodes can only go from inactive to active, never from active back to inactive.
• Each node has a threshold weight (its resistance to influence).
• If the total weight of a node's active neighbors exceeds its threshold, the node gets influenced.
Source: http://optnetsci.cise.ufl.edu/class/cis6930fa15/Slides/group9.pdf
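The rules above can be simulated in a few lines. This is a minimal sketch, not a reference implementation: the influence weights, thresholds, and seed sets are invented, and the activation test uses a strict "greater than" to follow the wording of the slide (many formulations use "greater than or equal").

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Return the final set of active nodes.

    in_weights[v] maps each neighbor u of v to the weight of u's influence on v.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, weights in in_weights.items():
            if node in active:
                continue  # active nodes never become inactive again
            influence = sum(w for nb, w in weights.items() if nb in active)
            if influence > thresholds[node]:
                active.add(node)
                changed = True
    return active

# Hypothetical network of four customers.
in_weights = {
    "a": {},
    "b": {"a": 0.6, "c": 0.4},
    "c": {"b": 0.3, "d": 0.7},
    "d": {"c": 1.0},
}
thresholds = {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.2}

adopters_one_seed = linear_threshold(in_weights, thresholds, {"a"})
adopters_two_seeds = linear_threshold(in_weights, thresholds, {"a", "d"})
print(sorted(adopters_one_seed), sorted(adopters_two_seeds))
```

Because activation is monotone, the order in which nodes are checked within a sweep does not change the final active set. In this toy example, seeding only a reaches b, while seeding a and d reaches the whole network.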
Tools Summary
Tool     | Language/Platform          | Pros                                               | Cons
igraph   | R, Python                  | Can handle large/massive networks                  | Python version is weaker than R
NetworkX | Python                     | Python                                             | Not many functions
Gephi    | Stand-alone tool           | Visualization                                      | Cannot handle large data
NodeXL   | Plugin for Microsoft Excel | Easy to download data from Twitter, Facebook, etc. | Cannot handle very large data
Pajek    | Stand-alone tool           | Fast (uses C)                                      | Windows
muxViz   | R                          | Multilayer networks                                |
Additional sources:
https://www.kdnuggets.com/2015/06/top-30-social-network-analysis-visualization-tools.html
http://wic.litislab.fr/2010/pdf/Combe_WIVE10.pdf (Table 1)
Demo time!
https://courses.cs.ut.ee/2018/bda/spring/Main/Practice