Business Data Analytics Lecture 8 MTAT.03.319 The slides are available under creative common license. The original owner of these slides is the University of Tartu

Business Data Analytics

Lecture 8

MTAT.03.319

The slides are available under a Creative Commons license. The original owner of these slides is the University of Tartu.

Network/Link Analysis

Understanding graph structures in business settings.

Last Lecture: Remember? Enron Email Corpus

• This corpus is valuable to computer scientists and social-network theorists in ways that the e-mails’ authors and recipients never could have intended. Because it is a rich example of how real people in a real organization use e-mail—full of mundane lunch plans, boring meeting notes, embarrassing flirtations that revealed at least one extramarital affair, and the damning missives that spelled out corruption—it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.

• This research has had widespread applications: computer scientists have used the corpus to train systems that automatically prioritize certain messages in an inbox and alert users that they may have forgotten about an important message. Other researchers use the Enron corpus to develop systems that automatically organize or summarize messages. Much of today’s software for fraud detection, counterterrorism operations, and mining workplace behavioral patterns over e-mail has been somehow touched by the data set.

https://www.technologyreview.com/s/515801/the-immortal-life-of-the-enron-e-mails/

Enron Network

Source: Jana Diesner, Kathleen M. Carley. Exploration of Communication Networks from the Enron Email Corpus

Measure                      Year 2000    Year 2001
Density                      0.018        0.031
# Disconnected components    96           39

Relations among 16 families (Florence, Italy)

Marriage relations between 16 important families in Florence during the 15th century

Period during which the Medici rose to power

Some key marriages were engineered by Cosimo de' Medici

Relations among 16 families (Florence, Italy)

How do we measure the importance of families?

Can this network explain how the Medici family rose to power?

Some key marriages were engineered by Cosimo de' Medici

Trading

Source: http://www.cepii.fr/PDF_PUB/wp/2013/wp2013-24.pdf

Mobile Network

[Figure: two call graphs with numbered nodes. Left: lengthy calls but to fewer users. Right: short calls but to a large number of users.]

Mobile Network: How to select influential users?

[Figure: two call graphs with numbered nodes, with one node marked "Influential user". Left: lengthy calls but to fewer users. Right: short calls but to a large number of users.]

Social Capital

• Networks of relationships among people who live and work in a particular society, enabling that society to function effectively.

• Social capital refers to an individual’s social network and the resources embedded within the networks that can benefit the individual in terms of achieving their goals and facilitating their actions.

• It is context dependent

  • Example: a company is looking for a sales manager or another Java expert.

Customers’ spending behavior

• Intersection between social behavior and income levels.

• Localities (cell tower areas) with diverse network interactions tend to have higher economic development.

• People with higher diversity in social contacts tend to have higher incomes.

• A second line of investigation has focused on using homophily and social closeness to predict the products of interest to individuals.

• Easily available data on prospects, such as demographic and sociographic factors, often have limited ability to predict future spending behavior.

• Highly social people are also likely to earn higher wages, find better jobs, and live healthier lives.

• There is growing evidence that social behavior is a fundamental human characteristic that affects multiple aspects of human life.

Source: Vivek K. Singh, Laura Freeman, Bruno Lepri, Alex (Sandy) Pentland. Classifying Spending Behavior using Socio-Mobile Data

Understanding Networks through graph theory

• Terminologies and basics

  • Networks can be represented using graphs, G(N, E).

• Nodes (N): Set of entities

• Examples:

  • Users in Facebook

  • Proteins in a protein-protein network

  • Students in a homework-homework network

• Edges (E): Set of connections.
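
As an illustration of the G(N, E) notation, the following minimal sketch stores a small undirected network as a node set, an edge set, and an adjacency map. The names (`Ann`, `Bob`, `Cara`, `Dan`) and the helper `build_adjacency` are hypothetical and chosen only for this example; they do not come from the lecture.

```python
# Hypothetical toy network: N is the set of entities, E the set of connections.
nodes = {"Ann", "Bob", "Cara", "Dan"}
edges = {("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Cara"), ("Cara", "Dan")}


def build_adjacency(nodes, edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected: store the edge in both directions
    return adjacency


adjacency = build_adjacency(nodes, edges)

# The degree of a node is simply the size of its neighbor set.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
```

An adjacency map is only one possible representation; an edge list or an adjacency matrix carries the same information.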

Categorization of Networks

Directed vs. Undirected

• Directed

  • Ex: Twitter

• Undirected

  • Ex: Facebook

Weighted vs. Unweighted

• Unweighted: All relations are equally important.

• Weighted: Some relations are more important than others.

  • Ex: Some streets are more important than others, based on traffic.
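
A short sketch of how these categories might look in code. The "follows" pairs and the street traffic counts are made-up values for illustration, and `edge_weight` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Directed, unweighted: a hypothetical "follows" relation (Twitter-style).
# The pair (a, b) means "a follows b"; the reverse need not hold.
follows = [("ann", "bob"), ("bob", "ann"), ("cara", "ann")]

out_neighbors = defaultdict(set)
in_neighbors = defaultdict(set)
for follower, followed in follows:
    out_neighbors[follower].add(followed)
    in_neighbors[followed].add(follower)

# Undirected, weighted: hypothetical streets with traffic counts as weights.
streets = [("A", "B", 1200), ("B", "C", 300), ("A", "C", 50)]

weight = {}
for u, v, traffic in streets:
    weight[frozenset((u, v))] = traffic  # frozenset ignores direction


def edge_weight(u, v):
    """Weight of the undirected edge {u, v}, or 0 if there is no such edge."""
    return weight.get(frozenset((u, v)), 0)
```

In the directed case `ann` has two incoming links but only one outgoing link; in the weighted case `edge_weight("A", "B")` and `edge_weight("B", "A")` return the same value.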

Local Vs. Global Concepts

Local

• Degree

• Centrality measures of nodes

• Local Clustering coefficient

Global

• Degree Distribution

• Diameter

• Average Path Length

• Density

• Global Clustering Coefficient

• Communities

• Network Topology/Models

• Network robustness

Measures

• Diameter: greatest shortest-path distance between any pair of vertices

  • How stretched is the network?

• Average Path Length: found by computing the shortest path between all pairs of nodes, adding the lengths up, and dividing by the total number of pairs.

  • How many hops it takes on average for a message to reach its destination.

• Density: number of edges present / number of edges possible in the ideal (fully connected) case (dyadic measure)

• Clique/Complete Graph: a completely connected network, where every node is connected to every other node. These networks are symmetric in that all nodes have in-links and out-links from all others.

Source and must read: https://en.wikipedia.org/wiki/Network_science
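
The three global measures above can be sketched with a breadth-first search. This is a minimal illustration that assumes a connected, undirected, unweighted graph stored as an adjacency map; the four-node path `A-B-C-D` and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adjacency, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def network_measures(adjacency):
    """Diameter, average path length and density.

    Assumes the graph is connected, undirected and has at least two nodes.
    """
    nodes = list(adjacency)
    n = len(nodes)
    all_dist = {u: bfs_distances(adjacency, u) for u in nodes}
    pair_distances = [all_dist[u][v] for u, v in combinations(nodes, 2)]
    edge_count = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    possible_edges = n * (n - 1) // 2  # edges in the complete graph
    return {
        "diameter": max(pair_distances),
        "average_path_length": sum(pair_distances) / len(pair_distances),
        "density": edge_count / possible_edges,
    }


# Hypothetical path graph A - B - C - D.
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
measures = network_measures(path)
```

For this path the six pairwise distances are 1, 2, 3, 1, 2 and 1, so the diameter is 3, the average path length is 10/6, and the density is 3 of 6 possible edges, i.e. 0.5.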

Network Structure: Models of networks

Long tail / Zipf's law / Barabási

Network Resilience

• Ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation

• A fundamental property for achieving smartness in cities

• Global Efficiency (GE) represents the ability to efficiently exchange information in the network G

• N: total nodes

• Len(sp(vi, vj)): length of the shortest path between vi and vj

• Vulnerability is the flip side of resilience

  • Limited ability to absorb and react to the strains caused by adverse events

• After the removal of node vi and all the edges incident on it

• Undirected network: neighbors of vi
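
The formulas for these two quantities did not survive in the slide text, so the sketch below assumes the commonly used definitions: GE(G) = 1/(N(N-1)) · Σ over ordered pairs i ≠ j of 1/Len(sp(vi, vj)), with unreachable pairs contributing 0, and the vulnerability of vi as the relative drop (GE(G) - GE(G without vi)) / GE(G). Both definitions, the function names, and the star-shaped example network are assumptions for illustration, not taken from the lecture.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def global_efficiency(adjacency):
    """Average of 1/distance over all ordered node pairs (assumed definition)."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adjacency:
        for target, d in bfs_distances(adjacency, source).items():
            if target != source:
                total += 1.0 / d  # unreachable pairs are skipped, i.e. add 0
    return total / (n * (n - 1))


def remove_node(adjacency, node):
    """Copy of the graph without `node` and without its incident edges."""
    return {
        u: {v for v in nbrs if v != node}
        for u, nbrs in adjacency.items()
        if u != node
    }


def vulnerability(adjacency, node):
    """Relative loss of global efficiency after removing `node` (assumed definition)."""
    base = global_efficiency(adjacency)
    if base == 0:
        return 0.0
    return (base - global_efficiency(remove_node(adjacency, node))) / base


# Hypothetical star network: one hub connected to three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
ge_star = global_efficiency(star)
v_hub = vulnerability(star, "hub")
v_leaf = vulnerability(star, "a")
```

In the star, removing the hub disconnects every remaining pair, so its vulnerability is 1.0, while removing a leaf leaves the rest of the network fully reachable.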

Local Measure: Centrality Measures

• Degree centrality of a node in a network is the number of links (edges) incident on the node.

• Closeness centrality determines how "close" a node is to other nodes in a network by measuring the sum of the shortest distances (geodesic paths) between that node and all other nodes in the network. The smaller the sum, the closer the node.

• Betweenness centrality determines the relative importance of a node by measuring the amount of traffic flowing through that node to other nodes in the network.

  • This is done by measuring the fraction of shortest paths connecting all pairs of nodes and containing the node of interest.
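
Degree and closeness can be sketched in a few lines. The example assumes a connected, undirected graph; the normalization `(n - 1) / farness` is one common convention for closeness and is an assumption here, as are the node names.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def degree_centrality(adjacency):
    """Number of edges incident on each node."""
    return {node: len(nbrs) for node, nbrs in adjacency.items()}


def closeness_centrality(adjacency):
    """(n - 1) divided by the sum of shortest distances to all other nodes."""
    n = len(adjacency)
    result = {}
    for node in adjacency:
        farness = sum(bfs_distances(adjacency, node).values())
        result[node] = (n - 1) / farness if farness > 0 else 0.0
    return result


# Hypothetical network: a hub with three neighbors, one of which has a tail.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"c"},
}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

Here `hub` has degree 3 and the highest closeness (4/5 = 0.8), while the tail node `d` is farthest from everyone else (4/9).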

Importance of a node

Degree: Number of neighbors (or friends or contacts etc)

Degree: 7

Degree: 1

Example: Maximum influence

Importance of a node: Closeness

Ivano's closeness is better than Jonathan's and Holly's

Closeness: inverse of farness, where farness is how far the node is from all the other nodes.

Example: fastest spreading

Importance of a node

Betweenness: In how many shortest paths is the node present?

Holly’s betweenness is better than Ivano’s

Example: war, advertisement (which roundabout?)
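
Counting shortest paths through a node can be sketched with Brandes' algorithm, a standard technique that the slides do not name. The version below is unnormalized and assumes an undirected, unweighted graph; the two-triangle example network is hypothetical.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Phase 1: BFS from source, counting shortest paths to every node.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        path_count[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
                if dist[neighbor] == dist[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Phase 2: walk back from the farthest nodes, accumulating the
        # fraction of shortest paths that pass through each node.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # In an undirected graph every pair is visited from both endpoints.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical network: triangles A-B-C and D-E-F joined by the edge C-D.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
betweenness = betweenness_centrality(graph)
```

Nodes `C` and `D` sit on every shortest path between the two triangles, so each gets a score of 6, while the remaining nodes score 0.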

Importance of a node

Clustering coefficient: $C_n = 2e_n / (k_n(k_n - 1))$, where $k_n$ is the number of neighbors of node $n$ and $e_n$ is the number of connected pairs between all neighbors of $n$.
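
A minimal sketch of the local clustering coefficient, computed as twice the number of connected neighbor pairs divided by k(k - 1). The graph and the function name `local_clustering` are hypothetical, and returning 0 for nodes with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """2 * e / (k * (k - 1)) for an undirected graph."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases are reported as 0
    # e: number of neighbor pairs that are themselves connected
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2 * e / (k * (k - 1))


# Hypothetical network: n has neighbors b, c, d; only b and c know each other.
graph = {
    "n": {"b", "c", "d"},
    "b": {"n", "c"},
    "c": {"n", "b"},
    "d": {"n"},
}
c_n = local_clustering(graph, "n")
```

Node `n` has k = 3 neighbors and e = 1 connected pair among them, so its coefficient is 2 · 1 / (3 · 2) = 1/3. Node `b`, whose two neighbors are connected to each other, gets 1.0.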

Clustering Coefficient (Triadic measure)

Source: http://qasimpasta.info/data/uploads/sina-2015/calculating-clustering-coefficient.pdf

Group Level

Network Level

Denser

Less dense

Finding Communities

Source: https://www.youtube.com/watch?v=k0uxnVEuuz0

Types of communities

Communities

• The nodes in a community are more densely connected with each other than with the rest of the network.

• Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules.

• Examples of community detection algorithms

  • Louvain: fast, but tends to identify either only large or only small communities.

  • Walktrap: based on the idea that random walks tend to stay inside densely connected subparts of the network.

Additional source: https://en.wikipedia.org/wiki/Modularity_(networks)
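
To make "high modularity" concrete, the sketch below scores a given partition using the usual per-community form Q = Σ_c [L_c / m - (d_c / 2m)²], where m is the total number of edges, L_c the number of edges inside community c, and d_c the sum of degrees in c. It only evaluates a partition; it is not an implementation of Louvain or Walktrap. The two-triangle network and the function name are hypothetical.

```python
def modularity(adjacency, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2  # total edges
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both endpoints, hence the / 2.
        internal = sum(
            1 for u in members for v in adjacency[u] if v in members
        ) / 2
        degree_sum = sum(len(adjacency[u]) for u in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


# Hypothetical network: triangles A-B-C and D-E-F joined by the edge C-D.
graph = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E", "F"}, "E": {"D", "F"}, "F": {"D", "E"},
}
q_split = modularity(graph, [{"A", "B", "C"}, {"D", "E", "F"}])
q_single = modularity(graph, [{"A", "B", "C", "D", "E", "F"}])
```

Splitting the network into its two triangles gives Q = 5/14 (about 0.357), whereas putting every node into one community gives Q = 0, which matches the intuition that the two-triangle split is the better community structure.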

Diffusion on top of network

Spread of Economic shock

Effect of a Process

Internal Entity

• A diffusion process happening in a network and affecting internal entities.

• Example:

  • Influence (product, behavior, etc.)

External Entity

• A diffusion process happening in a network and affecting an external entity.

• Example:

  • Effect of tweets on stock prices

Marketing: Linear Threshold Model

• Basic assumption: each node is either active or inactive.

  • Active: has adopted

  • Inactive: has not adopted

• The tendency to become active increases monotonically as more neighbors become active.

• Nodes can only go from inactive to active; an active node never becomes inactive again.

• Each node has a threshold weight (influence rejection).

• If the total weight of the active neighbors becomes more than the threshold, the node gets influenced.

Source: http://optnetsci.cise.ufl.edu/class/cis6930fa15/Slides/group9.pdf
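
The assumptions above can be turned into a small simulation. This is a sketch under stated assumptions: influence weights, thresholds and node names are invented for the example, activation uses a strict "more than the threshold" comparison as worded on the slide (many formulations use "at least"), and updates happen in synchronous rounds.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Run the linear threshold model until no further node activates.

    in_weights[v] maps each neighbor u of v to the weight of u's influence on v.
    thresholds[v] is the threshold of node v.
    seeds is the initially active set.
    """
    active = set(seeds)
    while True:
        newly_active = set()
        for node, influencers in in_weights.items():
            if node in active:
                continue  # active nodes never switch back
            pressure = sum(w for nb, w in influencers.items() if nb in active)
            if pressure > thresholds[node]:
                newly_active.add(node)
        if not newly_active:
            return active
        active |= newly_active


# Hypothetical influence weights and thresholds.
in_weights = {
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.2},
}
thresholds = {"b": 0.5, "c": 0.5, "d": 0.5}

adopters = linear_threshold(in_weights, thresholds, seeds={"a"})
```

Starting from seed `a`, node `b` activates in the first round (0.6 > 0.5), node `c` in the second round once both `a` and `b` are active (0.3 + 0.3 > 0.5), and `d` never activates because the influence from `c` alone (0.2) stays below its threshold.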

Tools Summary

Tool | Language/Platform | Pros | Cons
igraph | R, Python | Can handle large/massive networks | Python version is weaker than R
NetworkX | Python | Python | Not many functions
Gephi | Stand-alone tool | Visualization | Cannot handle large data
NodeXL | Plugin for Microsoft Excel | Easy to download data from Twitter, Facebook, etc. | Cannot handle very large data
Pajek | Stand-alone tool | Fast (uses C) | Windows
muxViz | R | Multilayer networks |

Additional sources:
https://www.kdnuggets.com/2015/06/top-30-social-network-analysis-visualization-tools.html
http://wic.litislab.fr/2010/pdf/Combe_WIVE10.pdf (Table 1)

Demo time!

https://courses.cs.ut.ee/2018/bda/spring/Main/Practice