38
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Analytics Building Blocks Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

http://poloclub.gatech.edu/cse6242CSE6242 / CX4242: Data & Visual Analytics

Analytics Building Blocks

Duen Horng (Polo) Chau Assistant ProfessorAssociate Director, MS AnalyticsGeorgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Page 2: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

2

What is Data & Visual Analytics?

Page 3: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

2

What is Data & Visual Analytics?

No formal definition!

Page 4: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

2

Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.

What is Data & Visual Analytics?

No formal definition!

Page 5: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

3

What are the “ingredients”?

Page 6: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

3

What are the “ingredients”?

Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc.

Wasn’t this complex before this big data era. Why?

Page 7: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

4http://spanning.com/blog/choosing-between-storage-based-and-unlimited-storage-for-cloud-data-backup/

Page 8: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

What is big data? Why should you care?(“big data” is buzz word, so is “IoT” - Internet of Things)

• large amount of info

• Healthcare (EHRs; mobile health sensor data)

• personal history; sickness

• Sports (baseball, basketball, soccer)

• Navigation and maps

• Business (“need analysis”: infer what clients want)

• Social media

• Sensor networks

• stock market (high frequency trading)

• supply chain logistics

Page 9: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

(From previous class)What is big data? Why care?

• Many companies’ businesses are based on big data (Google, Facebook, Amazon, Apple, Symantec, LinkedIn, and many more)

• Web search

• Rank webpages (PageRank algorithm)

• Predict what you’re going to type

• Advertisement (e.g., on Facebook)

• Infer users’ interest; show relevant ads

• Infer what you like, based on what your friends like

• Recommendation systems (e.g., Netflix, Pandora, Amazon)

• Online education

• Health IT: patient records (EMR)

• Bio and Chemical modeling:

• Finance

• Cybersecruity

• Internet of Things (IoT)

Page 10: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Good news! Many jobs!Most companies are looking for “data scientists”

The data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team- Gartner (http://www.gartner.com/it-glossary/data-scientist)

Breadth of knowledge is important.

This course helps you learn some important skills.

Page 11: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Analytics Building Blocks

Page 12: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 13: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Building blocks, not “steps”• Can skip some

• Can go back (two-way street)

• Examples

• Data types inform visualization design

• Data informs choice of algorithms

• Visualization informs data cleaning (dirty data)

• Visualization informs algorithm design (user finds that results don’t make sense)

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 14: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

How big data affects the process?The Vs of big data (3Vs, 4Vs, now 7Vs)

Volume: “billions”, “petabytes” are common

Velocity: think Twitter, fraud detection, etc.

Variety: text (webpages), video (youtube)…

Veracity: uncertainty of data

Variability

Visualization

Value

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Disseminationhttp://www.ibmbigdatahub.com/infographic/four-vs-big-data

http://dataconomy.com/seven-vs-big-data/

Page 15: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Gartner's 2015 Hype Cycle http://www.gartner.com/newsroom/id/3114217

Page 16: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Schedule

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 17: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Two Example Projects (from our research)

Page 18: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

NetProbe: Fraud Detection in Online Auction

NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Page 19: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Find bad sellers (fraudsters) on eBay who don’t deliver their items

NetProbe: The Problem

Buyer

$$$

Seller

16

Auction fraud is #3 online crime in 2010source: www.ic3.gov

Page 20: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

17

Page 21: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

NetProbe: Key Ideas§ Fraudsters fabricate their reputation by

“trading” with their accomplices§ Fake transactions form near bipartite cores§ How to detect them?

18

Page 22: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

NetProbe: Key IdeasUse Belief Propagation

19

F A HFraudster

AccompliceHonest

Darker means more likely

Page 23: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

NetProbe: Main Results

20

Page 24: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

21

Page 25: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

21

Page 26: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

21

“Belgian Police”

Page 27: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

22

Page 28: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

What did NetProbe go through?

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Scraping (built a “scraper”/“crawler”)

Design detection algorithm

Not released

Paper, talks, lectures

Page 29: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Apolo Graph Exploration: Machine Learning + Visualization

24

Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.

Page 30: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Finding More Relevant Nodes

HCIPaper

Data MiningPaper

Citation network

25

Page 31: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Finding More Relevant Nodes

HCIPaper

Data MiningPaper

Citation network

25

Page 32: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Finding More Relevant Nodes

Apolo uses guilt-by-association(Belief Propagation)

HCIPaper

Data MiningPaper

Citation network

25

Page 33: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Demo: Mapping the Sensemaking Literature

26

Nodes: 80k papers from Google Scholar (node size: #citation) Edges: 150k citations

Page 34: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007
Page 35: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007
Page 36: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Key Ideas (Recap)Specify exemplarsFind other relevant nodes (BP)

28

Page 37: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

What did Apolo go through?

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Scrape Google Scholar. No API :(

Design inference algorithm (Which nodes to show next?)

Paper, talks, lectures

Interactive visualization you just saw

Page 38: Analytics Building Blocks - Visualizationpoloclub.gatech.edu/cse6242/2016spring/slides/CSE6242-1...Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

Homework 1 (out next week; tasks subject to change)

• Simple “End-to-end” analysis

• Collect data using API)

• Movies (Actors, directors, related movies, etc.)

• Store in SQLite database

• Transform data to movie-movie network

• Analyze, using SQL queries (e.g., create graph’s degree distribution)

• Visualize, using Gephi

• Describe your discoveries

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination