Analytics Drives Big Data Drives Infrastructure Confessions of Storage turned Analytics Geeks

Dr. Aloke Guha

29th IEEE Conference on Massive Data Storage

May 8th, 2013

[email protected]

What’s Common Between a Sensor That Could Distinguish a Fine Cognac and Predicting Movies You’d Like on Netflix?

Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

The Sommelier “Robot”

Predicting What Movies You’d Watch

(Analytics, BigData, DataStore)+

Many Analytics Techniques . . .

• Statistics: Linear Regression, Time-Series, Decision Trees, R (Ihaka and Gentleman, 1993)

• AI (McCarthy, 1956): Expert Systems, e.g. Dendral (Feigenbaum, 1965)

• Machine Learning: Neural Networks (SNARC, Minsky, 1951), SVM (Vapnik, 1992), LDA, Naïve Bayes, K-nearest neighbor, Random Forests, Genetic Algorithms (Fraser and Burnell, 1970), . . .

Common Analytics Processing pre-2000

• Sources: Local

• Data: Numeric, Homogeneous

• Processing: Local

• Consumer: Local

• Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .

Flavor Predictor – Neural Networks

USPTO #5,373,452 (1994) 1988

Pattern Recognition – Genetic Algorithms

USPTO #5,140,530, 1992

Small to Big

http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news

Typical Analytics: 2000-2006

• Sources: Global, Social Networks

• Data: Heterogeneous, Numeric, Text

• Processing: Hosted/Scale

• Consumer: Global

• Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.

2007- : Internet Data Analytics

Financial Risk Scoring: Detect

Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention “risk” (or equivalent terms) during the earnings call

Financial Risk Scoring: Listen

*Risk Scoring: detect incremental change in occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call
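
The risk-scoring idea above can be sketched as a term count per earnings call followed by a call-over-call difference. This is an illustration only: the term list `RISK_TERMS`, the function names, and the sample transcripts are assumptions rather than anything from the talk, and a plain word match stands in for whatever semantic matching the real system used.

```python
import re

# Hypothetical list of "risk" and near-equivalent terms (an assumption).
RISK_TERMS = {"risk", "risks", "risky", "uncertainty", "exposure", "headwind", "headwinds"}


def risk_mentions(transcript):
    """Count words in one earnings-call transcript that match a risk term."""
    words = re.findall(r"[a-z]+", transcript.lower())
    return sum(1 for word in words if word in RISK_TERMS)


def incremental_changes(transcripts):
    """Return the call-over-call change in risk mentions, oldest call first."""
    counts = [risk_mentions(t) for t in transcripts]
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up transcripts for three consecutive calls.
calls = [
    "Demand was strong and we see little risk to guidance.",
    "We see some risk in Europe and more uncertainty on pricing.",
    "Currency exposure is a risk, supply is a risk, and headwinds persist.",
]
changes = incremental_changes(calls)
```

A rising sequence of changes would be the signal of interest; a production system would presumably normalize by call length and match phrases, not single words.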

Banking: Credit Worthiness – remember 2008?

Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks

Share of Voice: Online Buzz

Sentiment Analysis

Analytics Processing: 2007-

• Sources: Global, Mobile, New Social (Instagram, . . .)

• Data: Multi-Dimensional, Heterogeneous, Audio/Video

• Processing: Hosted/Scale

• Consumer: Global

• Analytics: Batch, Streaming, . . .

2008 - : Real-Time/Streaming Analytics

Brand Marketing

Brand Management

Customer Support

Customer Support

Lead Generation

. . . More Data, Faster

http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA

“Internet of Things”

http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-m2m-technology-to-drive-connected-smarter-cities/

Message Queuing Telemetry Transport

Machine-to-Machine

AumniData: Batch Processing

[Architecture diagram; recoverable components:]

• Sources: Twitter, Blog/Web Site, RSS/ATOM, YouTube

• Collection: Data Collector (Batch Scheduled), Feed Requestor / URL Scanner

• NLP: NLP + Cruxly Intent Detection (AWS), several instances; NLP Stack + AumniData Classifier + Analytics* (RackSpace VM)

• Stores: Content Store; Content / Metadata Index (MySQL); Dashboard Store (SQL Server)

• Dashboard: Application (3rd party App), Configuration (TomCat), Custom Analytics, Display, Ad-Hoc Query, Summary

Cruxly: Stream Processing

[Architecture diagram; recoverable components:]

• Source: Twitter

• Ingest: Streaming API Client (Heroku Worker, 24x7), several instances; Request (Keywords) out, Tweets (Keywords) in

• NLP: NLP (NER, etc.) + Cruxly Intent Detection (AWS), several instances

• Stores: Tweets Content Store (DynamoDB); Tweet ID + Intent Signal (Heroku PostgreSQL)

• Front end: Reports / Dashboard; Tracker Editor (web app, Heroku)
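
The stream-processing flow on this slide can be sketched as a small pipeline: filter incoming tweets by keyword, run intent detection on each match, and write the tweet content and the (tweet ID, intent) signal to separate stores. Everything named below is hypothetical: `detect_intent` is a trivial keyword rule standing in for the Cruxly NLP service, and two in-memory containers stand in for the content store (DynamoDB) and the signal store (PostgreSQL).

```python
from typing import Iterable, Iterator


def keyword_filter(tweets: Iterable[dict], keywords: set) -> Iterator[dict]:
    """Yield only tweets whose text mentions a tracked keyword."""
    for tweet in tweets:
        words = set(tweet["text"].lower().split())
        if words & keywords:
            yield tweet


def detect_intent(text: str) -> str:
    """Placeholder intent rule (an assumption, not the real detector)."""
    lowered = text.lower()
    if "buy" in lowered or "want" in lowered:
        return "purchase"
    if "help" in lowered or "broken" in lowered:
        return "support"
    return "none"


def process_stream(tweets, keywords, content_store: dict, signal_store: list) -> None:
    """Consume tweets one at a time, as a streaming worker would."""
    for tweet in keyword_filter(tweets, keywords):
        # Full text goes to the content store.
        content_store[tweet["id"]] = tweet["text"]
        # Only the ID and the detected intent go to the signal store.
        signal_store.append((tweet["id"], detect_intent(tweet["text"])))


# Made-up sample stream.
sample = [
    {"id": 1, "text": "I want to buy a new phone"},
    {"id": 2, "text": "Lovely weather today"},
    {"id": 3, "text": "My phone is broken, please help"},
]
content, signals = {}, []
process_stream(sample, {"phone"}, content, signals)
```

Keeping the small ID-plus-intent records apart from the bulky tweet text mirrors the two stores in the diagram: dashboards can query the signals without touching the content store.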

Data Analytics Demands . . .

[Diagram; recoverable labels:]

• Stages: Store, Process, Analyze, View

• Components: Data Collector (Text / Sensor Data / Stream . . .); NLP; Classify; Index; Query / RT Query; Ad Hoc / Search / SQL; Custom Analytics; Dashboards; Chart; Report

• Tools: Storm, Yarn, R, Machine Learning Library, Stats Library

Storage Implications: Back to the Future

MB/s – Batch

IOPs – Stream

Both?
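
The two metrics are linked by the I/O size: throughput equals IOPS times bytes per operation. A short worked sketch shows why a batch scan and a stream of small updates stress storage differently. The device numbers are made up for illustration (the 1 MiB and 4 KiB request sizes and the IOPS figures are assumptions, not measurements from the talk).

```python
def throughput_mb_per_s(iops, io_size_bytes):
    """Throughput in MB/s (10**6 bytes per second) for a given IOPS rate and I/O size."""
    return iops * io_size_bytes / 1_000_000


def iops_needed(mb_per_s, io_size_bytes):
    """IOPS required to sustain a target MB/s at a given I/O size."""
    return mb_per_s * 1_000_000 / io_size_bytes


# Illustrative workloads (assumed numbers).
batch = throughput_mb_per_s(iops=200, io_size_bytes=1024 * 1024)  # large sequential reads
stream = throughput_mb_per_s(iops=20_000, io_size_bytes=4096)     # small random writes
```

With these numbers the batch job moves roughly 210 MB/s using only 200 operations per second, while the stream issues 100 times as many operations yet moves about 82 MB/s. A system serving both has to be sized on both axes.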

Storage Implications: Back to the Future II, III

[Diagram: Hadoop cluster; recoverable labels:]

• Nodes: Master, Slave #1 . . . Slave #N, Mgmt Node

• MapReduce layer: Job Tracker, Task trackers

• HDFS layer: Name Node, Data Nodes, HDFS client

• Other services: Zookeeper, Hive, Pig, Oozie, HUE

• Storage Capacity Scaling?

• Storage Tiering?

• Import/Export Data?

A More General Data Analytics Framework?

[Diagram; recoverable labels:]

• Data Ingesters (Basic), Data Ingesters (Smart)

• Sensor Processing: Data Integration

• Content Store; Metadata / In-Mem Store

• Processing: Stream and Batch

• Analytics Processing

• Visualization Library / Interactive Query

• Storage: Local Storage / Flash / DAS; SAN; MapReduce / Distributed Data Store

Conclusion

• Data Analytics → Big Data → Scale-Out

• Variety → Infrastructure

• Volume → Bandwidth Support

• Velocity → Streaming Support

• We Solved the Processing Problem

• We Need to Solve the Larger Storage Problem

Grateful Acknowledgements

• Kapil Tundwal

• Dr. Kirill Kireyev

• Dr. Andrew Lampert

• Venky Madireddy

• Dr. Shumin Wu

• Joan Wrabetz
