
Analytics Drives Big Data Drives Infrastructure: Confessions of Storage turned Analytics Geeks

Dr. Aloke Guha

29th IEEE Conference on Massive Data Storage

May 8th, 2013

aloke@cruxly.com


What’s Common Between a Sensor that could Distinguish a fine Cognac, and Predicting Movies You’d Like on Netflix?

Aloke Guha: Analytics Drives Big Data Drives Infrastructure, 29th IEEE MSST 2013

The Sommelier “Robot”


Predicting What Movies You’d Watch


(Analytics, BigData, DataStore)+


Many Analytics Techniques . . .

• Statistics: Linear Regression, Time-Series, Decision Trees, R

• AI (McCarthy, 1956): Expert Systems

• Machine Learning: Neural Networks, SVM, LDA, Naïve Bayes, K-nearest neighbor, Random Forests, Genetic Algorithms, . . .

• Milestones: SNARC (Minsky, 1951), Dendral (Feigenbaum, 1965), Fraser and Burnell (1970), . . . Vapnik (1992), Ihaka and Gentleman (1993)

Common Analytics Processing pre-2000

• Sources: Local

• Data: Numeric, Homogeneous

• Processing: Local

• Consumer: Local

• Analytics: Linear/Non-Linear Regression, Neural Networks, SVM, LDA, LSA, Decision Trees, Monte Carlo, Lin-Ops, Expert Systems . . .


Flavor Predictor – Neural Networks

USPTO #5,373,452 (1994) 1988


Pattern Recognition – Genetic Algorithms

USPTO #5,140,530 (1992)


Small to Big

http://article.wn.com/view/2013/04/04/Big_data_forefather_Michael_Stonebraker_shows_no_signs_of_sl/#/related_news


Typical Analytics: 2000-2006

• Sources: Global, Social Networks

• Data: Heterogeneous, Numeric, Text

• Processing: Hosted/Scale

• Consumer: Global

• Analytics: Batch Mode, Social Media Marketing, Churn Detection, Sentiment Analysis, etc.


2007- : Internet Data Analytics


Financial Risk Scoring: Detect

Risk Scoring: detect incremental change in the number of occurrences where corporate officers mention “risk” (or equivalent terms) during an earnings call


Financial Risk Scoring: Listen

*Risk Scoring: detect incremental change in occurrences where corporate officers mention “risk” (or semantically equivalent terms) during the corporate earnings call
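The risk-scoring idea above can be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: plain keyword matching stands in for whatever semantic matching finds "equivalent terms", and the term list, function names, and transcript snippets are all hypothetical.

```python
import re
from typing import Iterable, List

# Hypothetical list of "risk" and near-equivalent terms; the slide does not
# say which terms are treated as equivalent.
RISK_TERMS = ("risk", "risks", "risky", "uncertainty", "exposure",
              "headwind", "headwinds")

def count_risk_mentions(transcript: str, terms: Iterable[str] = RISK_TERMS) -> int:
    """Count whole-word, case-insensitive occurrences of any risk term."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return len(re.findall(pattern, transcript, flags=re.IGNORECASE))

def incremental_changes(counts: List[int]) -> List[int]:
    """Call-over-call change in the number of mentions."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]

# Made-up transcript snippets, oldest earnings call first.
calls = [
    "We see limited risk in the current quarter.",
    "Currency risk and supply risk remain; uncertainty is elevated.",
    "Headwinds persist, and exposure to credit risk has grown. Risks are rising.",
]
counts = [count_risk_mentions(c) for c in calls]
changes = incremental_changes(counts)
print(counts, changes)
```

A rising sequence of changes would flag a company whose officers are talking about risk more often from one call to the next.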


Banking: Credit Worthiness – remember 2008?

Analyze bank reports to assess loans, payments, recoveries, etc. for key bank indexes, groups of banks, or individual banks


Share of Voice: Online Buzz


Sentiment Analysis


Analytics Processing: 2007-

• Sources: Global, Mobile, New Social (Instagram, . . .)

• Data: Multi-Dimensional, Heterogeneous, Audio/Video

• Processing: Hosted/Scale

• Consumer: Global

• Analytics: Batch, Streaming, . . .


2008 - : Real-Time/Streaming Analytics


Brand Marketing


Brand Management


Customer Support


Customer Support


Lead Generation


. . . More Data, Faster

http://www.cioinsight.com/it-strategy/big-data/data-analytics-allows-pg-to-turn-on-a-dime/?kc=CIOMINUTE05062013CIOA


“Internet of Things”

http://www.news-sap.com/survey-by-sap-and-harris-interactive-finds-brazil-china-germany-and-india-most-ready-for-m2m-technology-to-drive-connected-smarter-cities/

Message Queuing Telemetry Transport

Machine-to-Machine


AumniData: Batch Processing

• Sources: Twitter, Blog/Web Site, YouTube, RSS/ATOM

• Data Collectors (Batch Scheduled), Feed Requestor / URL Scanner

• NLP + Cruxly Intent Detection (AWS)

• NLP Stack + AumniData Classifier + Analytics* (RackSpace VM)

• Stores: Content Store, Content / Metadata Index (MySQL), Dashboard Store (SQL Server)

• Dashboard Application (3rd party App), Dashboard Configuration (TomCat)

• Outputs: Custom Analytics, Display, Ad-Hoc Query, Summary
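A scheduled batch pass over an architecture like this might look roughly like the sketch below. Everything here is an assumption made for illustration: the collector functions return canned text, in-memory dictionaries stand in for the content store and metadata index, and the keyword rule in `classify` is a placeholder rather than the AumniData classifier.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical collectors: each returns the documents gathered since the
# last scheduled run. Real collectors would call Twitter, RSS/ATOM feeds, etc.
def collect_twitter() -> List[str]:
    return ["love the new phone", "battery died again, terrible"]

def collect_rss() -> List[str]:
    return ["review: the new phone is great"]

def classify(text: str) -> str:
    """Stub keyword rule standing in for the NLP/classifier stage."""
    negative = ("terrible", "died", "bad")
    return "negative" if any(word in text for word in negative) else "positive"

def run_batch(collectors: Dict[str, Callable[[], List[str]]]):
    content_store: Dict[int, dict] = {}                        # stands in for the content store
    metadata_index: Dict[str, List[int]] = defaultdict(list)   # label -> document ids
    doc_id = 0
    # Step 1: collect everything first, so the whole batch lands in storage.
    for source, collect in collectors.items():
        for text in collect():
            content_store[doc_id] = {"source": source, "text": text}
            doc_id += 1
    # Step 2: process the stored batch in one pass and build the index.
    for key, doc in content_store.items():
        metadata_index[classify(doc["text"])].append(key)
    # Step 3: summarize for a dashboard.
    summary = {label: len(ids) for label, ids in metadata_index.items()}
    return content_store, metadata_index, summary

store, index, summary = run_batch({"twitter": collect_twitter, "rss": collect_rss})
print(summary)
```

The point of the sketch is the access pattern: data is written in bulk, then read back in bulk for processing, which favors sequential bandwidth over request rate.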


Cruxly: Stream Processing

• Source: Twitter

• Streaming API Clients (Heroku Worker, 24x7): Request (Keywords), Tweets (Keywords)

• NLP (NER, etc.) + Cruxly Intent Detection (AWS)

• Tweet ID + Intent Signal (Heroku PostgreSQL)

• Tweets Content Store (DynamoDB)

• Reports / Dashboard

• Tracker Editor (web app, Heroku)
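A stream-processing flow of this shape can be sketched as follows, under stated assumptions: a Python generator stands in for the streaming API client, two dictionaries stand in for the tweet content store and the tweet ID + intent table, and `detect_intent` is a hypothetical rule-based stub, not Cruxly's intent detection.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Tweet = Tuple[int, str]  # (tweet id, text)

def keyword_stream(tweets: Iterable[Tweet], keywords: Iterable[str]) -> Iterator[Tweet]:
    """Yield only tweets mentioning a tracked keyword, one at a time."""
    tracked = [k.lower() for k in keywords]
    for tweet_id, text in tweets:
        if any(k in text.lower() for k in tracked):
            yield tweet_id, text

def detect_intent(text: str) -> Optional[str]:
    """Hypothetical rule-based stub standing in for the intent-detection service."""
    lowered = text.lower()
    if "want to buy" in lowered or "where can i get" in lowered:
        return "buy"
    if "help" in lowered or "broken" in lowered:
        return "support"
    return None

content_store: Dict[int, str] = {}   # stands in for the tweet content store
intent_store: Dict[int, str] = {}    # stands in for the tweet ID + intent table

def process(stream: Iterator[Tweet]) -> None:
    # Each tweet is handled as it arrives: small writes, no waiting for a batch.
    for tweet_id, text in stream:
        content_store[tweet_id] = text
        intent = detect_intent(text)
        if intent is not None:
            intent_store[tweet_id] = intent

# Made-up tweets; "acme" is a hypothetical tracked keyword.
incoming = [
    (1, "I want to buy the Acme phone"),
    (2, "Nice weather today"),
    (3, "My Acme phone is broken, help!"),
    (4, "Acme announces earnings"),
]
process(keyword_stream(incoming, ["acme"]))
print(intent_store)
```

Here every record triggers its own small writes, so the storage load is many small operations rather than a few large transfers.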


Data Analytics Demands . . .

• Stages: Store, Process, Analyze, View

• Data Collector: Text / Sensor Data / Stream . . .

• NLP, Classify, Index

• Query / RT Query: Ad Hoc / Search / SQL

• Custom Analytics

• Dashboards, Chart, Report

• Tools: Machine Learning Library, Stats Library, R, Storm, Yarn

Storage Implications: Back to the Future

MB/s – Batch

IOPs – Stream

Both?
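The two metrics are linked by request size: bandwidth equals request rate times the size of each request. A small worked example, with illustrative numbers that are assumptions rather than measurements from the talk:

```python
def throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Bandwidth in MB/s for a given request rate and request size (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024.0

def required_iops(throughput_mb: float, io_size_kb: float) -> float:
    """Request rate needed to sustain a bandwidth at a given request size."""
    return throughput_mb * 1024.0 / io_size_kb

# Illustrative workloads only.
batch_mb_s = throughput_mb_s(iops=200, io_size_kb=1024)    # large sequential reads
stream_mb_s = throughput_mb_s(iops=20000, io_size_kb=4)    # small per-record writes
print(batch_mb_s, stream_mb_s, required_iops(200, 4))
```

In this example the batch workload moves more data with 100 times fewer requests, while matching its bandwidth with 4 KB requests would take 51,200 IOPS. A system that must serve both patterns has to be sized on both axes.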


Storage Implications: Back to the Future II, III

• Nodes: Master, Slave #1 . . . Slave #N, Mgmt Node

• MapReduce: Job Tracker, Task trackers

• HDFS: Name Node, Data Nodes, HDFS client

• Zookeeper, Hive, Pig, Oozie, HUE

Storage Capacity Scaling?

Storage Tiering?

Import/Export Data?

A More General Data Analytics Framework?

• Data Ingesters (Basic), Data Ingesters (Smart)

• Content Store, Metadata / In-Mem Store

• Processing: Stream and Batch

• Analytics Processing

• Sensor Processing: Data Integration

• Visualization Library / Interactive Query

• Local Storage / Flash / DAS, SAN

• MapReduce / Distributed Data Store

Conclusion

• Data Analytics → Big Data → Scale-Out

• Variety → Infrastructure

• Volume → Bandwidth Support

• Velocity → Streaming Support

• We Solved the Processing Problem

• We Need to Solve the Larger Storage Problem


Grateful Acknowledgements

• Kapil Tundwal

• Dr. Kirill Kireyev

• Dr. Andrew Lampert

• Venky Madireddy

• Dr. Shumin Wu

• Joan Wrabetz
