© Hortonworks Inc. 2013 Hortonworks Data Science with Hadoop – A Primer Hadoop Summit, June 2013 Ofer Mendelevitch [email protected] @ofermend

Data Science with Hadoop - A primer

My talk from Hadoop Summit, June 2013.

Page 1: Data Science with Hadoop - A primer

Hortonworks
Data Science with Hadoop – A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch, [email protected], @ofermend

Page 2: Data Science with Hadoop - A primer

Who am I?

currently <- c(role = "director of data sciences", company = "Hortonworks")

• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…

• Blog: www.achessdad.com

Page 3: Data Science with Hadoop - A primer

What will I be talking about?

• What is Data Science?
• Hadoop and Data Science
• Use-cases: data science with Hadoop
• How to get started?

Page 4: Data Science with Hadoop - A primer

What is Data Science?

What is Data Science? The art of building data products.

Data Product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.

What is a data scientist? A person who does this.

Page 5: Data Science with Hadoop - A primer

Data science & big data

Page 6: Data Science with Hadoop - A primer

With Hadoop…

The time and cost of building large-scale data products are dramatically reduced

Page 7: Data Science with Hadoop - A primer

An Apache Hadoop Platform: Hortonworks Data Platform (HDP)

• Hadoop Core: distributed storage & processing
  – HDFS
  – MapReduce
• Platform Services: enterprise readiness
  – HA, DR, snapshots, security, …
• Data Services: store, process and access data
  – HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services: manage & operate at scale
  – Oozie, Ambari
• Runs on: OS / VM, cloud, appliance

Page 8: Data Science with Hadoop - A primer

A typical Big Data Architecture

• Data sources
  – Traditional sources (RDBMS, OLTP, OLAP)
  – New sources (web logs, email, sensor data, social media)
  – Mobile data
  – OLTP, POS systems
• Data systems
  – Traditional repositories: RDBMS, EDW, MPP
  – Hortonworks Data Platform
• Applications
  – Business analytics
  – Custom applications
  – Packaged applications
• Operational tools: manage & monitor
• Dev & data tools: build & test

Page 9: Data Science with Hadoop - A primer

Keys to Hadoop’s power

• Computation co-located with data
  – Data and computation system co-designed to work together
• Affordable at scale
  – Use “commodity” hardware nodes
  – Self-healing; failure handled by software
  – Very good at batch processing of large datasets

Page 10: Data Science with Hadoop - A primer

Hadoop improves productivity of data scientists

• All data in one place
  – Ability to store all the data in raw format
  – Data silo convergence
  – Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as shared asset
  – Data scientists can quickly prototype a new idea without an up-front request for funding

Page 11: Data Science with Hadoop - A primer

Data-driven innovation is accelerated since Hadoop is “schema on read”

• Traditional approach: “I need new data” (start) → “Schema change” project → “Finally, we start collecting” (6 months) → “Let me see… is it any good?” (9 months)
• With Hadoop: “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!” (3 months)
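
A minimal sketch of what “schema on read” means in practice, assuming raw events are stored as one JSON object per line. The record layout, the field names and the `read_with_schema` helper are hypothetical and only illustrate the idea: the raw data is stored untouched, and each reader decides which fields it needs at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON object per line).
# The second record carries a field ("device") that did not exist when collection started.
RAW_LINES = [
    '{"user": "U101", "page": "/home"}',
    '{"user": "U102", "page": "/cart", "device": "mobile"}',
]


def read_with_schema(lines, fields, default=None):
    """Apply a schema at read time: project each raw record onto `fields`."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name in fields}


# Two different "schemas" over the same untouched raw data.
page_views = list(read_with_schema(RAW_LINES, ["user", "page"]))
by_device = list(read_with_schema(RAW_LINES, ["user", "device"], default="unknown"))
```

Adding the new `device` field required no change to the stored data or to the older reader, which is the contrast the timeline draws with a “schema change” project.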

Page 12: Data Science with Hadoop - A primer

Hadoop is ideal for pre-processing of large raw datasets

Raw Data → Processed Data

• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Term normalization
• Document vector generation
• Sampling, filtering
• Joins
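
Three of these steps (stripping markup, term normalization and document vector generation) can be sketched on a single document as follows. This is an illustrative, single-machine sketch using only the Python standard library; the sample document and the function names are assumptions, and on Hadoop the same logic would typically run inside a map task over many documents.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Drop the tags and keep the text between them."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(html):
    """Term-count vector for one document."""
    return Counter(normalize_terms(strip_html(html)))


# Hypothetical service record.
vector = document_vector(
    "<html><body><h1>Pump failure</h1><p>The PUMP failed twice.</p></body></html>"
)
```

Here "Pump" and "PUMP" collapse into the same term, so the vector counts "pump" twice.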

Page 13: Data Science with Hadoop - A primer

In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)

• More examples to learn from
• More possible feature types
  – We’re looking for the most useful for our task

Page 14: Data Science with Hadoop - A primer

Use-cases

Page 15: Data Science with Hadoop - A primer

A (partial) map of data science “tasks”

• Discovery
  – Clustering: detect natural groupings
  – Outlier detection: detect anomalies
  – Affinity analysis: co-occurrence patterns
• Prediction
  – Classification: predict a category
  – Regression: predict a value
  – Recommendation: predict a preference
• Big Data Science: high energy physics, genomics, etc.

Page 16: Data Science with Hadoop - A primer

Use-case: product recommendation

• Inputs:
  – Explicit product ratings (when provided)
  – Implicit information: purchase transactions, page views, comments
• Data sources: ratings, page views, forum comments

Ratings matrix (rows are users: U101, U102, …; “?” marks an unknown rating):

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5

Page 17: Data Science with Hadoop - A primer

Goal: predict a preference

Known ratings (rows are users: U101, U102, …; “?” marks an unknown rating):

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5

Completed ratings, with every “?” replaced by a predicted preference:

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
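
One way to fill in the unknown ratings is user-based collaborative filtering. The sketch below is an assumption, not the method used in the talk (the slides do not name an algorithm): it predicts each missing rating as a similarity-weighted average of the ratings other users gave to that movie. The `cosine`, `predict` and `complete` helpers are hypothetical names, and the predictions it produces are not expected to match the completed matrix on the slide.

```python
from math import sqrt

MOVIES = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]

# Known ratings from the slide; None marks an unknown rating ("?").
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def cosine(u, v):
    """Cosine similarity over the items both users rated (0.0 if there are none)."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] is None:
            continue
        weight = cosine(ratings[user], row)
        num += weight * row[item]
        den += weight
    if den == 0.0:
        # No comparable user rated this item: fall back to the user's own mean rating.
        known = [r for r in ratings[user] if r is not None]
        return sum(known) / len(known)
    return num / den


def complete(ratings):
    """Return a copy of the matrix with every unknown rating predicted."""
    filled = []
    for user, row in enumerate(ratings):
        filled.append([
            r if r is not None else round(predict(ratings, user, item), 2)
            for item, r in enumerate(row)
        ])
    return filled


completed = complete(RATINGS)
```

Known ratings are copied through unchanged; only the `None` cells are replaced.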

Page 18: Data Science with Hadoop - A primer

Using Hadoop for recommendation

• Data sources: transactions, page views, content
• HDFS + MapReduce: pre-process (SQL, custom logic)
• Recommend
• Online serving

With Hadoop, we can process very large preference datasets
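
To give a feel for the MapReduce pre-processing step, here is a single-process sketch of counting how often two products appear in the same transaction, a common input to recommenders. The transaction data and the `map_phase`, `shuffle` and `reduce_phase` functions are hypothetical; they only imitate, in plain Python, the map, shuffle and reduce stages that Hadoop would run in parallel across a cluster.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase transactions: one basket of product names each.
TRANSACTIONS = [
    ["Hobbit", "X-Men", "Epic"],
    ["Hobbit", "Argo"],
    ["Hobbit", "X-Men"],
]


def map_phase(basket):
    """Emit ((item_a, item_b), 1) for every unordered pair in one basket."""
    for pair in combinations(sorted(set(basket)), 2):
        yield pair, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one item pair."""
    return key, sum(values)


emitted = [kv for basket in TRANSACTIONS for kv in map_phase(basket)]
cooccurrence = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
```

Because each basket is mapped independently and each pair is reduced independently, the same logic scales out to preference datasets that do not fit on one machine.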

Page 19: Data Science with Hadoop - A primer

Use-case: failure prediction

• Inputs:
  – Equipment history: install date, model, past issues
  – Equipment sensor data
  – Product catalog: product families, expected lifetime
• Data sources: equipment history, sensor data, product catalog

SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …

Page 20: Data Science with Hadoop - A primer

Building a prediction model

Labeled data (used to build the model):

SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …

Unseen data (the model predicts TTF):

SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67

Labeled Data → Model; Unseen data → Model → TTF
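
The train-then-predict flow can be sketched with a one-feature least-squares line. This is only an illustration under strong assumptions: the slides do not name a modeling algorithm, TTF is presumably time to failure, average temperature is picked arbitrarily as the single feature, and three labeled rows are far too few for a meaningful model. `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Labeled rows from the slide: (avg temp, TTF in days).
labeled = [(72, 180), (68, 450), (82, 332)]
intercept, slope = fit_line([t for t, _ in labeled], [ttf for _, ttf in labeled])

# Unseen rows from the slide carry no TTF; the fitted model supplies an estimate.
unseen_temps = [71, 67]
predicted_ttf = [intercept + slope * t for t in unseen_temps]
```

A real failure-prediction model would use many more rows and features (SKU family, install age, sensor readings) and would be validated on held-out labeled data before being applied to unseen equipment.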

Page 21: Data Science with Hadoop - A primer

Using Hadoop for failure prediction

• HDFS: central repository for all data
  – Service records (Word, PDF, etc.)
  – Equipment purchase transaction data
  – Product catalog: SKUs, model numbers, etc.
• Pre-process
  – Convert service records to item features: remove PDF formatting, detect entities in records
  – Normalize data using service records, product catalog
  – Create feature matrix; ready for modeling algorithm

Page 22: Data Science with Hadoop - A primer

Use-case: SaaS application security

• Inputs:
  – Click-stream: user interaction with application
• Data sources: user data, clicks

User ID  User since  Logins/month  Avg DL KB/day
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …

Page 23: Data Science with Hadoop - A primer

Detecting anomalous behavior records

• User access profile modeled as vector of features
• Detect anomalies in application access patterns
  – Rules based
  – Machine learning based (determine “outlier factor”: 0…1)
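
A minimal sketch of an outlier factor between 0 and 1, assuming each user profile is a vector of numeric features such as (logins/month, avg download KB/day). The scoring rule here is an assumption chosen for brevity, not the method from the talk: it measures how far a profile lies from the average profile in standard-deviation units and squashes that distance into the range 0 to 1. The sample profiles are made up.

```python
from math import sqrt
from statistics import mean, pstdev


def outlier_factors(profiles):
    """Score each profile in [0, 1): higher means further from the average profile."""
    columns = list(zip(*profiles))
    mus = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    factors = []
    for profile in profiles:
        distance = sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(profile, mus, sigmas)))
        factors.append(distance / (1.0 + distance))
    return factors


# Hypothetical profiles: (logins per month, average download KB per day).
profiles = [(5, 10), (6, 11), (5, 9), (6, 10), (50, 500)]
factors = outlier_factors(profiles)
```

The last profile, with far more logins and downloads than the rest, receives the highest factor. A rules-based detector would instead apply fixed thresholds to the same feature vector.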

Page 24: Data Science with Hadoop - A primer

Using Hadoop for anomaly detection

• HDFS: central repository for all raw data
  – Raw user-access logs
  – User information (organization, demographics)
• Pre-process
  – Build access-profile (behavioral) for each user
• Detect anomalies
  – In Hadoop
  – Using existing tools: R, SAS, rules engine, etc.
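
The pre-processing step, turning raw access logs into one behavioral profile per user, can be sketched as a group-by over log lines. The log format (`user_id,date,action,kilobytes`), the sample lines and the `build_profiles` helper are all hypothetical; on a cluster the same aggregation would run as a MapReduce, Pig or Hive job keyed by user ID.

```python
from collections import defaultdict

# Hypothetical raw access log: user_id,date,action,kilobytes
RAW_LOG = """\
123456,2013-06-01,login,0
123456,2013-06-01,download,40
123456,2013-06-02,download,20
998323,2013-06-03,login,0
998323,2013-06-03,download,5
"""


def build_profiles(raw_log):
    """Aggregate raw log lines into one feature dictionary per user."""
    logins = defaultdict(int)
    downloaded_kb = defaultdict(float)
    active_days = defaultdict(set)
    for line in raw_log.splitlines():
        if not line.strip():
            continue
        user, date, action, kilobytes = line.split(",")
        active_days[user].add(date)
        if action == "login":
            logins[user] += 1
        elif action == "download":
            downloaded_kb[user] += float(kilobytes)
    return {
        user: {
            "logins": logins[user],
            "avg_dl_kb_per_day": downloaded_kb[user] / len(days),
        }
        for user, days in active_days.items()
    }


access_profiles = build_profiles(RAW_LOG)
```

The resulting per-user feature values are what an anomaly detector, whether in Hadoop or in an existing tool, would consume.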

Page 25: Data Science with Hadoop - A primer

How do I get started?

Page 26: Data Science with Hadoop - A primer

Getting started with data science on Hadoop

1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)

Page 27: Data Science with Hadoop - A primer

Implement a proof-of-value

• Put together a Hadoop cluster
• Define the POV business use-case
• Pull raw data you need into the cluster
• Build it
• Show the business value of your data assets

Contact us. We can help!

Page 28: Data Science with Hadoop - A primer

Build a team: the data scientist skillset continuum

Software engineer → Data engineer → Data scientist → Applied scientist → Research scientist

Data Engineer
• Function: builds production-grade data products
• Good at:
  – Data and systems architecture
  – Hadoop, Pig/Hive, MapReduce, Mahout
  – Java, Python, Perl, SQL, C++, etc.
  – NoSQL (HBase, Cassandra, Mongo)

Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
  – Statistics, machine learning
  – Text processing, NLP
  – R, Matlab, SAS, SQL
  – Scripting, prototyping
  – Visualization / telling the story

Page 29: Data Science with Hadoop - A primer

Thank you!

Any Questions?

Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
[email protected], @ofermend

We’re hiring!

Data Science training: www.hortonworks.com/training