Data Science with Hadoop - A primer

© Hortonworks Inc. 2013

Hortonworks
Data Science with Hadoop - A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
[email protected]
@ofermend


Who am I?

currently <- c(role="director of data sciences", company="Hortonworks")

• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…

• Blog: www.achessdad.com



What will I be talking about?

• What is Data Science?

• Hadoop and Data Science

• Use-cases: data science with Hadoop

• How to get started?


What is Data Science?

What is a data scientist?
A person who does this

Data Product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.

What is Data Science?
The art of building data products


Data science & big data


With Hadoop…

Time and cost of building large scale data products is dramatically reduced


An Apache Hadoop Platform

Hortonworks Data Platform (HDP)

• Hadoop Core: distributed storage & processing
  – HDFS
  – MapReduce
• Platform Services: enterprise readiness
  – HA, DR, snapshots, security, …
• Data Services: store, process and access data
  – HCatalog
  – Hive, Pig, HBase
  – Sqoop
  – Flume
• Operational Services: manage & operate at scale
  – Oozie
  – Ambari
• Deployment options: OS / VM, cloud, appliance


A typical Big Data Architecture

• Applications: business analytics, custom applications, packaged applications
• Data systems: traditional repositories (RDBMS, EDW, MPP) and the Hortonworks Data Platform
• Data sources:
  – Traditional sources (RDBMS, OLTP, OLAP)
  – New sources (web logs, email, sensor data, social media)
  – Mobile data; OLTP, POS systems
• Dev & data tools: build & test
• Operational tools: manage & monitor


Keys to Hadoop’s power

• Computation co-located with data
  – Data and computation system co-designed to work together

• Affordable at scale
  – Use “commodity” hardware nodes
  – Self-healing; failure handled by software
  – Very good at batch processing of large datasets


Hadoop improves productivity of data scientists

• All data in one place
  – Ability to store all the data in raw format
  – Data silo convergence
  – Data scientists will find innovative uses of combined data assets

• Data/compute capabilities available as shared asset
  – Data scientists can quickly prototype a new idea without an up-front request for funding


Data-driven innovation is accelerated since Hadoop is “schema on read”

Traditional approach (“schema change” project):
• Start: “I need new data”
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”

With Hadoop:
• Start: “Let’s just put it in a folder on HDFS”
• “Let me see… is it any good?”
• 3 months: “My model is awesome!”
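
To make “schema on read” concrete, here is a minimal Python sketch (plain Python, not Hadoop code). Raw records are stored untouched, and a schema is applied only when the data is read. The field names (`user`, `page`, `ms`), the sample lines, and the `read_with_schema` helper are all assumptions made for illustration.

```python
import json

# Raw events as they might land in a folder on HDFS: no schema is enforced
# at write time, so records may carry different fields (hypothetical data).
RAW_LINES = [
    '{"user": "U101", "page": "/home"}',
    '{"user": "U102", "page": "/cart", "ms": "245"}',
    'not valid json',
]

# Hypothetical schema chosen at read time: field name -> (type, default).
SCHEMA = {"user": (str, None), "page": (str, None), "ms": (int, 0)}


def read_with_schema(lines, schema):
    """Parse raw lines and project each record onto the schema."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of failing the whole read
        if not isinstance(record, dict):
            continue
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

Adding a field later only means changing `SCHEMA`; the raw data already stored does not need to be rewritten.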


Hadoop is ideal for pre-processing of large raw datasets

Raw Data -> Processed Data

• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Term normalization
• Sampling, filtering
• Joins
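
As a rough illustration of three of these steps (stripping markup, term normalization, document vector generation), here is a small single-machine Python sketch. On a cluster the same per-document function would run in parallel over many files. The regular expressions, function names, and sample document are assumptions, and the tag stripping is deliberately naive.

```python
import re
from collections import Counter
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")       # naive tag matcher, fine for a sketch
TOKEN_RE = re.compile(r"[a-z0-9]+")   # alphanumeric tokens only


def strip_html(text):
    """Remove tags, then decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", text))


def normalize_terms(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def document_vector(text):
    """Return a bag-of-words term-count vector for one raw document."""
    return Counter(normalize_terms(strip_html(text)))


# Hypothetical raw document.
raw = "<p>Pump <b>failure</b> &amp; pump repair</p>"
vector = document_vector(raw)
```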


In machine learning, very often: more data -> better outcomes

Banko & Brill, 2001

• More examples to learn from

• More possible feature types
  – We’re looking for the most useful for our task


Use-cases


A (partial) map of data science “tasks”

Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns

Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference

Big Data Science: high energy physics, genomics, etc.


Use-case: product recommendation

• Inputs:
  – Explicit product ratings (when provided)
  – Implicit information: purchase transactions, page views, comments

Data sources: ratings, page views, forum comments

Ratings matrix (rows: users U101, U102, U103, U104, U105, …; columns: movies; “?” = unknown rating):

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5


Goal: predict a preference

Known ratings (rows: users U101, U102, U103, U104, U105, …; “?” = unknown rating):

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5

Predicted ratings (same users and movies, unknowns filled in):

Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
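
The slides do not name a specific algorithm for filling in the unknown ratings. As one possible approach, the sketch below uses item-based collaborative filtering with cosine similarity, in plain Python, on the known ratings from the slide. The function names are illustrative, and the predictions it produces are not expected to match the completed matrix shown on the slide.

```python
from math import sqrt

ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]
# Known ratings from the slide; None marks an unknown rating ("?").
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def item_similarity(ratings, a, b):
    """Cosine similarity of items a and b over users who rated both."""
    pairs = [(r[a], r[b]) for r in ratings
             if r[a] is not None and r[b] is not None]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    norm_a = sqrt(sum(x * x for x, _ in pairs))
    norm_b = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Similarity-weighted average of the user's known ratings."""
    weighted, total = 0.0, 0.0
    for other, rating in enumerate(ratings[user]):
        if other == item or rating is None:
            continue
        sim = item_similarity(ratings, item, other)
        weighted += sim * rating
        total += sim
    return weighted / total if total else None


def fill_unknowns(ratings):
    """Return a copy of the matrix with every unknown rating predicted."""
    return [
        [r if r is not None else predict(ratings, u, i)
         for i, r in enumerate(row)]
        for u, row in enumerate(ratings)
    ]


completed = fill_unknowns(RATINGS)
```

With so few overlapping ratings the similarities are crude; the point is only the shape of the computation, which is what has to scale to very large preference datasets.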


Using Hadoop for recommendation

Pipeline components:
• Data sources: transactions, page views, content
• HDFS and MapReduce: pre-process, custom logic, recommend
• SQL: online serving

With Hadoop, we can process very large preference datasets
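
One way to picture the MapReduce part of such a pipeline is item co-occurrence counting, a common building block for recommenders. The sketch below imitates the map and reduce phases in plain Python on one machine. It is an assumption about what the pre-processing could look like, not the method from the talk, and the transactions are made up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, item) purchase transactions.
TRANSACTIONS = [
    ("U101", "Epic"), ("U101", "Hobbit"),
    ("U102", "Hobbit"), ("U102", "Argo"), ("U102", "Epic"),
    ("U103", "Epic"), ("U103", "Hobbit"),
]


def map_phase(baskets):
    """Emit ((item_a, item_b), 1) for every item pair in a user's basket."""
    for items in baskets.values():
        for pair in combinations(sorted(set(items)), 2):
            yield pair, 1


def reduce_phase(pairs):
    """Sum the counts emitted for each item pair."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Group transactions by user, mimicking what a first grouping pass
# (or the framework's shuffle) would do on a cluster.
baskets = defaultdict(list)
for user, item in TRANSACTIONS:
    baskets[user].append(item)

cooccurrence = reduce_phase(map_phase(baskets))
```

Pairs with high counts (here `("Epic", "Hobbit")`) are candidates for "people who bought this also bought that" recommendations.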


Use-case: failure prediction

• Inputs:
  – Equipment history: install date, model, past issues
  – Equipment sensor data
  – Product catalog: product families, expected lifetime

Data sources: history, sensor data, product catalog

SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …


Building a prediction model

Labeled data (used to build the model):

SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …

Unseen data (the model predicts TTF):

SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67

Flow: labeled data -> model; unseen data -> model -> TTF
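
To illustrate the labeled-data -> model -> prediction flow, here is a toy Python sketch that fits a one-feature least-squares line (TTF as a function of average temperature) to the three labeled rows on the slide and applies it to the two unseen rows. The choice of linear regression and of a single feature is an assumption made for brevity. A real model would use many more features and records, and three rows are far too few to trust the numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Labeled data from the slide: (average temperature, TTF in days).
LABELED = [(72, 180), (68, 450), (82, 332)]
# Unseen equipment from the slide: SKU -> average temperature.
UNSEEN = {"332456": 71, "442343": 67}

slope, intercept = fit_line([t for t, _ in LABELED], [d for _, d in LABELED])

# Predicted time to failure (days) for each unseen SKU.
predicted_ttf = {sku: slope * temp + intercept for sku, temp in UNSEEN.items()}
```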


Using Hadoop for failure prediction

• HDFS: central repository for all data
  – Service records (Word, PDF, etc.)
  – Equipment purchase transaction data
  – Product catalog: SKUs, model numbers, etc.

• Pre-process
  – Convert service records to item features: remove PDF formatting, detect entities in records
  – Normalize data using service records, product catalog
  – Create feature matrix, ready for modeling algorithm


Use-case: SaaS application security

• Inputs:
  – Click-stream: user interaction with application

Data sources: user data, clicks

User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …


Detecting anomalous behavior records

• User access profile modeled as vector of features
• Detect anomalies in application access patterns
  – Rules based
  – Machine learning based (determine “outlier factor”: 0…1)
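
The slides do not say how the “outlier factor” is computed. One simple possibility, sketched below in plain Python, is to measure how far each user's feature vector lies from the average profile (in standard deviations) and scale the result so the most unusual user gets 1.0. The profiles and the scoring rule are assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical access profiles: user ID -> (logins/month, avg DL KB/day).
PROFILES = {
    "u1": (6, 30), "u2": (5, 28), "u3": (7, 33),
    "u4": (6, 31), "u5": (22, 120),
}


def outlier_factors(profiles):
    """Score each user in [0, 1]; 1.0 marks the most unusual profile."""
    columns = list(zip(*profiles.values()))
    # Per-feature mean and standard deviation (1.0 if a feature is constant).
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    # Distance from the average profile, measured in standard deviations.
    distance = {
        user: sqrt(sum(((x - m) / s) ** 2 for x, (m, s) in zip(vec, stats)))
        for user, vec in profiles.items()
    }
    largest = max(distance.values()) or 1.0
    return {user: d / largest for user, d in distance.items()}


factors = outlier_factors(PROFILES)
most_suspicious = max(factors, key=factors.get)
```

A rules-based check could then be as simple as flagging every user whose factor exceeds a chosen threshold.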


Using Hadoop for anomaly detection

• HDFS: central repository for all raw data
  – Raw user-access logs
  – User information (organization, demographics)

• Pre-process
  – Build access-profile (behavioral) for each user

• Detect anomalies
  – In Hadoop
  – Using existing tools: R, SAS, rules engine, etc.
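
The pre-process step, building a behavioral access profile per user from raw logs, is essentially a group-by-user aggregation. The Python sketch below shows the idea on a handful of lines. The log format (`user_id,date,action,kilobytes`), the sample records, and the chosen features are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw access-log lines: "user_id,date,action,kilobytes".
RAW_LOG = [
    "123456,2013-06-01,login,0",
    "123456,2013-06-01,download,40",
    "123456,2013-06-02,download,20",
    "998323,2013-06-01,login,0",
    "998323,2013-06-03,login,0",
]


def build_profiles(lines):
    """Aggregate raw log lines into one feature vector per user."""
    logins = defaultdict(int)
    kilobytes = defaultdict(int)
    days = defaultdict(set)
    for line in lines:
        user, date, action, size = line.split(",")
        days[user].add(date)
        if action == "login":
            logins[user] += 1
        elif action == "download":
            kilobytes[user] += int(size)
    return {
        user: {
            "logins": logins[user],
            "active_days": len(days[user]),
            # Average download volume per day on which the user was active.
            "avg_dl_kb_per_day": kilobytes[user] / len(days[user]),
        }
        for user in days
    }


profiles = build_profiles(RAW_LOG)
```

The resulting per-user vectors are what an anomaly detector, whether run in Hadoop or in an external tool, would consume.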


How do I get started?


Getting started with Data Science on Hadoop

1. Pick a good use-case that delivers immediate business value

2. Implement a proof-of-value (POV)

3. Build a team (hire/train)


Implement a proof-of-value

• Put together a Hadoop cluster
• Define the POV business use-case
• Pull the raw data you need into the cluster
• Build it
• Show the business value of your data assets

Contact us. We can help!


Build a team: the data scientist skillset continuum

Continuum: Software engineer, Data Engineer, Data Scientist, Applied Scientist, Research Scientist

Role: Data Engineer
• Function: builds production-grade data products
• Good at:
  – Data and systems architecture
  – Hadoop, Pig/Hive, MapReduce, Mahout
  – Java, Python, Perl, SQL, C++, etc.
  – NoSQL (HBase, Cassandra, Mongo)

Role: Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
  – Statistics, machine learning
  – Text processing, NLP
  – R, Matlab, SAS, SQL
  – Scripting, prototyping
  – Visualization / telling the story


Thank you!

Any Questions?

Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
[email protected]
@ofermend

We’re hiring!

Data Science training: www.hortonworks.com/training
