My talk from Hadoop Summit, June 2013.
© Hortonworks Inc. 2013
Hortonworks
Data Science with Hadoop – A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
[email protected]
@ofermend
Who am I?
currently <- c(role="director of data sciences", company="Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…
• Blog: www.achessdad.com
What will I be talking about?
• What is Data Science?
• Hadoop and Data Science
• Use-cases: data science with Hadoop
• How to get started?
What is Data Science?
• Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.
• What is Data Science? The art of building data products.
• What is a data scientist? A person who does this.
Data science & big data
With Hadoop…
The time and cost of building large-scale data products are dramatically reduced.
An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
• Hadoop Core: distributed storage & processing
– HDFS
– MapReduce
• Platform Services: enterprise readiness
– HA, DR, snapshots, security, …
• Data Services: store, process and access data
– HCatalog, Hive, Pig, HBase
– Sqoop, Flume
• Operational Services: manage & operate at scale
– Oozie, Ambari
• Deployment options: OS / VM, cloud, appliance
A typical Big Data Architecture
• Applications: Business Analytics, Custom Applications, Packaged Applications
• Data systems: traditional repos (RDBMS, EDW, MPP) and the Hortonworks Data Platform
• Data sources
– Traditional sources (RDBMS, OLTP, OLAP)
– New sources (web logs, email, sensor data, social media)
– Mobile data; OLTP, POS systems
• Dev & data tools: build & test
• Operational tools: manage & monitor
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed to work together
• Affordable at scale
– Use “commodity” hardware nodes
– Self-healing; failure handled by software
– Very good at batch processing of large datasets
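The batch-processing model behind these points can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce phases. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are assumptions made for illustration. On a real cluster the framework would run the map step on the nodes that hold each block of data.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (word, 1) pairs for one input record (a line of text)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: each string stands in for one record in an HDFS block.
records = ["hadoop stores data", "hadoop processes data"]
counts = word_count(records)
print(counts)
```

Because each map call sees only its own record and each reduce call sees only one key, both phases can be spread across many commodity nodes and rerun in isolation when a node fails.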
Hadoop improves productivity of data scientists
• All data in one place
– Ability to store all the data in raw format
– Data silo convergence
– Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as shared asset
– Data scientists can quickly prototype a new idea without an up-front request for funding
Data-driven innovation is accelerated since Hadoop is “schema on read”
[Figure: two timelines. Axis markers: Start, 3 months, 6 months, 9 months.]
• With a “schema change” project: “I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”
• With Hadoop: “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
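A minimal sketch of what “schema on read” means in practice: the raw data is stored untouched, and a schema is applied only when someone reads it. The sample rows, the field names, and the helper `read_with_schema` are all hypothetical; an in-memory string stands in for a file in an HDFS folder.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no schema declared up front.
RAW = "u101,2013-06-01,page_view,/home\nu102,2013-06-01,purchase,sku-1\n"


def read_with_schema(raw_text, schema):
    """Apply a schema (list of (name, converter) pairs) while reading raw rows."""
    for row in csv.reader(io.StringIO(raw_text)):
        # zip stops at the shorter sequence, so unused trailing fields are ignored.
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


# First idea: only the user, date, and event type matter.
schema_v1 = [("user", str), ("date", str), ("event", str)]
# Later idea: the fourth field is needed too; the stored data does not change.
schema_v2 = schema_v1 + [("target", str)]

events_v1 = list(read_with_schema(RAW, schema_v1))
events_v2 = list(read_with_schema(RAW, schema_v2))
print(events_v2[1])
```

Changing the question only means changing the schema passed to the reader, which is why the second timeline has no “schema change” project in it.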
Hadoop is ideal for pre-processing of large raw datasets
Raw data → processed data, through steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
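Three of these steps (stripping markup, term normalization, and document vector generation) can be sketched for a single document as follows. This is an illustrative, single-machine sketch using only the Python standard library; the helper names and the sample document are assumptions, and the lowercase/alphanumeric tokenizer is a deliberately simple stand-in for real term normalization.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only text nodes, dropping the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of an HTML document."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw):
    """Map a raw HTML document to a sparse term-frequency vector."""
    return Counter(normalize_terms(strip_html(raw)))


# Hypothetical raw document.
raw_doc = "<html><body><h1>Pump failure</h1><p>The pump failed; PUMP replaced.</p></body></html>"
vector = document_vector(raw_doc)
print(vector.most_common(3))
```

Each document is processed independently of the others, so the same function could serve as the map step of a batch job over a large raw corpus.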
In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)
• More examples to learn from
• More possible feature types
– We’re looking for the most useful for our task
Use-cases
A (partial) map of data science “tasks”
Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
Big Data Science: high energy physics, genomics, etc.
Use-case: product recommendation
• Inputs:
– Explicit product ratings (when provided)
– Implicit information: purchase transactions, page views, comments
Data sources: ratings, page views, forum comments
Ratings matrix (rows: users U101, U102, U103, U104, U105, …; columns: movies; ? = not rated):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Goal: predict a preference
Observed ratings (rows: users U101, U102, U103, U104, U105, …; ? = not rated):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Predicted ratings (same users and movies, unknowns filled in):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
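One simple way to fill in the unknown ratings is user-based collaborative filtering: predict a missing rating as a similarity-weighted average of what other users gave the same movie. The sketch below is an assumption about method, not the approach used to produce the numbers on the slide, and it will not reproduce them. The rows are taken from the ratings matrix above; the mapping of rows to user IDs is not recoverable, so rows are indexed 0 to 3.

```python
import math

MOVIES = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]
# Rows from the slide's ratings matrix; None marks an unknown rating ("?").
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def cosine(u, v):
    """Cosine similarity over the items both users have rated (0.0 if none)."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not common:
        return 0.0
    dot = sum(a * b for a, b in common)
    norm_u = math.sqrt(sum(a * a for a, _ in common))
    norm_v = math.sqrt(sum(b * b for _, b in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] is None:
            continue
        sim = cosine(ratings[user], row)
        num += sim * row[item]
        den += sim
    return num / den if den else None


# Keep known ratings; replace each unknown with a prediction.
filled = [
    [r if r is not None else predict(RATINGS, u, i) for i, r in enumerate(row)]
    for u, row in enumerate(RATINGS)
]
for row in filled:
    print([round(r, 2) for r in row])
```

With only four users the similarities rest on one or two shared movies, so the output is purely illustrative; the value of a cluster is that the same pairwise computation can be run over very large preference datasets.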
Using Hadoop for recommendation
[Diagram labels: data sources (transactions, page views, content), HDFS, Map Reduce, SQL, custom logic, pre-process, recommend, online serving]
With Hadoop, we can process very large preference datasets
Use-case: failure prediction
• Inputs:
– Equipment history: install date, model, past issues
– Equipment sensor data
– Product catalog: product families, expected lifetime
Data sources: history, sensor data, product catalog
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Building a prediction model
Labeled data (used to build the model):
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Unseen data (the model predicts TTF):
SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
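The labeled-data/unseen-data flow can be sketched with the simplest possible model: a least-squares line that predicts TTF from average temperature alone. The choice of model and of a single feature is an assumption made for illustration. Three labeled rows are far too few to learn anything meaningful, and a real model would draw on the other columns as well.

```python
# Labeled rows from the slide: (avg temp, TTF in days). Other columns are ignored here.
labeled = [(72, 180), (68, 450), (82, 332)]
# Unseen rows from the slide: avg temp only.
unseen_temps = [71, 67]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# "Model" step: learn the parameters from labeled data.
slope, intercept = fit_line(labeled)
# "Unseen data" step: apply the learned parameters to rows without a TTF value.
predicted_ttf = [slope * t + intercept for t in unseen_temps]
print([round(p) for p in predicted_ttf])
```

The shape is the same whatever the algorithm: fit on rows where TTF is known, then apply the fitted model to rows where it is not.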
Using Hadoop for failure prediction
• HDFS: central repository for all data
– Service records (Word, PDF, etc.)
– Equipment purchase transaction data
– Product catalog: SKUs, model numbers, etc.
• Pre-process
– Convert service records to item features: remove PDF formatting, detect entities in records
– Normalize data using service records, product catalog
– Create feature matrix; ready for modeling algorithm
Use-case: SaaS application security
• Inputs:
– Click-stream: user interaction with application
Data sources: user data, clicks
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …
Detecting anomalous behavior records
• User access profile modeled as vector of features
• Detect anomalies in application access patterns
– Rules based
– Machine learning based (determine “outlier factor”: 0…1)
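As a rough sketch of an “outlier factor” in the 0…1 range, each user's feature vector can be standardized and scored by its distance from the average profile, scaled so the most unusual user scores 1. The scoring rule is an assumption chosen for simplicity, not the method from the talk; the rows are the three users from the user table, using only the logins and download columns.

```python
import math

# Rows from the slide's user table: (user ID, logins/month, avg download KB/day).
users = [("123456", 6, 30), ("998323", 1, 5), ("345375", 22, 120)]


def zscores(values):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def outlier_factors(rows):
    """Distance from the average profile in z-score space, scaled to 0…1."""
    columns = list(zip(*[row[1:] for row in rows]))
    scaled = list(zip(*[zscores(col) for col in columns]))
    distances = [math.sqrt(sum(z * z for z in vec)) for vec in scaled]
    top = max(distances)
    return {row[0]: (d / top if top else 0.0) for row, d in zip(rows, distances)}


factors = outlier_factors(users)
for user_id, factor in sorted(factors.items(), key=lambda kv: -kv[1]):
    print(user_id, round(factor, 2))
```

A rules-based check could then be as simple as flagging any user whose factor exceeds a chosen threshold.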
Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
– Raw user-access logs
– User information (organization, demographics)
• Pre-process
– Build access-profile (behavioral) for each user
• Detect anomalies
– In Hadoop
– Using existing tools: R, SAS, rules engine, etc.
How do I get started?
Getting started with data science on Hadoop
1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)
Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!
Build a team: the data scientist skillset continuum
Roles shown on the continuum: Software Engineer, Research Scientist, Data Engineer, Data Scientist, Applied Scientist
Role: Data Engineer
• Function: builds production-grade data products
• Good at:
– Data and systems architecture
– Hadoop, Pig/Hive, MapReduce, Mahout
– Java, Python, Perl, SQL, C++, etc.
– NoSQL (HBase, Cassandra, Mongo)
Role: Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
– Statistics, machine learning
– Text processing, NLP
– R, Matlab, SAS, SQL
– Scripting, prototyping
– Visualization / telling the story
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ [email protected]
@ofermend
We’re hiring!
Data Science training: www.hortonworks.com/training