© Cloudera, Inc. All rights reserved.
Transforming Analytics with Cloudera Data Science Workbench
Process data, develop and serve predictive models.
Age of Machine Learning
[Figure: cost of compute and data volume plotted against time (1950s to 2010s), with regions labeled "No Machine Learning" and "Machine Learning"]
Our current platform
OPERATIONS
• Cloudera Manager
• Cloudera Director
DATA MANAGEMENT
• Cloudera Navigator
• Encrypt and KeyTrustee
• Optimizer
INTEGRATE
• Structured: Sqoop
• Unstructured: Kafka, Flume
PROCESS, ANALYZE, SERVE
• Batch: Spark, Hive, Pig, MapReduce
• Stream: Spark
• SQL: Impala
• Search: Solr
• Other: Kite
UNIFIED SERVICES
• Resource management: YARN
• Security: Sentry, RecordService
STORE
• NoSQL: HBase
• Other: Object Store
• Filesystem: HDFS
• Relational: Kudu
Apache Spark
De facto Data Processing and Modern Analytic Engine
Apache Spark
Fast and flexible general-purpose data processing for Hadoop
Data Engineering
Stream Processing
Data Science & Machine Learning
Unified API and processing Engine for large scale data
Spark Addresses Common Limitations
Access and Usability
One of the key advantages of Apache Spark is its intuitive and flexible API for big data processing, available in popular programming languages. Prior to Apache Spark, users had access only to very limited, inflexible abstractions for processing large distributed data, with poor support outside Java.
Data Processing Performance
MapReduce made big strides in enabling cost-effective batch processing of large volumes of data. However, businesses continue to see a need to shorten data processing windows and consume data faster, which requires a new framework with significantly better performance.
Machine Learning at Scale
Data science and machine learning on big data are exciting areas of focus. However, they require libraries that enable building models on large distributed data and APIs that allow flexible exploration of data.
Apache Spark
Apache Spark is at the core of our data science experience
• Libraries for common machine learning
• Trusted in production by our customers
• Delivered with expert support and training
• A requirement for our Data Science Workbench
Apache Spark is a huge driver for machine learning
• Native language development tools
• Reliable operation at big data scale
• Native access to Hadoop data for testing and training
Spark 2.1 is here
• Separate parcel for easy installation alongside other Spark instances
• Better Streaming Performance
• Machine Learning Persistence
Machine Learning
Machine Learning on Hadoop
Raw Data: many sources, many formats, varying validity
Validated ML Models
End User
Data Engineering
Data Science
Well-formatted data
Training, validation, and test data
cleaning
merging
filtering
model building
model training
hyper-parameter tuning
pipeline execution
production operation
Data Engineering
Consumption for analysis
ongoing data ingestion
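The data engineering steps on this slide (cleaning, filtering) and the hand-off of training, validation, and test data to data science can be sketched in a few lines of plain Python. This is a minimal illustration, not the deck's pipeline: the record layout, the cleaning rules, the 60/20/20 split, and the function names `clean` and `split` are all assumptions made for the example. On a cluster the same steps would typically run on Spark rather than on in-memory lists.

```python
import random


def clean(records):
    """Keep only records with a numeric 'value' and an integer 'label'.

    The field names and rules are hypothetical; real pipelines would
    encode their own validity checks here.
    """
    cleaned = []
    for rec in records:
        if rec.get("value") is None or rec.get("label") is None:
            continue  # filtering: drop incomplete records
        try:
            cleaned.append({"value": float(rec["value"]), "label": int(rec["label"])})
        except (TypeError, ValueError):
            continue  # cleaning: drop records that cannot be parsed
    return cleaned


def split(records, train_pct=60, validation_pct=20, seed=42):
    """Shuffle deterministically, then cut into train/validation/test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Raw data of varying validity: ten good rows plus two bad ones.
raw = [{"value": str(i), "label": i % 2} for i in range(10)]
raw += [{"value": "n/a", "label": 1}, {"value": "3.5", "label": None}]

rows = clean(raw)
train_set, validation_set, test_set = split(rows)
```

Fixing the shuffle seed keeps the split reproducible, which matters when models are retrained and compared across runs.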
Machine Learning Deployment Patterns
Train
• Build in Notebooks
• Train on CDH (Spark ML)
• Deliver on transactional systems or run batch
Build and Train
• Build on CDH (Workbench)
• Train on CDH (Spark ML)
• Deliver on transactional systems
Build, Train, and Serve
• Build on CDH (Workbench)
• Train on CDH (Spark ML)
• Deliver on CDH (HBase/Kudu/Spark Streaming)
Apache Spark MLlib
Collection of mainstream machine learning algorithms built on Spark
Including:
• Classifiers: logistic regression, boosted trees, random forests, etc
• Clustering: k-means, Latent Dirichlet Allocation (LDA)
• Recommender Systems: Alternating Least Squares
• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc
• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc
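To make one item from the list concrete, the sketch below computes TF-IDF weights in plain Python. It is not the MLlib API; it only illustrates what the feature transform calculates on a tiny in-memory corpus. The smoothed inverse document frequency, log((N + 1) / (df + 1)), is assumed here to match the form Spark documents for its IDF estimator, and the function name `tf_idf` and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the raw count within a document; inverse document
    frequency uses the smoothed form log((N + 1) / (df + 1)).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in docs
    ]


# Hypothetical corpus: "hadoop" appears everywhere, "spark" in one document.
docs = [
    ["spark", "hadoop", "spark"],
    ["hadoop", "kudu"],
    ["hadoop", "impala"],
]
weights = tf_idf(docs)
```

A term that occurs in every document ("hadoop" here) gets weight zero, while a term concentrated in one document ("spark") gets the highest weight, which is the property that makes TF-IDF useful for feature engineering on text.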
Cloudera Data Science
Self-Service Data Science for the Enterprise
Types of data science
Exploratory (discover and quantify opportunities)
• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models, share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office
Operational (deploy production systems)
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
• End State: Online/production applications
https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc
Common Limitations
Access
Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
Scale
Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
Developer Experience
Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.
Open data science in the enterprise
IT: drive adoption while maintaining compliance
Data Scientist: explore, experiment, iterate
Solving Data Science is a Full-Stack Problem
• Leverage Big Data: ✓ Hadoop
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide sufficient toolset for the Data Analysts: ✓ Spark, Hive, Hue
• Provide sufficient toolset for the Data Scientists + Data Engineers: ✓ Data Science Workbench (beta)
• Provide standard data governance capabilities: ✓ Navigator + Partners
• Provide standard security across the stack: ✓ Kerberos, Sentry, Record Service, KMS/KTS
• Provide flexible deployment options: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Provide management tools that make it easy for IT to deploy/maintain: ✓ Cloudera Manager/Director
Data Science Workbench
Self-service data science for the enterprise
Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing
Key Benefits
How is Cloudera Data Science different?
Works with fully secured clusters
One tool for multiple languages (Python, R, Scala)
Multi-tenant Architecture
Common Platform
Security, Lineage and Governance
Ingestion: Flume/Sqoop/Kafka
Analytics: Hive/Impala/Spark/Search
ML: spark.mllib
Deep Learning Frameworks
HDFS
Session A, Session B, Session N
Cloudera Manager
How does CDSW help?
• Visualize results
• Change and Compile Source code
• Retrain and redeploy
• Extensible Engines
• Configurable Sessions
• Trivial to tweak parameters
• Multiple Users
Roles/Governance
CDH
Thank You