Transforming Analytics with Cloudera Data Science … 2_speaker 4_Clo… · © Cloudera, Inc. All rights reserved. 1 Transforming Analytics with Cloudera Data Science WorkBench Process

© Cloudera, Inc. All rights reserved.

Transforming Analytics with Cloudera Data Science Workbench

Process data, develop and serve predictive models.


Age of Machine Learning

[Figure: cost of compute and data volume plotted against time, 1950s to 2010s. The chart marks a "No Machine Learning" region and a "Machine Learning" region.]


Our current platform

OPERATIONS
• Cloudera Manager
• Cloudera Director

DATA MANAGEMENT
• Cloudera Navigator
• Encrypt and KeyTrustee
• Optimizer

INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume

PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• OTHER: Kite

UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService

STORE
• NoSQL: HBase
• OTHER: Object Store
• FILESYSTEM: HDFS
• RELATIONAL: Kudu


Apache Spark
De facto Data Processing and Modern Analytic Engine


Apache Spark
Fast and flexible general purpose data processing for Hadoop

Data Engineering

Stream Processing

Data Science & Machine Learning

Unified API and processing Engine for large scale data


Spark Addresses Common Limitations

Access and Usability

One of the key advantages of Apache Spark is its intuitive and flexible API for big-data processing, available in popular programming languages. Prior to Apache Spark, users had access only to limited, inflexible abstractions for processing large distributed data, with poor support outside Java.

Data Processing Performance

MapReduce made big strides in enabling cost-effective batch processing of large volumes of data. However, businesses continue to see a need to shorten data processing windows and consume data faster, which requires a new framework with significantly better performance.

Machine Learning at Scale

Data science and machine learning on big data are exciting areas of focus. However, they require libraries that enable building models on large distributed data, and APIs that allow flexible exploration of that data.


Apache Spark

Apache Spark is at the core of our data science experience

• Libraries for common machine learning

• Trusted in production by our customers

• Delivered with expert support and training

• A requirement for our Data Science Workbench

Apache Spark is a huge driver for machine learning

• Native language development tools

• Reliable operation at big data scale

• Native access to Hadoop data for testing and training

Spark 2.1 is here

• Separate parcel for easy implementation for multiple Spark instances

• Better Streaming Performance

• Machine Learning Persistence


Machine Learning


Machine Learning on Hadoop

Raw Data
• many sources
• many formats
• varying validity

Data Engineering
• cleaning
• merging
• filtering
• Result: well-formatted data

Data Science
• model building
• model training
• hyper-parameter tuning
• Works on: training, validation, and test data
• Result: validated ML models

Data Engineering
• pipeline execution
• production operation
• ongoing data ingestion

End User
• Consumption for analysis
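The data engineering steps above (cleaning and filtering raw records, then producing training, validation, and test data) can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark. The record fields (`age`, `label`), the 60/20/20 split, and the function names are assumptions made for the example and do not come from the slides.

```python
import random


def clean(records):
    """Drop records with missing fields and normalize types (cleaning + filtering)."""
    cleaned = []
    for rec in records:
        age, label = rec.get("age"), rec.get("label")
        if age is None or label is None:
            continue  # discard rows of "varying validity"
        cleaned.append({"age": float(age), "label": int(label)})
    return cleaned


def split(records, train=0.6, validation=0.2, seed=42):
    """Shuffle deterministically, then cut into train / validation / test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical raw data: ten valid rows plus two invalid ones.
raw = [{"age": str(20 + i), "label": i % 2} for i in range(10)]
raw += [{"age": None, "label": 1}, {"age": "33"}]

rows = clean(raw)
train_set, validation_set, test_set = split(rows)
print(len(rows), len(train_set), len(validation_set), len(test_set))
```

Fixing the shuffle seed keeps the split reproducible, which matters when several people compare models trained on the same data.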


Machine Learning Deployment Patterns

Train
• Build in Notebooks
• Train on CDH (Spark ML)
• Deliver on transactional systems or run batch

Build and Train
• Build on CDH (Workbench)
• Train on CDH (Spark ML)
• Deliver on transactional systems

Build, Train, and Serve
• Build on CDH (Workbench)
• Train on CDH (Spark ML)
• Deliver on CDH (HBase/Kudu/Spark Streaming)
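The "train in batch, deliver on a transactional system" pattern can be illustrated with a small sketch: a batch job fits a model and exports its parameters, and a separate serving component loads them to score single requests. The example below uses plain Python and JSON instead of Spark ML, and a one-variable least-squares fit stands in for the real model. All names (`train`, `export_model`, `load_scorer`) and the sample data are hypothetical.

```python
import json


def train(points):
    """Batch step: fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(model):
    """Persist the trained parameters in a portable format."""
    return json.dumps(model)


def load_scorer(payload):
    """Serving step: rebuild a scoring function from the exported parameters."""
    params = json.loads(payload)

    def score(x):
        return params["slope"] * x + params["intercept"]

    return score


# Hypothetical training history that follows y = 2x + 1.
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
payload = export_model(train(history))

score = load_scorer(payload)  # would run inside the transactional system
print(score(5.0))
```

The serving side depends only on the exported payload, not on the training code, which is what lets the two halves run on different systems.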


Apache Spark MLlib
Collection of mainstream machine learning algorithms built on Spark

Including:

• Classifiers: logistic regression, boosted trees, random forests, etc

• Clustering: k-means, Latent Dirichlet Allocation (LDA)

• Recommender Systems: Alternating Least Squares

• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)

• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc

• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc
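To make one of the feature engineering items concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It does not call MLlib. It assumes raw term counts for term frequency and the smoothed inverse document frequency log((N + 1) / (df + 1)), which is the form I understand Spark's IDF transformer to use. The sample documents and the function name are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    # Smoothed IDF (assumed formula): log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

    # Term frequency is the raw count of the term within the document.
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in documents
    ]


docs = [
    "spark makes machine learning scale".split(),
    "spark runs on hadoop".split(),
    "machine learning needs data".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

Terms that occur in few documents, such as "hadoop" here, receive a higher weight than terms shared across documents, such as "spark".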


Cloudera Data Science
Self-Service Data Science for the Enterprise


Types of data science

Exploratory (discover and quantify opportunities)
• Team: Data scientists and analysts
• Goal: Understand data, develop and improve models, share insights
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
• End State: Reports, dashboards, PDF, MS Office

Operational (deploy production systems)
• Team: Data engineers, developers, SREs
• Goal: Build and maintain applications, improve model performance, manage models in production
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
• End State: Online/production applications


https://medium.com/@KevinSchmidtBiz/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc


Common Limitations

Access

Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.

Scale

Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.

Developer Experience

Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.


Open data science in the enterprise

IT: drive adoption while maintaining compliance

Data Scientist: explore, experiment, iterate



Solving Data Science is a Full-Stack Problem

• Leverage Big Data ✓ Hadoop

• Enable real-time use cases ✓ Kafka, Spark Streaming

• Provide sufficient toolset for the Data Analysts ✓ Spark, Hive, Hue

• Provide sufficient toolset for the Data Scientists + Data Engineers ✓ Data Science Workbench (beta)

• Provide standard data governance capabilities ✓ Navigator + Partners

• Provide standard security across the stack ✓ Kerberos, Sentry, RecordService, KMS/KTS

• Provide flexible deployment options ✓ Cloudera Director

• Integrate with partner tools ✓ Rich Ecosystem

• Provide management tools that make it easy for IT to deploy/maintain ✓ Cloudera Manager/Director


Data Science Workbench
Self-service data science for the enterprise


Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise

Accelerates data science from development to production with:

• Secure self-service environments for data scientists to work against Cloudera clusters

• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions

• Workflow automation, version control, collaboration and sharing


Key Benefits
How is Cloudera Data Science different?

Works with fully secured clusters

One tool for multiple languages (Python, R, Scala)

Multi-tenant Architecture

Common Platform


[Figure: Common Platform architecture]

Security, Lineage and Governance

• Ingestion: Flume/Sqoop/Kafka

• Analytics: Hive/Impala/Spark/Search

• ML: spark.mllib

• Deep Learning Frameworks

HDFS

Session A, Session B, … Session N

Cloudera Manager


How does CDSW help!

• Visualize results

• Change and Compile Source code

• Retrain and redeploy

• Extensible Engines

• Configurable Sessions

• Trivial to tweak parameters

• Multiple Users

• Roles/Governance

• CDH


Thank You
