SCALABLE & DEPLOYABLE DATA SCIENCE WITH THE ANACONDA PLATFORM
Kristopher Overholt, Product Manager
Continuum Analytics
#OpenDataScienceMeans #AnacondaCON
OVERVIEW
• Collaborative Data Science Workflows
• Scaling Out with Anaconda
  • Spectrum of parallelization
  • Spark, Hadoop, Dask, and other parallel frameworks
  • Example distributed/parallel use cases
• Productionizing Data Science Projects
  • Enterprise deployment considerations
• Deploying Data Science Projects
  • Notebooks, dashboards, interactive applications, and models with APIs
COLLABORATIVE DATA SCIENCE WORKFLOWS
COLLABORATIVE DATA SCIENCE WORKFLOWS
Data science teams often use intermediate deployments and a modular, layered development approach for data ingest, data cleaning, computation, machine learning, visualization, and more.
ANACONDA – SCALED-OUT OPEN DATA SCIENCE
• Application and visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.
• Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more
• Computation: PySpark, SparkR, Dask, Distributed
• Data and resource management: HDFS, NFS, YARN, SGE, SLURM
• Servers: bare-metal or cloud-based clusters
SPECTRUM OF PARALLELIZATION
From explicit control (fast but low-level) to implicit control (restrictive but easy):
• Threads, processes
• MPI, ZeroMQ
• Dask
• Hadoop, Spark
• SQL: Hive, Pig, Impala
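At the explicit end of the spectrum, a minimal standard-library sketch of the same workload run with threads or processes via `concurrent.futures` (the function and chunking scheme here are illustrative, not part of any of the frameworks above):

```python
# Explicit parallelism: split a CPU-bound sum into chunks and run the
# chunks on a thread pool or a process pool via concurrent.futures.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def partial_sum(bounds):
    """Sum the integers in [start, stop) -- stand-in for real work."""
    start, stop = bounds
    return sum(range(start, stop))

def parallel_sum(n, workers=4, use_processes=False):
    # Split [0, n) into one chunk per worker and reduce the partial sums.
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(1_000_000))  # threads; pass use_processes=True for processes
```

Frameworks further along the spectrum (Dask, Spark, SQL engines) take over exactly this chunking and scheduling work so you do not write it by hand.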
SCALING OUT WITH ANACONDA AND SPARK
Using Anaconda with Spark is:
• Extensible: Use libraries from Anaconda with PySpark and SparkR jobs
• Integrated: Use interactive notebooks with data in HDFS and on YARN clusters
• Secure: Works with Kerberized Hadoop clusters
• Scalable: Map pandas, NumPy, and SciPy jobs across large clusters and data sets
• Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions
Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.
SCALING OUT WITH ANACONDA AND DASK
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: Supports sophisticated and messy algorithms
• Scales up: Runs resiliently on clusters of hundreds of machines
• Scales down: Runs pragmatically in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
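The "familiar at scale" idea above rests on blocked algorithms: operate on data too large to process at once by mapping over fixed-size chunks and combining partial results. A minimal standard-library sketch of that idea (Dask itself builds a task graph and schedules the chunks across threads, processes, or a cluster; the function names here are illustrative, not Dask's API):

```python
# Blocked mean: map (sum, count) over chunks in parallel, then reduce.
from concurrent.futures import ThreadPoolExecutor

def chunked(data, size):
    """Yield successive chunks of `data` of length `size`."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def blocked_mean(data, chunk_size=1000):
    # Map step: per-chunk partial results, computed on a thread pool.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: (sum(c), len(c)),
                                 chunked(data, chunk_size)))
    # Reduce step: combine partials into a global mean.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

print(blocked_mean(list(range(10_001))))  # mean of 0..10000 -> 5000.0
```

With Dask, the equivalent computation would be expressed against a chunked array or dataframe object, and the same code scales from a laptop to a cluster by swapping the scheduler.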
OTHER WAYS TO SCALE OUT WITH ANACONDA
Anaconda integrates with:
• Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more
• Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet
• File formats and data stores: CSV, SQL, JSON, HDF5, Parquet, etc.
• Cloud platforms: Amazon Web Services, Microsoft Azure, Google Cloud Platform
• Streaming analytics: streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK

Anaconda Technology Partners:
• Cloudera
• Hortonworks
• IBM
• H2O
• Docker
• … and more
SCALING OUT WITH ANACONDA
[Diagram: the Anaconda platform connecting business analysts and data scientists with developers, data engineers, and DevOps across a cluster]
SCALING OUT WITH ANACONDA

Without Anaconda Scale:
• Head node: manually install Python, R, packages, and dependencies
• Compute nodes: manually install Python, R, packages, and dependencies on each node

With Anaconda Scale:
• Easily install Anaconda with performance-optimized Python and R packages, and manage environments across all nodes in a cluster (head node and compute nodes)
SCALING OUT WITH ANACONDA – EXAMPLE USE CASES

Analyzing text, tabular, or array data using Dask
• Use pandas dataframes or NumPy arrays at scale
• Work with data in different formats and data stores

Distributed natural language processing with text data using PySpark
• Explore data using a distributed memory cluster
• Interactively query and analyze data using libraries from Anaconda

Distributed machine learning workflows with Dask, Spark, H2O, TensorFlow, and more
• Work interactively and collaboratively in notebooks
• Simplify installation and management of ML libraries and dependencies

Handling custom code and workflows using Dask
• Work with custom data formats
• Construct complex pipelines, including ETL and flexible computations
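The custom-pipeline use case can be sketched with the standard library alone: extract rows from CSV data, transform them (clean types, drop dirty records), and load the result. The data, column names, and stage functions below are invented for illustration; in practice, Dask or Spark would run each stage over partitions of a much larger data set.

```python
# A minimal extract -> transform -> load pipeline over CSV text.
import csv
import io

RAW = """sensor,reading
a,1.5
b,not_a_number
a,2.5
c,4.0
"""

def extract(text):
    """Parse CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop unparseable readings and convert values to float."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"sensor": row["sensor"],
                            "reading": float(row["reading"])})
        except ValueError:
            continue  # skip dirty records
    return cleaned

def load(rows):
    """Aggregate per-sensor totals (stand-in for writing to a data store)."""
    totals = {}
    for row in rows:
        totals[row["sensor"]] = totals.get(row["sensor"], 0.0) + row["reading"]
    return totals

print(load(transform(extract(RAW))))  # {'a': 4.0, 'c': 4.0}
```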
PRODUCTIONIZING DATA SCIENCE PROJECTS
PRODUCTIONIZING DATA SCIENCE PROJECTS
• Provisioning compute resources
• Managing dependencies and environments
• Ensuring availability and uptime, and monitoring status
• Engineering for scalability
• Sharing compute resources
• Securing data, network connectivity, and credentials
• Securing network communications with SSL
• Managing authentication and access control
DEPLOYING WITH COLLABORATIVE DATA SCIENCE WORKFLOWS

Review → Design → Build → Validate → Deploy
• Review: Assess and review requirements and data sources
• Design: Conceptual design of the interactive application or dashboard
• Build: Build the dashboard or application with Anaconda
• Validate: Test and validate the dashboard or application
• Deploy: Deploy the dashboard or application at scale using best practices
DEPLOYING DATA SCIENCE PROJECTS – NOTEBOOKS
DEPLOYING DATA SCIENCE PROJECTS – DASHBOARDS
DEPLOYING DATA SCIENCE PROJECTS – INTERACTIVE APPLICATIONS
DEPLOYING DATA SCIENCE PROJECTS – MODELS WITH REST APIS

[Diagram: a machine learning pipeline (load data → clean data → anomaly detection, regression, clustering) feeding deployed applications: models with REST APIs, dashboards/reports, and interactive applications]

Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
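The model-behind-a-REST-API pattern can be sketched with only the standard library. The endpoint path, port, and scoring rule below are invented for illustration; a production deployment (e.g., through Anaconda Enterprise) would sit behind a proper application server with authentication and SSL.

```python
# A toy model served over HTTP: POST JSON features, get a JSON score back.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Toy 'model': a fixed linear score over two features."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port=8123):
    """Start the model server on a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    import urllib.request
    server = serve()
    req = urllib.request.Request(
        "http://127.0.0.1:8123/predict",
        data=json.dumps({"x1": 2.0, "x2": 1.0}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())  # {"score": 3.0}
    server.shutdown()
```

Any dashboard or interactive application can then consume this endpoint the same way the client above does, which is the layering described on this slide.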
SCALABLE AND DEPLOYABLE DATA SCIENCE
…with Anaconda and Anaconda Enterprise, including:
• Scaled-up analytics: Develop and deploy the same code and environments on your local machine and on a cluster
• Environment management: Dynamically manage Python, R, dependencies, and other conda packages and environments across a cluster
• Collaboration: Easily share versioned notebooks and projects across users, and replicate analysts' environments for different jobs, users, and groups
• Hadoop integration: Support for Hadoop, Spark, and other distributed workflows; compatible with enterprise Hadoop distributions
ADDITIONAL RESOURCES FOR SCALABLE AND DEPLOYABLE DATA SCIENCE
• Anaconda Enterprise subscriptions: https://www.continuum.io/anaconda-subscriptions
• Anaconda Scale: https://docs.continuum.io/anaconda-scale
• Webinars on scaling out with Anaconda: https://www.continuum.io/webinars
• Blog posts on scaling out with Anaconda, including "Productionizing and Deploying Data Science Projects": https://www.continuum.io/blog/developer-blog
Thank You!
@ContinuumIO @koverholt