© 2016 Continuum Analytics - Confidential & Proprietary
Anaconda – Open Data Science Platform
Continuum Analytics is the company behind Anaconda and offers:
– Open-Source Software
– Commercial Software
– Training
– Consulting
Anaconda is the leading Open Data Science platform powered by Python, the fastest-growing open data science language
Accelerate, Connect & Empower
Quickly Engage with Your Data
Anaconda: Modern, Open Data Science Platform powered by Python
– 730+ Popular Python & R packages
– Compiled for Windows, Mac, and Linux
– Package Distribution is free for everyone
– Foundation of our Enterprise Platform
– Extensible via Conda Package Manager
– Easily sandbox and deploy packages & analytical computing environments
Anaconda is Trusted by Industry Leaders
Financial Services
• Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting
Government
• Fraud detection, data crawling, web & cyber data analytics, statistical modeling
Healthcare & Life Sciences
• Genomics data processing, cancer research, natural language processing for health data science
High Tech
• Customer behavior, recommendations, ad bidding, retargeting, social media analytics
Retail & CPG
• Engineering simulation, supply chain modeling, scientific analysis
Oil & Gas
• Pipeline monitoring, noise logging, seismic data processing, geophysics
Conda: Package and Environment Management
conda (Windows, Mac OSX, Linux)
– Install packages
– Update packages
– Create sandboxes: Conda environments
– Conda environments: Critical for reproducibility, collaboration & scale
Figure: three example conda environments (Env 1, Env 2, Env 3) holding different stacks: Python 2.7, Python 3.4, R with R Essentials, and Jupyter, with differing package versions (NumPy v1.10 and v1.11; Pandas v0.16 and v0.18).
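The conda operations named on this slide (install, update, create environments) can be sketched as command lines. The Python helpers below only assemble argument lists for `conda create`, `conda install`, and `conda update`; they do not run conda. The environment names and the pairing of package versions to environments are assumptions made for illustration, loosely based on the versions shown in the figure.

```python
"""Assemble (but do not run) conda command lines for isolated environments."""


def conda_create(env_name, specs):
    """Return the argv that creates a new environment with the given specs."""
    return ["conda", "create", "--yes", "--name", env_name] + list(specs)


def conda_install(env_name, specs):
    """Return the argv that installs packages into an existing environment."""
    return ["conda", "install", "--yes", "--name", env_name] + list(specs)


def conda_update(env_name, packages):
    """Return the argv that updates packages inside an existing environment."""
    return ["conda", "update", "--yes", "--name", env_name] + list(packages)


# Hypothetical environments: which versions belong to which environment
# is an assumption, not something the slide states.
ENVIRONMENTS = {
    "env1": ["python=2.7", "numpy=1.10", "pandas=0.16"],
    "env2": ["python=3.4", "numpy=1.11", "pandas=0.18", "jupyter"],
    "env3": ["r-essentials"],
}

# Print one "conda create" command line per environment.
for name, specs in ENVIRONMENTS.items():
    print(" ".join(conda_create(name, specs)))
```

Because each environment pins its own versions, two projects can depend on different NumPy or Pandas releases on the same machine without conflict, which is the reproducibility point the slide makes.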
Continuum Sponsored Open-Source Projects
• Bokeh – Interactive Web Visualizations and Applications
• Dask – Painless distributed and parallel computations in Python
• Numba – JIT for Python applications
• Jupyter, Spyder – Notebooks and IDE for data science
• Pandas, Datashader, Blaze, …
Anaconda Enterprise Components
Pillars: OPEN DATA SCIENCE, DATA SCIENCE GOVERNANCE, DATA SCIENCE COLLABORATION, DATA SCIENCE FOR BIG DATA
Anaconda
• High performance Python & R
• 720+ data science packages
• Cross-platform package, dependency & environments
• Community driven package repository collaboration
Anaconda Navigator
• Desktop Portal & Installer
Anaconda Repository
• Storage & sharing of packages, environments, notebooks
• On-premise governance
• Enterprise authentication
Anaconda
• Deep Learning: Theano, Tensorflow, Caffe, Keras, Neon, Lasagne
• Natural Language Processing: NLTK, spaCy
• Machine Learning: Scikit-learn
• GPU enablement
Anaconda Enterprise Notebooks
• Collaborative project based workflows for Python & R
• Enterprise authentication & permissioning
• Notebook sharing, versioning, search, differencing
Anaconda
• Interactive browser based dashboards & visualizations with Bokeh
• Bokeh apps using Python, R, Scala
Anaconda Scale
• Hadoop & Spark integration
• Scalable distributed processing framework
• Integration with resource management & data stores
• Distributed package, dependency & environments
Anaconda Fusion
• Integration of Open Data Science with Microsoft Excel®
• Big Data querying & transformations
Anaconda Enterprise
Open Source Without Anxiety: Governance and Scalability
On-premises package repository
– Governance for your analytics environment
– Empower your data scientists within the structure of enterprise IT
Enterprise notebook collaboration
– Easily replicate and share analysts’ environments
– Centrally store proprietary libraries and manage versioning
Scalable analytics computations
– Scale up: leverage GPU and parallel-optimized libraries
– Scale out: easily manage Anaconda across your Hadoop/Spark cluster
– Scale up and out with Python and R
Enterprise data science deployment
– Encapsulate and deploy data science projects
– Deploy live notebooks, dashboards, interactive applications, and models with REST APIs
Scaling Out with Anaconda
Anaconda - Scaled Out Open Data Science
Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.
Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more
Computation: PySpark, SparkR, Dask, Distributed
Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM
Servers: Bare-metal or Cloud-based Cluster
Figure side labels: Anaconda, Cluster
Spectrum of Parallelization
From explicit control (fast but low-level) to implicit control (restrictive but easy):
• Threads, Processes, MPI, ZeroMQ
• Dask
• Hadoop, Spark
• SQL: Hive, Pig, Impala
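The trade-off between explicit and implicit control can be illustrated on a small scale with the Python standard library alone. In this sketch, which assumes nothing beyond `threading` and `concurrent.futures`, the first function partitions the work and manages threads by hand, while the second hands scheduling to an executor. It uses none of the frameworks named on the slide; the function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """The unit of work applied to every element."""
    return x * x


def explicit_threads(values, n_workers=4):
    """Explicit control: partition the work and manage the threads by hand."""
    results = [None] * len(values)

    def worker(start):
        # Each thread handles indices start, start + n_workers, ...
        # so no two threads ever write to the same slot.
        for i in range(start, len(values), n_workers):
            results[i] = square(values[i])

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def implicit_pool(values, n_workers=4):
    """More implicit control: the executor decides how work is scheduled."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(square, values))


data = list(range(10))
# Both approaches compute the same result; only the amount of control differs.
print(explicit_threads(data) == implicit_pool(data))
```

The hand-written version exposes every scheduling decision (and every chance to get one wrong); the executor version is shorter but only fits work that can be expressed as a map over independent items.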
Scaling Out with Anaconda and Spark
• Using Anaconda with Spark is:
  • Extensible: Use libraries from Anaconda with PySpark and SparkR jobs
  • Integrated: Use interactive notebooks with data in HDFS and on YARN clusters
  • Secure: Works with Kerberized Hadoop clusters
  • Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets
  • Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions
• Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.
Other ways to Scale Out with Anaconda
• Anaconda integrates with:
  • Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more
  • Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet
  • CSV, SQL, JSON, HDF5, Parquet, etc.
  • Amazon Web Services, Microsoft Azure, Google Cloud Platform
  • Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK
• Anaconda Technology Partners:
  • Cloudera
  • Hortonworks
  • IBM
  • H2O
  • Docker
  • … and more
Scaling Out with Anaconda
Without Anaconda Scale
Head Node
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies
Compute Nodes
1. Manually install Python, packages & dependencies
2. Manually install R, packages & dependencies
With Anaconda Scale
Head Node and Compute Nodes
Easily install Anaconda with performance optimized Python and R packages and manage environments across all nodes in a cluster
Scaling Out with Anaconda – Example Use Cases
Analyzing text, tabular, or array data using Dask
• Use Pandas dataframes or NumPy arrays at scale
• Work with data in different formats and data stores
Distributed natural language processing with text data using PySpark
• Explore data using a distributed memory cluster
• Interactively query and analyze data using libraries from Anaconda
Distributed machine learning workflows with Dask, Spark, H2O, Tensorflow, and more
• Work interactively and collaboratively in notebooks
• Simplify installation and management of ML libraries and dependencies
Handle custom code and workflows using Dask
• Work with custom data formats
• Construct complex pipelines including ETL and flexible computations
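The idea of building a pipeline out of custom steps can be sketched with a toy task graph. Dask describes computations as a dictionary that maps keys to tasks, where a task is a tuple whose first element is a callable and whose remaining elements are arguments or other keys. The evaluator below is a minimal single-threaded illustration of that convention written with the standard library only. It is not Dask's scheduler, and the ETL functions and sample data are hypothetical.

```python
def _is_key(graph, obj):
    """True if obj names another entry in the graph (unhashables never do)."""
    try:
        return obj in graph
    except TypeError:
        return False


def get(graph, key, cache=None):
    """Evaluate `key` in a dict-based task graph, resolving dependencies first."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [get(graph, a, cache) if _is_key(graph, a) else a for a in args]
        value = func(*resolved)
    else:
        value = task  # a literal value rather than a task
    cache[key] = value
    return value


# Hypothetical ETL steps over a small in-memory data set.
def extract():
    return ["3", " 4", "oops", "5 "]


def clean(rows):
    # Keep only rows that parse as non-negative integers.
    return [int(r) for r in rows if r.strip().isdigit()]


def mean(total, count):
    return total / count


graph = {
    "raw": (extract,),
    "clean": (clean, "raw"),
    "total": (sum, "clean"),
    "count": (len, "clean"),
    "mean": (mean, "total", "count"),
}

print(get(graph, "mean"))
```

Because the pipeline is plain data, a scheduler is free to decide what runs where; here the shared `cache` simply ensures that `clean` is computed once even though two downstream tasks depend on it.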
Productionizing and Deploying Data Science Projects
Productionizing Data Science Projects
Deploying Data Science Projects - Notebooks
Deploying Data Science Projects - Dashboards
Deploying Data Science Projects – Interactive Applications
Deploying Data Science Projects – Models with REST APIs
Figure: Machine Learning Pipeline (Load Data, Clean Data, Anomaly Detection, Regression, Clustering) and Deployed Applications (Models with REST APIs, Dashboards, Reports, Interactive Applications).
Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
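As a rough illustration of a model exposed through a REST API and consumed by a client, the following sketch serves a trivial linear model over HTTP using only the Python standard library. It does not show how Anaconda Enterprise deploys models. The `/predict` path, the JSON field names, and the coefficients are all assumptions chosen for the example.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "trained" linear model: y = SLOPE * x + INTERCEPT
SLOPE, INTERCEPT = 2.0, 1.0


def predict(x):
    return SLOPE * x + INTERCEPT


class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # hypothetical endpoint name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["x"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def query(port, x):
    """Client side: POST a feature value and read back the prediction."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", "/predict", body=json.dumps({"x": x}),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())["prediction"]
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), ModelHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
result = query(server.server_port, 3.0)
server.shutdown()
server.server_close()
print(result)
```

A dashboard or interactive application would play the role of `query` here, calling the endpoint and rendering the returned values instead of printing them.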
Scalable and Deployable Data Science
• … with Anaconda and Anaconda Enterprise, including:
  • Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster
  • Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster
  • Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups
  • Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions
Data Science Deployment in Anaconda Enterprise
With Anaconda Enterprise life just got a whole lot easier…
• ENSURES AVAILABILITY, UPTIME, & MONITORING
• PROVISIONS COMPUTE RESOURCES
• MANAGES DEPENDENCIES & ENVIRONMENTS
• SHARE COMPUTE RESOURCES
• SECURE NETWORK COMMUNICATIONS & SSL
• SECURE DATA & NETWORK CONNECTIVITY
• ENGINEER FOR SCALABILITY
• MANAGE AUTHENTICATION & ACCESS CONTROL
• SCHEDULE REGULAR EXECUTION OF JOBS
Learn more: https://www.continuum.io/blog/developer-blog/productionizing-deploying-data-science-projects
Anaconda Fusion
Anaconda Fusion brings Open Data Science to Microsoft Excel
• BRING interactive visualizations, machine learning and ETL to Excel
• BRIDGE Excel Data to Python & R through notebooks
• ACCESS all the power of Python and Big Data, natively embedded inside Excel
Empowering Business Analysts & Data-driven Employees
• Anaconda Fusion is a Microsoft Excel® Add-in that enables a unique and simple link between Excel and Python without writing code
• Anaconda Fusion is targeted to Business Analysts who want “No Code” Data Science
Analysts and Data Scientists can keep using their preferred tools
“No Code” Data Science – Data Loading Example
1. Select Anaconda Fusion Notebook and click “Upload”
2. Select function you wish to run
3. Click “Run”
4. Data is loaded into spreadsheet
Just change one line of code in your notebook
Anaconda Fusion Use Cases
• Extract data – pull data directly into Excel to perform analysis
• Machine Learning – use trained models created by Data Scientists and plug them into your spreadsheet data
• Interactive Visualizations – create custom advanced interactive graphs, charts and plots from Excel data
• Big Data – analyze, transform, model and query data stored in Hadoop and Spark
Figure: Anaconda Fusion on Mac