New Capabilities in the PyData Ecosystem
Peter Wang Continuum Analytics @pwang
Data Science @ NYT
Python & SciPy
• High performance linear algebra, image processing, optimization via NumPy, optimized C++, FORTRAN
• Large structured data via HDF5, memmap
• Out of core processing, streaming & realtime
• Distributed computing via MPI, IPython Parallel, etc.
• GPU & heterogeneous computing via OpenCL, PyCUDA, others
• Massive adoption in research, national labs, industry (engineering, finance, etc.)
• IPython Notebook: 2005-2011
• pandas: 2008-2009
• scikit-learn: 2007
• NumPy: 2006
• matplotlib: 2002
• IPython: 2001
• Numarray: 2001
• SciPy: 1999
• Numeric: 1995
Python has >15 year history in scientific computing
"Python's Scientific Ecosystem"
@jakevdp
"Many More Tools"
@jakevdp
Focus On
• Bokeh
• Dask
Not Gonna Talk About...
• Blaze, odo
• dynd
• xray
• NumPy
• Pandas
• PyTables & h5py
• Beaker Notebook
• IPython widgets, JupyterHub
• conda, Anaconda Cluster
• Docker
• Docker
• Docker
Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• No need to write JavaScript
• Support for R, Scala, Julia, Lua
http://bokeh.pydata.org
Dashboards & Data Apps
Static Notebooks/HTML, Interactive Plots
http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction
Extensible Architecture
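[Diagram: bokeh.py object graph (Python) <-> JSON <-> BokehJS object graph (browser), with bokeh-server between the two when a server is used. Other labels: server.py, Browser, App Model]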
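The idea in the diagram, a Python-side object graph flattened to JSON and rebuilt on the other side, can be sketched with the standard library alone. This is a toy illustration only: the `make_model` and `ref` helpers, the model types, and the JSON layout are hypothetical and do not reproduce Bokeh's actual schema.

```python
import json

# Hypothetical, simplified "models"; real Bokeh models and JSON differ.
def make_model(model_id, model_type, **attrs):
    return {"id": model_id, "type": model_type, "attributes": attrs}

def ref(model):
    """Point at another model by id instead of nesting it."""
    return {"ref": model["id"]}

source = make_model("1001", "DataSource", data={"x": [1, 2, 3], "y": [4, 5, 6]})
glyph = make_model("1002", "Line", x="x", y="y", source=ref(source))
plot = make_model("1003", "Plot", title="demo", renderers=[ref(glyph)])

# Python side: flatten the object graph to JSON text.
payload = json.dumps({"models": [source, glyph, plot], "root": ref(plot)})

# "Browser" side (simulated in Python here): rebuild an id -> model lookup
# and follow references to walk the graph back from the root.
doc = json.loads(payload)
by_id = {m["id"]: m for m in doc["models"]}
root = by_id[doc["root"]["ref"]]
first_renderer = by_id[root["attributes"]["renderers"][0]["ref"]]
data = by_id[first_renderer["attributes"]["source"]["ref"]]["attributes"]["data"]
```

Because references are by id, any client that can parse JSON can rebuild the same graph, which is one plausible reason the same approach can serve several host languages.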
Dask
Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree, 720x1440 array each day
Bigger Data
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
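A rough sketch of what chunking buys: the size arithmetic from the slide, followed by a mean computed one daily grid at a time so that only a single grid is ever in memory. The `daily_grids` generator is a hypothetical stand-in for reading one day's array from disk; the tiny sizes used at the end are for illustration only.

```python
# Size of the full dataset, using the figures from the slide.
n_lat, n_lon, n_days, bytes_per_value = 720, 1440, 12341, 4
total_bytes = n_lat * n_lon * n_days * bytes_per_value
total_gb = total_bytes / 1e9  # decimal gigabytes

def daily_grids(n_days, n_cells):
    """Hypothetical stand-in for loading one day's grid at a time."""
    for day in range(n_days):
        yield [float(day)] * n_cells

def chunked_mean(grids):
    """Mean over all values, keeping only running totals between chunks."""
    total, count = 0.0, 0
    for grid in grids:
        total += sum(grid)
        count += len(grid)
    return total / count

# Tiny example: 10 "days" of 6 cells each.
small_mean = chunked_mean(daily_grids(n_days=10, n_cells=6))
```

The same pattern (per-chunk partial results, then a cheap final combine) is what a blocked algorithm automates.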
DAG of Computation
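A computation DAG can be written down as an ordinary Python dict, in the spirit of the dict-of-task-tuples form that Dask uses for its graphs. The sketch below is an assumption-laden miniature: `simple_get` is a hypothetical helper written for this example, not Dask's scheduler, and it treats any string argument that matches a key as a dependency.

```python
from operator import add, mul

def inc(x):
    return x + 1

# Keys map to plain data or to a task tuple: (function, arg1, arg2, ...).
# A string argument that matches a key refers to that key's result.
graph = {
    "x": 1,
    "y": (inc, "x"),
    "z": (add, "y", 10),
    "w": (mul, "z", "y"),
}

def simple_get(graph, key, cache=None):
    """Evaluate `key` recursively, computing each node at most once."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            simple_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data
    cache[key] = result
    return result

w = simple_get(graph, "w")  # x=1, y=2, z=12, w=24
```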
Dask: Out of Core Scheduler for Python
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler coming
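To make "dynamic task scheduling" concrete, here is a minimal sketch that runs a dict-of-task-tuples graph on a thread pool, submitting each task as soon as its inputs are finished. `threaded_get` and its helpers are hypothetical names written for this example; Dask's own schedulers are considerably more careful about ordering, memory, and errors.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
from operator import add

def is_task(value):
    return isinstance(value, tuple) and bool(value) and callable(value[0])

def dependencies(graph, key):
    """Keys that must finish before `key` can run."""
    task = graph[key]
    if is_task(task):
        return {a for a in task[1:] if isinstance(a, str) and a in graph}
    return set()

def run_task(graph, key, results):
    task = graph[key]
    if is_task(task):
        func, *args = task
        return func(*[results[a] if isinstance(a, str) and a in graph else a
                      for a in args])
    return task  # plain data

def threaded_get(graph, num_workers=4):
    deps = {k: dependencies(graph, k) for k in graph}
    results, running, waiting = {}, {}, set(graph)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while waiting or running:
            # Submit every task whose inputs are all finished.
            ready = [k for k in waiting if deps[k].issubset(results)]
            for k in ready:
                waiting.remove(k)
                running[pool.submit(run_task, graph, k, results)] = k
            if not running:
                raise ValueError("cycle or missing dependency in graph")
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                results[running.pop(fut)] = fut.result()
    return results

graph = {
    "a": 1,
    "b": 2,
    "c": (add, "a", "b"),
    "d": (add, "a", 10),   # independent of "c", so both may run concurrently
    "e": (add, "c", "d"),
}
results = threaded_get(graph)
```

The scheduler never needs a fixed plan up front: what runs next is decided by which results have arrived, which is the sense of "dynamic" assumed here.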
Simple Architecture
Core Concepts
dask.array: out-of-core (OOC), parallel, ND array
Arithmetic: +, *, ...
Reductions: mean, max, ...
Slicing: x[10:, 100:50:-2]
Fancy indexing: x[:, [3, 1, 2]]
Some linear algebra: tensordot, qr, svd
Parallel algorithms (approximate quantiles, topk, ...)
Slightly overlapping arrays
Integration with HDF5
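A short sketch of a few of the operations listed above, assuming `dask` and `numpy` are installed. A small in-memory NumPy array stands in for what would normally be a large on-disk dataset (for example an HDF5 dataset); the shapes and chunk sizes are arbitrary choices for illustration.

```python
import numpy as np
import dask.array as da

# Small stand-in for a large on-disk array.
data = np.arange(24, dtype="float64").reshape(4, 6)

# Split into 2 x 3 blocks: a 2 x 2 grid of chunks.
x = da.from_array(data, chunks=(2, 3))

# Operations build a task graph lazily; .compute() runs it.
total = (x + 1).sum().compute()        # arithmetic + reduction
col_max = x.max(axis=0).compute()      # reduction along an axis
picked = x[:, [3, 1, 2]].compute()     # fancy indexing on columns
```

Nothing is evaluated until `.compute()` is called, so the chunked work can be scheduled block by block rather than loading the whole array at once.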
dask.dataframe: out-of-core (OOC), parallel dataframe
Elementwise operations: df.x + df.y
Row-wise selections: df[df.x > 0]
Aggregations: df.x.max()
groupby-aggregate: df.groupby(df.x).y.max()
Value counts: df.x.value_counts()
Drop duplicates: df.x.drop_duplicates()
Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
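The expressions listed above follow the pandas API, so they can be tried on a small in-memory frame first. The sketch below runs them with plain pandas (assumed installed), on the assumption that dask.dataframe accepts the same calls over partitioned data as the slide indicates; the column values and index labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, -2, 1, 3], "y": [10, 20, 30, 40]})

summed = df.x + df.y                     # elementwise operation
positive = df[df.x > 0]                  # row-wise selection
x_max = df.x.max()                       # aggregation
y_max_by_x = df.groupby(df.x).y.max()    # groupby-aggregate
counts = df.x.value_counts()             # value counts
unique_x = df.x.drop_duplicates()        # drop duplicates

# Join on index (pd.merge here; the slide shows the dd.merge equivalent).
left = pd.DataFrame({"a": [1, 2]}, index=["r1", "r2"])
right = pd.DataFrame({"b": [3, 4]}, index=["r1", "r2"])
joined = pd.merge(left, right, left_index=True, right_index=True)
```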
More Complex Graphs
cross validation
http://continuum.io/blog/xray-dask
PyData's Future
• Dozens of international meetup groups
• International conferences each year, including collaborations with EuroPython, Strata, and others
• More companies investing in the ecosystem
  • Dato - SFrame, SGraph, ...
  • Cloudera - Impyla, Ibis, ...
  • Microsoft - Python in AzureML
  • Databricks - PySpark
  • Continuum - *.*