New Capabilities in the PyData Ecosystem

Peter Wang Continuum Analytics @pwang

Data Science @ NYT

Python & SciPy

• High performance linear algebra, image processing, optimization via NumPy, optimized C++, FORTRAN

• Large structured data via HDF5, memmap

• Out of core processing, streaming & realtime

• Distributed computing via MPI, IPython Parallel, etc.

• GPU & heterogeneous via OpenCL, PyCUDA, others

• Massive adoption in research, national labs, industry (engineering, finance, etc.)

• IPython Notebook: 2005-2011

• pandas: 2008-2009

• scikit-learn: 2007

• NumPy: 2006

• matplotlib: 2002

• IPython: 2001

• Numarray: 2001

• SciPy: 1999

• Numeric: 1995

Python has a >15-year history in scientific computing

"Python's Scientific Ecosystem"

@jakevdp

"Many More Tools"

@jakevdp

Focus On

• Bokeh

• Dask

Not Gonna Talk About...

• Blaze, odo

• dynd

• xray

• NumPy

• Pandas

• PyTables & h5py

• Beaker Notebook

• IPython widgets, JupyterHub

• conda, Anaconda Cluster

• Docker

Bokeh

• Interactive visualization

• Novel graphics

• Streaming, dynamic, large data

• For the browser, with or without a server

• No need to write JavaScript

• Support for R, Scala, Julia, Lua

http://bokeh.pydata.org

Dashboards & Data Apps

Static Notebooks/HTML, Interactive Plots

http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00%20-%20intro.ipynb#Interaction

Extensible Architecture

[Diagram labels: server.py, App Model, Browser, bokeh.py object graph, bokeh-server, JSON, BokehJS object graph]
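One way to read this architecture: the Python side builds a graph of plot objects, serializes it to JSON, and the browser library rebuilds the same graph. The sketch below illustrates only that serialization idea using the standard library. The `Model` class, its fields, and the JSON layout are hypothetical and are not Bokeh's actual classes or wire format.

```python
import itertools
import json

_ids = itertools.count(1)


class Model:
    """Hypothetical stand-in for a plot object; not a Bokeh class."""

    def __init__(self, kind, **attrs):
        self.id = f"m{next(_ids)}"
        self.kind = kind
        self.attrs = attrs


def serialize(root):
    """Flatten an object graph into a JSON string.

    Nested Model instances are replaced by {"ref": id} so that each
    object appears exactly once, however many times it is referenced.
    """
    seen = {}

    def encode(value):
        if isinstance(value, Model):
            visit(value)
            return {"ref": value.id}
        if isinstance(value, (list, tuple)):
            return [encode(v) for v in value]
        return value

    def visit(model):
        if model.id in seen:
            return
        seen[model.id] = None  # reserve the slot; also guards against cycles
        seen[model.id] = {
            "id": model.id,
            "type": model.kind,
            "attrs": {k: encode(v) for k, v in model.attrs.items()},
        }

    visit(root)
    return json.dumps({"root": root.id, "models": list(seen.values())})


# A tiny three-object graph: a plot holding a line glyph that reads a data source.
source = Model("DataSource", x=[1, 2, 3], y=[4, 5, 6])
glyph = Model("Line", source=source, color="navy")
plot = Model("Plot", title="demo", renderers=[glyph])
payload = serialize(plot)
print(payload)
```

Because the payload is plain JSON, any client that understands the layout could rebuild the graph, which is one plausible reason bindings for other languages are feasible.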

rBokeh

http://hafen.github.io/rbokeh

Dask

Example: Ocean Temp Data

• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

• Every 1/4 degree, 720x1440 array each day

Bigger Data

36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed

If you don't have this much RAM...

... better start chunking.
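The size estimate and the chunking idea can be checked with a few lines of plain Python. This is a minimal sketch: it assumes 4 bytes per value (as the slide's arithmetic implies) and uses a synthetic stream in place of the real ocean temperature files. The helper names `chunks` and `blocked_mean` are made up for illustration.

```python
from itertools import islice

# Figures from the slide: a 720 x 1440 grid per day, 12341 days, 4 bytes per value.
LAT, LON, DAYS, ITEMSIZE = 720, 1440, 12341, 4

bytes_per_day = LAT * LON * ITEMSIZE
total_bytes = bytes_per_day * DAYS
print(f"one day:  {bytes_per_day / 1e6:.1f} MB")
print(f"all days: {total_bytes / 1e9:.1f} GB")


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def blocked_mean(values, chunk_size):
    """Mean computed one block at a time.

    Only `chunk_size` values are held in memory at once, plus a running
    sum and count, so the full dataset never has to fit in RAM.
    """
    total, count = 0.0, 0
    for block in chunks(values, chunk_size):
        total += sum(block)
        count += len(block)
    return total / count


# Synthetic stand-in for data that would normally be read lazily from disk.
stream = (float(i % 10) for i in range(1_000_000))
result = blocked_mean(stream, chunk_size=100_000)
print(result)
```

A single day's grid is only a few megabytes, so the problem is the total, not any one piece. That is what makes a block-at-a-time approach practical.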

DAG of Computation

Dask: Out of Core Scheduler for Python

• A parallel computing framework

• That leverages the excellent Python ecosystem

• Using blocked algorithms and task scheduling

• Written in pure Python

Core Ideas

• Dynamic task scheduling yields sane parallelism

• Simple library to enable parallelism

• Dask.array/dataframe to encapsulate the functionality

• Distributed scheduler coming
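To make "task scheduling" concrete, here is a small sketch of a computation expressed as a graph and evaluated on demand. It is modeled loosely on a dict-of-tuples convention, where each key maps either to data or to a `(function, arg, ...)` task. The `get` function below is a simplified, sequential stand-in written for illustration, not Dask's scheduler.

```python
from operator import add, mul


def get(graph, key):
    """Evaluate `key` in a task graph given as a dict.

    A value is either plain data or a task: a tuple whose first element
    is a callable and whose remaining elements are its arguments.
    Arguments that name other keys in the graph are computed first.
    """
    cache = {}

    def is_key(arg):
        return isinstance(arg, str) and arg in graph

    def compute(k):
        if k in cache:  # each task runs at most once
            return cache[k]
        value = graph[k]
        if isinstance(value, tuple) and value and callable(value[0]):
            func, *args = value
            resolved = [compute(a) if is_key(a) else a for a in args]
            value = func(*resolved)
        cache[k] = value
        return value

    return compute(key)


graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),          # depends on x and y
    "prod": (mul, "sum", 10),        # depends on sum
    "total": (add, "sum", "prod"),   # depends on sum and prod
}

result = get(graph, "total")
print(result)
```

Because dependencies are explicit in the graph, a real scheduler is free to run independent tasks in parallel and to discard intermediate results once nothing else needs them.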

Simple Architecture

Core Concepts

dask.array: OOC, parallel, ND array

Arithmetic: +, *, ...

Reductions: mean, max, ...

Slicing: x[10:, 100:50:-2]

Fancy indexing: x[:, [3, 1, 2]]

Some linear algebra: tensordot, qr, svd

Parallel algorithms (approximate quantiles, topk, ...)

Slightly overlapping arrays

Integration with HDF5
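"Slightly overlapping arrays" matters for operations whose output at one position depends on neighboring values, such as a moving average: a chunk then needs a few elements borrowed from the chunks next to it. The sketch below shows the idea on a 1-D list in plain Python. The function names are hypothetical and this is not how dask.array implements it.

```python
def moving_average(values, window):
    """Centered moving average over complete windows only (odd `window`)."""
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


def blocked_moving_average(values, window, chunk_size):
    """Same result, computed chunk by chunk.

    Each chunk is extended by `half` elements on both sides (clipped at
    the ends of the data) so that windows straddling a chunk boundary
    are still complete.
    """
    half = window // 2
    n = len(values)
    out = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        lo = max(start - half, 0)   # left edge including the overlap
        hi = min(stop + half, n)    # right edge including the overlap
        block = values[lo:hi]
        # Only positions owned by this chunk that have a full window.
        for i in range(max(start, half), min(stop, n - half)):
            j = i - lo  # position of element i inside the extended block
            out.append(sum(block[j - half : j + half + 1]) / window)
    return out


data = [float(v * v % 7) for v in range(20)]
expected = moving_average(data, window=3)
blocked = blocked_moving_average(data, window=3, chunk_size=6)
print(blocked == expected)
```

The overlap depth only needs to be half the window, so the extra data read per chunk stays small compared with the chunk itself.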

dask.dataframe: OOC, parallel DataFrame

Elementwise operations: df.x + df.y

Row-wise selections: df[df.x > 0]

Aggregations: df.x.max()

groupby-aggregate: df.groupby(df.x).y.max()

Value counts: df.x.value_counts()

Drop duplicates: df.x.drop_duplicates()

Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
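An aggregation such as a group-wise max can be done out of core because it splits cleanly into per-block work plus a cheap merge step. The sketch below shows that pattern in plain Python on `(key, value)` rows. It is analogous in spirit to `df.groupby(df.x).y.max()`, but the helper names are hypothetical and it makes no claim about how dask.dataframe does this internally.

```python
from itertools import islice


def partial_max(block):
    """Group-wise max for one in-memory block of (key, value) rows."""
    result = {}
    for key, value in block:
        if key not in result or value > result[key]:
            result[key] = value
    return result


def merge_max(a, b):
    """Combine two partial results; max is associative, so order is irrelevant."""
    merged = dict(a)
    for key, value in b.items():
        if key not in merged or value > merged[key]:
            merged[key] = value
    return merged


def groupby_max(rows, chunk_size):
    """Group-wise max over an iterable of rows, one block at a time."""
    it = iter(rows)
    result = {}
    while True:
        block = list(islice(it, chunk_size))
        if not block:
            return result
        result = merge_max(result, partial_max(block))


rows = [("a", 3), ("b", 7), ("a", 9), ("c", 1), ("b", 2), ("c", 5), ("a", 4)]
result = groupby_max(rows, chunk_size=3)
print(result)
```

The memory needed is one block plus one entry per distinct key, regardless of how many rows there are in total.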

More Complex Graphs

cross validation

http://continuum.io/blog/xray-dask

PyData's Future

• Dozens of international meetup groups

• International conferences each year, including collaborations with EuroPython, Strata, and others

• More companies investing in the ecosystem

• Dato - SFrame, SGraph, ...

• Cloudera - Impyla, Ibis, ...

• Microsoft - Python in AzureML

• Databricks - PySpark

• Continuum - *.*