Overview
• Abstract overview
• A short tour through data science
• Understanding Scale
• The Zen of PyData
Performance & Limitations
• For 1 million rows (a ~45 MB CSV), this takes ~3 seconds on a MacBook
• … but CSVs are not where Python shines
• For a billion rows…?
• Also, the data has to fit in memory
• For this problem, maybe we can be clever with itertools or toolz and streaming for out-of-core computation (see the sketch below)
• The lack of “easy” and “obvious” solutions for larger datasets is why people tend to believe that “Python doesn’t scale”
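As a rough sketch of that streaming idea with toolz (the file name and column position here are hypothetical; the talk does not show this code):

import csv
from toolz import pluck

def column_mean(path, col):
    # Stream the CSV row by row; memory use stays constant
    # no matter how large the file is.
    with open(path, newline="") as f:
        rows = csv.reader(f)
        next(rows)                      # skip the header
        total = count = 0
        for value in pluck(col, rows):  # pull one column lazily
            total += float(value)
            count += 1
    return total / count

# mean_fare = column_mean("rides.csv", 3)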
Enter the Hadoops!
Hadoop decomposes work into Map and Reduce steps:

From http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
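In the spirit of that tutorial, a word-count job for Hadoop Streaming might look roughly like this (a sketch, not the tutorial's exact code):

# mapper.py: emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: sum the counts per word (Hadoop sorts input by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print(current_word + "\t" + str(current_count))
    current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))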
An Inconvenient Truth
• “Python” libraries in the Hadoop ecosystem are lipstick.py on an elephant in the room: Python in Hadoop is a second-class citizen outside of the most straightforward use cases
• The cognitive model is very different
• Access to libraries is extremely limited
Query, Exploration, Modeling
• “Core” Hadoop is traditionally concerned with scalable, robust query performance
• Even though Spark and Hadoop enable scale-out, for many data scientists the usability for ad hoc exploration is dramatically lower than with Python or R because of the reduced interactivity
• In essence, the usability costs of scale-out mean that data scientists explore on a single machine, sometimes on a subset of the data, and then must not only deploy onto a different machine (a cluster) but also recode into a different language or API
Scaling Up

[Diagram: a single node (operating system, CPU, memory, disk) running “native code” in C/C++, Python, or R. Scaling “up” means a bigger single node: cpu0 … cpu32, ssd0 … ssd16, dmm0 … dmm16, still running the same native-code stack on one operating system.]
Scaling Out

[Diagram: many nodes, each with its own operating system, CPU, memory, and disk, and each running a DistributedWorker. A Distributed Computing API ties the workers together into a virtual distributed computer (e.g. a Hadoop cluster).]
Symmetry Ideal

[Diagram: the scale-up node (cpu0 … cpu32, ssd0 … ssd16, dmm0 … dmm16, native C/C++/Python/R code on one operating system) shown side by side with the scale-out virtual distributed computer (DistributedWorkers on separate OS/CPU/memory/disk nodes behind a Distributed Computing API). Ideally, scale up and scale out would look the same to the programmer.]
Programming Model

[Diagram: the scale-up node is programmed directly in C, C++, Python, or R (“native code”); the virtual distributed computer (e.g. a Hadoop cluster) is programmed in Java or Scala, or in SQL via Hive/Impala. The open question: what is the Distributed Computing API for native code?]
Hadoop

[Diagram: a Hadoop cluster is a set of nodes (CPU, memory, disk) running Java VMs, with HDFS providing distributed storage and YARN managing resources. Hadoop Map-Reduce is driven from Java, Python, or R. Spark layers the RDD abstraction on the same cluster and is driven from Scala, Python, R, or SQL via PySpark, SparkR, and SparkSQL.]
Who’s Afraid of Distributed Computing?
• “What’s hard about distributed computing is not space, but time”
• The difficulty of the problem scales with your temporal exposure
• The fewer nodes you have, the less complex distributed computing is
• So optimizing per-node performance can be a lifesaver
• Additionally, the whole reason you use a distributed computer is performance, so performance is an aspect of correctness
Why Is Python Nice?
• Core data structures, with nice functions and spelling
• Facilitates a wide variety of algorithms that interoperate
• Open access to a huge variety of existing libraries and algorithms
• Very easy to get high performance when you need it…
• … on a single machine
Scaling Python
• Most Python data structures and supporting libraries (including C, C++, and FORTRAN ones) were designed for the single-node use case
• “Automagically” scaling out single-node (“unified memory”) code to a distributed machine is hard
• Making it robust is extremely hard to do with software
• “When software gets hard, we solve it with hardware”
Scalable Primitives
• What if we created a small set of distributed data structures for Python that have this property?
• Not everything from the single-node case would work
• But many things would
• And the spelling could at least be made congruent (see the sketch below)
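For instance, here is what that congruent spelling looks like in practice with NumPy and dask.array (array sizes and chunking chosen arbitrarily):

import numpy as np
import dask.array as da

x_np = np.random.random((1000, 1000))
x_da = da.random.random((1000, 1000), chunks=(500, 500))

# The same spelling works on both data structures;
# only the distributed version needs an explicit compute().
m_np = x_np.mean(axis=0)
m_da = x_da.mean(axis=0).compute()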
Hadoop/Spark Revisited

[Diagram: zooming from the Hadoop cluster (Java VMs, RDD, YARN, HDFS over disks, CPUs, and memory) into a single node. Each node runs three Java VMs, an HDFS DataNode, a Spark worker, and a YARN Node Manager, over local CPU(s), memory, and disk(s), exposing the distributed APIs: HDFS, Spark RDD, and YARN Cluster.]
[Diagram, continued: Python reaches those distributed APIs (HDFS, Spark RDD, YARN Cluster) through PySpark.]
[Diagram, developed over three slides: instead of going through the JVM-centric Hadoop API, native code (C, C++, Python, etc.) on a node talks to the HDFS Java VM through the hdfs.py library and to the YARN Java VM directly, over local CPU, memory, and disk(s).]
Pythonic Distributed Computing?
Dask: Pythonic Parallelism
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• dask.array/dataframe encapsulate the functionality
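The task-scheduling core is exposed directly through dask.delayed; a minimal example:

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Calling delayed functions builds a task graph; nothing runs yet.
total = add(inc(1), inc(2))

# A scheduler walks the graph in parallel and returns the result.
print(total.compute())   # 5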
dask.array: OOC, Parallel, N-D Array
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd
• Parallel algorithms (approximate quantiles, topk, …)
• Slightly overlapping arrays
• Integration with HDF5
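A small, self-contained example of the blocked-array style (sizes and chunking chosen arbitrarily):

import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Familiar NumPy spelling; the expression builds a lazy task graph
y = (x + x.T).mean(axis=0)

print(y.compute()[:5])   # execute the graph in parallel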
dask.dataframe: OOC, Parallel Dataframe
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• Groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
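Putting a few of those together (the CSV shard pattern and the columns x and y are hypothetical):

import dask.dataframe as dd

# Each matching CSV becomes one or more pandas partitions
df = dd.read_csv("data-*.csv")

# Same spelling as pandas; evaluation is lazy until compute()
positive = df[df.x > 0]
result = positive.groupby("x").y.max()
print(result.compute())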
LU Decomposition

# A is assumed to be a square NumPy array, with L and U its reference
# factors from a single-node LU (e.g. scipy.linalg.lu), as in the talk.
dA = da.from_array(A, (3, 3))   # chunk A into 3x3 blocks
dP, dL, dU = da.linalg.lu(dA)   # lazy, blocked LU decomposition
assert np.allclose(dL.compute(), L)
assert np.allclose(dU.compute(), U)
dL.visualize()                  # render the task graph
Dask + Hadoop

[Diagram: the answer to “Pythonic distributed computing?” is Dask. Each node runs a Dask worker as native code alongside the HDFS and YARN Java VMs, reading HDFS data through hdfs.py, while a Dask Scheduler coordinates the workers across the cluster.]
NYC Taxi Dataset
http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
• Total: 50 GB on disk, 120 GB in memory
• Creates ~400 sharded Pandas dataframes (~128 MB each)
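A sketch of how such a dataset is typically loaded with dask.distributed (the scheduler address, HDFS path, and datetime column are hypothetical):

import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-host:8786")   # hypothetical address

# Each CSV shard becomes a pandas dataframe partition
df = dd.read_csv("hdfs:///nyc-taxi/2015/*.csv",
                 parse_dates=["tpep_pickup_datetime"])

# persist() loads the partitions into distributed memory,
# in a fast and data-local fashion
df = client.persist(df)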
Familiar Programming Model

[Screenshot: pandas-style operations on the distributed dataframe; the computation takes ~6 seconds.]

Tip % by Day of Week and by Hour of Day

[Plot: mean tip percentage by hour of day and day of week. People are very generous at 4 a.m.]
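A sketch of the kind of computation behind that plot; the column names follow the public NYC taxi schema and may differ from the talk's exact code:

# Keep only rows where a tip fraction makes sense
rides = df[(df.fare_amount > 0) & (df.tip_amount >= 0)]

tip_frac = rides.tip_amount / rides.fare_amount
hour = rides.tpep_pickup_datetime.dt.hour   # parsed as datetime above

# Group one series by another, then average per hour
tip_by_hour = tip_frac.groupby(hour).mean().compute()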
Takeaways
• Pythonic spelling for parallel arrays and dataframes
• Direct access to Hadoop data, without paying the cost of JVM serialization, via the hdfs3 library
• Excited about Cloudera’s Arrow project, which will further improve direct data access
• Load data into distributed memory via persist(), in a fast and data-local fashion
• Direct scheduling with YARN via the knit library
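For the direct HDFS access, hdfs3 speaks the HDFS protocol from native code; roughly like this (the host, port, and paths are hypothetical):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host="namenode", port=8020)
print(hdfs.ls("/nyc-taxi"))

# Read bytes straight from HDFS, with no JVM serialization in the path
with hdfs.open("/nyc-taxi/2015-01.csv", "rb") as f:
    header = f.readline()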
Not only Hadoop!

[Diagram: a Dask Scheduler driving Dask workers (native code over local CPU, memory, and disk(s)) with no Hadoop services involved, via dask.bag, dask.array, and dask.dataframe.]

• Dask works well with traditional distributed computing (Sun Grid Engine, IPython Parallel, etc.)
• Convenient dec2 library for easily provisioning on EC2
• Excellent for “embarrassingly parallel” numerical problems (see the sketch below)
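For example, dask.bag handles exactly those embarrassingly parallel, semi-structured workloads (the log-file pattern here is hypothetical):

import dask.bag as db

# One partition per matching file; processed in parallel
lines = db.read_text("logs-*.txt")

# Ten most common words across all files
top_words = (lines.map(str.split)
                  .flatten()
                  .frequencies()
                  .topk(10, key=lambda kv: kv[1]))
print(top_words.compute())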
Scale Up and Out
• Same Pythonic programming model for scale-up and scale-out
• Works great with existing fast C, C++, and Python code
• Threads or processes
• Can stream data from disk, for out-of-core computation

[Diagram: a Dask Scheduler drives dask.bag, dask.array, and dask.dataframe over dask workers running user code on each CPU/memory/SSD set (cpu0/dmm0/ssd0 … cpuN/dmmN/ssdN), whether those sets sit inside one large node or across a cluster.]
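The threads-or-processes choice is just a scheduler setting; a minimal sketch using dask's current configuration API (which postdates the talk):

import dask
import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))

# Shared-memory threads (the default for arrays)...
with dask.config.set(scheduler="threads"):
    print(x.mean().compute())

# ...or separate processes, when the GIL gets in the way
with dask.config.set(scheduler="processes"):
    print(x.mean().compute())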
Productivity at All Scales
• “Scaling out” is important for production and data processing
• “Scaling down” and “scaling up” are important for agile, fast iterations in data exploration and modeling
• Don’t compromise agility for the sake of future-proofing scalability
• Real wins come from being able to use the same language and API across scales