Overview
• Abstract overview
• A short tour through data science
• Understanding Scale
• The Zen of PyData
Performance & Limitations
• For 1 million rows (a ~45 MB CSV), this takes ~3 seconds on a MacBook
• … but CSVs are not where Python shines
• For a billion rows…?
• Also, the data has to fit in memory
• For this problem, maybe we can be clever with itertools or toolz and streaming for out-of-core computation (see the sketch below)
• The lack of “easy” and “obvious” solutions for larger datasets is why people tend to believe that “Python doesn’t scale”
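As a rough sketch of that streaming idea with toolz (the file name and column position here are hypothetical; the talk does not show this code):

import csv
from toolz import pluck

def column_mean(path, col):
    # Stream the CSV row by row; memory use stays constant
    # no matter how large the file is.
    with open(path, newline="") as f:
        rows = csv.reader(f)
        next(rows)                      # skip the header
        total = count = 0
        for value in pluck(col, rows):  # pull one column lazily
            total += float(value)
            count += 1
    return total / count

# mean_fare = column_mean("rides.csv", 3)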
Enter the Hadoops!
Hadoop decomposes work into Map and Reduce steps:

From http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
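In the spirit of that tutorial, a word-count job for Hadoop Streaming might look roughly like this (a sketch, not the tutorial's exact code):

# mapper.py: emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: sum the counts per word (Hadoop sorts input by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print(current_word + "\t" + str(current_count))
    current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))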
An Inconvenient Truth
• “Python” libraries in the Hadoop ecosystem are lipstick.py on an elephant in the room: Python in Hadoop is a second-class citizen outside of the most straightforward use cases
• The cognitive model is very different
• Access to libraries is extremely limited
Query, Exploration, Modeling
• “Core” Hadoop is traditionally concerned with scalable, robust query performance
• Even though Spark and Hadoop enable scale-out, for many data scientists the usability for ad hoc exploration is dramatically lower than with Python or R because of the reduced interactivity
• In essence, the usability costs of scale-out mean that data scientists explore on a single machine, sometimes on a subset of the data, and then must not only deploy onto a different machine (a cluster) but also recode into a different language or API
Scaling Up

[Diagram: a single node (operating system, CPU, memory, disk) running “native code” in C/C++, Python, or R. Scaling “up” means a bigger single node: cpu0 … cpu32, ssd0 … ssd16, dmm0 … dmm16, still running the same native-code stack on one operating system.]
Scaling Out

[Diagram: many nodes, each with its own operating system, CPU, memory, and disk, and each running a DistributedWorker. A Distributed Computing API ties the workers together into a virtual distributed computer (e.g. a Hadoop cluster).]
Symmetry Ideal

[Diagram: the scale-up node (cpu0 … cpu32, ssd0 … ssd16, dmm0 … dmm16, native C/C++/Python/R code on one operating system) shown side by side with the scale-out virtual distributed computer (DistributedWorkers on separate OS/CPU/memory/disk nodes behind a Distributed Computing API). Ideally, scale up and scale out would look the same to the programmer.]
Programming Model

[Diagram: the scale-up node is programmed directly in C, C++, Python, or R (“native code”); the virtual distributed computer (e.g. a Hadoop cluster) is programmed in Java or Scala, or in SQL via Hive/Impala. The open question: what is the Distributed Computing API for native code?]
Hadoop

[Diagram: a Hadoop cluster is a set of nodes (CPU, memory, disk) running Java VMs, with HDFS providing distributed storage and YARN managing resources. Hadoop Map-Reduce is driven from Java, Python, or R. Spark layers the RDD abstraction on the same cluster and is driven from Scala, Python, R, or SQL via PySpark, SparkR, and SparkSQL.]
Who’s Afraid of Distributed Computing?
• “What’s hard about distributed computing is not space, but time”
• The difficulty of the problem scales with your temporal exposure
• The fewer nodes you have, the less complex distributed computing is
• So optimizing per-node performance can be a lifesaver
• Additionally, the whole reason you use a distributed computer is performance, so performance is an aspect of correctness
Why Is Python Nice?
• Core data structures, with nice functions and spelling
• Facilitates a wide variety of algorithms that interoperate
• Open access to a huge variety of existing libraries and algorithms
• Very easy to get high performance when you need it…
• … on a single machine
Scaling Python
• Most Python data structures and supporting libraries (including C, C++, and FORTRAN ones) were designed for the single-node use case
• “Automagically” scaling out single-node (“unified memory”) code to a distributed machine is hard
• Making it robust is extremely hard to do with software
• “When software gets hard, we solve it with hardware”
Scalable Primitives
• What if we created a small set of distributed data structures for Python that have this property?
• Not everything from the single-node case would work
• But many things would
• And the spelling could at least be made congruent (see the sketch below)
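For instance, here is what that congruent spelling looks like in practice with NumPy and dask.array (array sizes and chunking chosen arbitrarily):

import numpy as np
import dask.array as da

x_np = np.random.random((1000, 1000))
x_da = da.random.random((1000, 1000), chunks=(500, 500))

# The same spelling works on both data structures;
# only the distributed version needs an explicit compute().
m_np = x_np.mean(axis=0)
m_da = x_da.mean(axis=0).compute()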
Hadoop/Spark Revisited

[Diagram: zooming from the Hadoop cluster (Java VMs, RDD, YARN, HDFS over disks, CPUs, and memory) into a single node. Each node runs three Java VMs, an HDFS DataNode, a Spark worker, and a YARN Node Manager, over local CPU(s), memory, and disk(s), exposing the distributed APIs: HDFS, Spark RDD, and YARN Cluster.]
[Diagram, continued: Python reaches those distributed APIs (HDFS, Spark RDD, YARN Cluster) through PySpark.]
[Diagram, developed over three slides: instead of going through the JVM-centric Hadoop API, native code (C, C++, Python, etc.) on a node talks to the HDFS Java VM through the hdfs.py library and to the YARN Java VM directly, over local CPU, memory, and disk(s).]
Pythonic Distributed Computing?
Dask: Pythonic Parallelism
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• dask.array/dataframe encapsulate the functionality
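The task-scheduling core is exposed directly through dask.delayed; a minimal example:

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Calling delayed functions builds a task graph; nothing runs yet.
total = add(inc(1), inc(2))

# A scheduler walks the graph in parallel and returns the result.
print(total.compute())   # 5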
dask.array: OOC, Parallel, N-D Array
• Arithmetic: +, *, …
• Reductions: mean, max, …
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd
• Parallel algorithms (approximate quantiles, topk, …)
• Slightly overlapping arrays
• Integration with HDF5
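A small, self-contained example of the blocked-array style (sizes and chunking chosen arbitrarily):

import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Familiar NumPy spelling; the expression builds a lazy task graph
y = (x + x.T).mean(axis=0)

print(y.compute()[:5])   # execute the graph in parallel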
dask.dataframe: OOC, Parallel Dataframe
• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• Groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
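Putting a few of those together (the CSV shard pattern and the columns x and y are hypothetical):

import dask.dataframe as dd

# Each matching CSV becomes one or more pandas partitions
df = dd.read_csv("data-*.csv")

# Same spelling as pandas; evaluation is lazy until compute()
positive = df[df.x > 0]
result = positive.groupby("x").y.max()
print(result.compute())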
LU Decomposition

# A is assumed to be a square NumPy array, with L and U its reference
# factors from a single-node LU (e.g. scipy.linalg.lu), as in the talk.
dA = da.from_array(A, (3, 3))   # chunk A into 3x3 blocks
dP, dL, dU = da.linalg.lu(dA)   # lazy, blocked LU decomposition
assert np.allclose(dL.compute(), L)
assert np.allclose(dU.compute(), U)
dL.visualize()                  # render the task graph
Dask + Hadoop

[Diagram: the answer to “Pythonic distributed computing?” is Dask. Each node runs a Dask worker as native code alongside the HDFS and YARN Java VMs, reading HDFS data through hdfs.py, while a Dask Scheduler coordinates the workers across the cluster.]
NYC Taxi Dataset
http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
• Total: 50 GB on disk, 120 GB in memory
• Creates ~400 sharded Pandas dataframes (~128 MB each)
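A sketch of how such a dataset is typically loaded with dask.distributed (the scheduler address, HDFS path, and datetime column are hypothetical):

import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-host:8786")   # hypothetical address

# Each CSV shard becomes a pandas dataframe partition
df = dd.read_csv("hdfs:///nyc-taxi/2015/*.csv",
                 parse_dates=["tpep_pickup_datetime"])

# persist() loads the partitions into distributed memory,
# in a fast and data-local fashion
df = client.persist(df)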
Familiar Programming Model

[Screenshot: pandas-style operations on the distributed dataframe; the computation takes ~6 seconds.]

Tip % by Day of Week and by Hour of Day

[Plot: mean tip percentage by hour of day and day of week. People are very generous at 4 a.m.]
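A sketch of the kind of computation behind that plot; the column names follow the public NYC taxi schema and may differ from the talk's exact code:

# Keep only rows where a tip fraction makes sense
rides = df[(df.fare_amount > 0) & (df.tip_amount >= 0)]

tip_frac = rides.tip_amount / rides.fare_amount
hour = rides.tpep_pickup_datetime.dt.hour   # parsed as datetime above

# Group one series by another, then average per hour
tip_by_hour = tip_frac.groupby(hour).mean().compute()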
Takeaways
• Pythonic spelling for parallel arrays and dataframes
• Direct access to Hadoop data, without paying the cost of JVM serialization, via the hdfs3 library
• Excited about Cloudera’s Arrow project, which will further improve direct data access
• Load data into distributed memory via persist(), in a fast and data-local fashion
• Direct scheduling with YARN via the knit library
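For the direct HDFS access, hdfs3 speaks the HDFS protocol from native code; roughly like this (the host, port, and paths are hypothetical):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host="namenode", port=8020)
print(hdfs.ls("/nyc-taxi"))

# Read bytes straight from HDFS, with no JVM serialization in the path
with hdfs.open("/nyc-taxi/2015-01.csv", "rb") as f:
    header = f.readline()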
Not only Hadoop!

[Diagram: a Dask Scheduler driving Dask workers (native code over local CPU, memory, and disk(s)) with no Hadoop services involved, via dask.bag, dask.array, and dask.dataframe.]

• Dask works well with traditional distributed computing (Sun Grid Engine, IPython Parallel, etc.)
• Convenient dec2 library for easily provisioning on EC2
• Excellent for “embarrassingly parallel” numerical problems (see the sketch below)
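For example, dask.bag handles exactly those embarrassingly parallel, semi-structured workloads (the log-file pattern here is hypothetical):

import dask.bag as db

# One partition per matching file; processed in parallel
lines = db.read_text("logs-*.txt")

# Ten most common words across all files
top_words = (lines.map(str.split)
                  .flatten()
                  .frequencies()
                  .topk(10, key=lambda kv: kv[1]))
print(top_words.compute())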
Scale Up and Out
• Same Pythonic programming model for scale-up and scale-out
• Works great with existing fast C, C++, and Python code
• Threads or processes
• Can stream data from disk, for out-of-core computation

[Diagram: a Dask Scheduler drives dask.bag, dask.array, and dask.dataframe over dask workers running user code on each CPU/memory/SSD set (cpu0/dmm0/ssd0 … cpuN/dmmN/ssdN), whether those sets sit inside one large node or across a cluster.]
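The threads-or-processes choice is just a scheduler setting; a minimal sketch using dask's current configuration API (which postdates the talk):

import dask
import dask.array as da

x = da.random.random((20000, 20000), chunks=(2000, 2000))

# Shared-memory threads (the default for arrays)...
with dask.config.set(scheduler="threads"):
    print(x.mean().compute())

# ...or separate processes, when the GIL gets in the way
with dask.config.set(scheduler="processes"):
    print(x.mean().compute())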
Productivity at All Scales
• “Scaling out” is important for production and data processing
• “Scaling down” and “scaling up” are important for agile, fast iterations in data exploration and modeling
• Don’t compromise agility for the sake of future-proofing scalability
• Real wins come from being able to use the same language and API across scales