Upload
lycong
View
247
Download
0
Embed Size (px)
Citation preview
Scaling High-Performance Python with Minimal Effort
1
Ehsan Totoni
Research Scientist, Intel Labs
STAC Summit NYC, Nov. 1st, 2017
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,
fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course
of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here
is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications
and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from
published specifications. Current characterized errata are available on request.
Intel, the Intel logo, Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
2017 © Intel Corporation.
*Other names and brands may be claimed as the property of others
2
Legal Disclaimer
Data analytics is the greatest value driver in technology
Financial services need insights from data
• Exploit market data for financial modeling, etc.
High performance big data analytics is crucial
• Democratize HPC for data scientists
3
Motivation
http://www.businesscomputingworld.co.uk/
Scripting languages like Python are productive but slow and serial
Big data frameworks (Hadoop/Spark) are hard to use and slow
• High overhead runtime libraries
• Not based on parallel computing fundamentals
High performance requires low-level programming
• Not practical for interactive workflows of data scientists and their expertise
4
Productivity-Performance Gap
python.org
llnl.orgisocpp.orgInfoobjects.com
5
Motivation
127.22
64.1
4.28
2.151.11
1
10
100
1 (36) 2 (72) 4 (144)
Exe
cu
tio
n T
ime
(s)
Amazon AWS c4.8xlarge instances (vCPUs)
Spark
MPI/C++
53x
Logistic Regression on Amazon AWS
Totoni et al. “A Case Against Tiny Tasks in Iterative Analytics”, HotOS’17
NOT STAC BENCHMARK
High performance/scalability for analytics/ML/AI with little effort
• Minimal changes to scripting source code
Compiler optimization and parallelization
• Scripting program → efficient parallel binary
High Performance Analytics Toolkit (HPAT)
• Python (previously Julia)
6
Overview
https://github.com/IntelLabs/HPAT.jl
python.org
https://github.com/IntelLabs/hpat
7
HPAT Python Example
@hpat.jit
def logistic_regression(iterations):
f = h5py.File("lr.hdf5", "r")
X = f['points'][:]
Y = f['responses'][:]
D = X.shape[1]
w = np.ones(D) - 0.5
for i in range(iterations):
w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)
return w
33x speedup
on 4 nodes
Example launch command:
mpirun -n 144 python logistic_regression.py
Numpy code is implicitly data-parallel
8
Data Parallelism Extraction
1Anderson et al. “Parallelizing Julia with a Non-invasive DSL”, ECOOP’17
D = A * B + C
parfor i=1:n
t[i]=A[i]*B[i]
parfor i=1:n
D[i]=t[i]+C[i]
parfor i=1:n
t[i]=A[i]*B[i]+C[i]
Recognize parallelism
Fuse loops
*
+
A
B
C
=D
9
Spark Workflow HPAT Workflow
Python code Spark API code
Spark Runtime
Python code
Cluster/cloud
Parallel binary (MPI)
Cluster/cloud
Rewrite by programmer
Compile by HPAT
Driver
Executor 1 …
…
Rank 0 …Rank 1 Rank N-1
Driver
Executor N-1Executor 0
10
Performance Evaluation
61
767 830351
0.013
10.5
0.24
0.03
2.060.98 0.95
0.01
0.1
1
10
100
1000
KernelDensity
LinearRegression
LogisticRegression
K-MeansE
xe
cutio
n t
ime
(s)
Spark MPI/C++ HPAT
46.2102
64.1
547
0.08
1.47 1.09 0.83
0.18
5.08
1.812.91
0.01
0.1
1
10
100
1000
Kernel Density LinearRegression
LogisticRegression
K-Means
Exe
cutio
n t
ime
(s)
Spark MPI/C++ HPAT
20x-256x speedup of HPAT vs Spark
Cori at NERSC/LBL
64 nodes (2048 cores)
Amazon AWS
4 nodes c4.8xlarge (144 vCPUs)
370x-2000x speedup of HPAT vs Spark
HPAT is within 2-4x MPI/C++HPAT Julia used, Python will be similar
Totoni et al. “HPAT: High Performance Analytics with Scripting Ease-of-Use”, ICS’17
NOT STAC BENCHMARKS
11
Pandas Example
@hpat.jit(locals={'s_open': hpat.float64[:], …})
def intraday_mean_revert():
f = h5py.File("stock_data.hdf5", "r"); …
for i in prange(nsyms):
symbol = sym_list[i]
s_open = f[symbol+'/Open'][:]; …
df = pd.DataFrame({'Open': s_open, …})
df['Stdev'] = df['Close'].rolling(window=90).std()
df['Moving Average'] = df['Close'].rolling(window=20).mean()
df['Criteria1'] = (df['Open'] - df['Low'].shift(1)) < -df['Stdev']
df['Criteria2'] = df['Open'] > df['Moving Average']
df['BUY'] = df['Criteria1'] & df['Criteria2']
df['Pct Change'] = (df['Close'] - df['Open']) / df['Open']
df['Rets'] = df['Pct Change'][df['BUY'] == True]
n_days = len(df['Rets'])
res = np.zeros(max_num_days)
if n_days:
res[-n_days:] = df['Rets'].fillna(0)
all_res += res
100x speedup
on 36 cores
http://www.pythonforfinance.net/2017/02/20/intraday-stock-mean-reversion-trading-backtest-in-python/
Explicit loop parallelism
Numpy:
• Element-wise operations: +, /, ==, exp, log, sqrt, …
• Array creation: zeros, ones_like, random, normal, …
• Others: sum, prod, dot, …
Pandas:
• Column access, and operations: df.A, df[‘A’], df.A.std()
• Filter: df[df.A > .5]
• Rolling windows: df.A.rolling(window=5).mean()
Parallel loop:
for i in prange(n):
s += A[i]**2
12
HPAT Operations
Input code to HPAT should be statically compilable (type stable)
• Dynamic code example:
• Rare in analytics
13
Variable Type Limitation
if flag1:
a = 2
else:
a = np.ones(n)
if isinstance(a, np.ndarray):
doWork(a)
if flag2:
f = np.zeros
else:
f = np.ones
b = f(m)
Data Frame column accesses should be static
• Dynamic code example:
• Refactor to:
14
Pandas Limitation
for i in range(5):
A += df['c'+str(i)]
A += df['c0']
A += df['c1']
A += df['c2']
A += df['c3']
A += df['c4']
Compiler approach superior to library approach for analytics
HPAT bridges productivity-performance gap
• Compiles Python programs to efficient parallel binaries
• Available on GitHub: https://github.com/IntelLabs/hpat
15
Summary
Higher performanceEasier to use
Simpler infrastructureBroader functionality
E. Totoni, A. Roy, S. R. Dulloor, “A Case Against Tiny Tasks in Iterative Analytics”, HotOS’17
E. Totoni, T. A. Anderson, T. Shpeisman, “HPAT: High Performance Analytics with Scripting
Ease-of-Use”, ICS’17
https://arxiv.org/abs/1611.04934
T. A. Anderson, H. Liu, L. Kuper, E. Totoni, J. Vitek, T. Shpeisman, “Parallelizing Julia with a
Non-invasive DSL”, ECOOP’17
E. Totoni, W. Hassan, T. A. Anderson, T. Shpeisman, “HiFrames: High Performance Data
Frames in a Scripting Language”, (arxiv) 2017
https://arxiv.org/abs/1704.02341
16
References