
Page 1:

Isaac Lyngaas (irlyngaas@gmail.com)

John Paige (paigejo@gmail.com)

Advised by: Srinath Vadlamani (srinathv@ucar.edu) & Doug Nychka (nychka@ucar.edu)

SIParCS, July 31, 2014

Page 2:

Why use HPC with R?

Accelerating mKrig & Krig

Parallel Cholesky
◦ Software Packages

Parallel Eigen Decomposition

Conclusions & Future Work

Page 3:

Accelerate the ‘Fields’ Krig and mKrig functions

Survey of parallel linear algebra software

◦ Multicore (Shared Memory)
◦ GPU
◦ Xeon Phi

Page 4:

Many developers & users in the field of statistics
◦ Readily available code base

Problem: R is slow for large problem sizes

Page 5:

Bottleneck is in linear algebra operations
◦ mKrig – Cholesky decomposition
◦ Krig – Eigen decomposition

R uses sequential algorithms

Strategy: Use C-interoperable libraries to parallelize linear algebra
◦ C functions callable through the R environment (see the sketch below)
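A minimal sketch of this call path, assuming a hand-written C wrapper compiled into a shared library and loaded with dyn.load(); the file and routine names (chol_wrap.c, chol_wrap) are illustrative placeholders, not the project's actual wrappers:

# chol_wrap.c (illustrative): a C routine with a .C-compatible signature that
# would hand the matrix to a parallel library factorization (e.g. a dpotrf call):
#   void chol_wrap(double *A, int *n, int *info) { /* call PLASMA/MAGMA here */ }
#
# Build the shared library once:  R CMD SHLIB chol_wrap.c
dyn.load("chol_wrap.so")                                    # load the compiled wrapper into R
n <- 1000L
A <- crossprod(matrix(rnorm(n * n), n, n)) + n * diag(n)    # SPD test matrix
out <- .C("chol_wrap", A = A, n = n, info = integer(1))     # call the C routine from R

The .Call() interface works the same way for wrappers that need to manipulate R objects directly.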

Page 6:

Symmetric positive definite -> triangular
◦ A = LL^T
◦ Nice properties for determinant calculation
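The determinant property follows directly from A = LL^T: det(A) = prod(diag(L))^2. A one-line base-R sketch (the SPD test matrix is only for illustration):

A <- crossprod(matrix(rnorm(500 * 500), 500, 500)) + 500 * diag(500)  # SPD test matrix
L <- t(chol(A))                     # chol() returns the upper factor; transpose to get L
log_det_A <- 2 * sum(log(diag(L)))  # log det(A) = 2 * sum(log(diag(L)))

This lets the log-determinant used in likelihood evaluation be read off the factor that was already computed, without a separate determinant calculation.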

Page 8:

Multicore (Shared Memory)

Block Scheduling
◦ Determines which operations should be done on which core

Block Size optimization
◦ Dependent on cache memory (see the blocked-factorization sketch below)
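To make the blocking concrete, here is a minimal sequential R sketch of a right-looking blocked Cholesky; each step corresponds to one of the tile kernels PLASMA schedules across cores (xPOTRF, xTRSM, xSYRK/xGEMM, as on the final slide). This is only an illustration of the block structure, not PLASMA's implementation, and the default block size of 256 is arbitrary.

blocked_chol <- function(A, nb = 256) {
  n <- nrow(A)
  L <- matrix(0, n, n)
  for (k in seq(1, n, by = nb)) {
    kk <- k:min(k + nb - 1, n)
    # POTRF: factor the current diagonal block (chol() returns the upper factor)
    L[kk, kk] <- t(chol(A[kk, kk, drop = FALSE]))
    if (max(kk) < n) {
      rest <- (max(kk) + 1):n
      # TRSM: triangular solve for the panel below the diagonal block
      L[rest, kk] <- t(forwardsolve(L[kk, kk, drop = FALSE], t(A[rest, kk, drop = FALSE])))
      # SYRK/GEMM: update the trailing submatrix with the just-computed panel
      A[rest, rest] <- A[rest, rest] - L[rest, kk, drop = FALSE] %*% t(L[rest, kk, drop = FALSE])
    }
  }
  L   # the input A equals L %*% t(L) up to rounding
}

PLASMA runs these block operations as independent tasks as soon as their inputs are ready; the block size has to leave enough tasks to keep 16 cores busy while each block still fits in cache.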

Page 9:

[Figure: "PLASMA using 1 Node (# of Observations = 25,000)" — speedup vs. 1 core (0-15) against # of cores (1, 2, 4, 8, 12, 16); series: Speedup, Optimal Speedup.]

Page 10:

[Figure: "PLASMA on Dual Socket Sandy Bridge (# of Observations = 15,000, Cores = 16)" — time (sec, 3-7) against block size (500-1500), with the 256 KB and 40 MB cache sizes marked.]

Page 11:

[Figure: "PLASMA Optimal Block Sizes (Cores = 16)" — optimal block size (100-600) against # of observations (0-40,000).]

Page 12:

Utilizes GPUs or Xeon Phi for parallelization
◦ Multiple GPU & multiple Xeon Phi implementations available
◦ 1 CPU core drives 1 GPU

Block Scheduling
◦ Similar to PLASMA

Block size dependent on accelerator architecture

Page 13:

Proprietary CUDA-based linear algebra package

Capable of doing LAPACK operations using 1 GPU

API written in C

Dense & sparse operations available

Page 14:

1 Node of Caldera or Pronghorn
◦ 2 x 8-core Intel Xeon E5-2670 (Sandy Bridge) processors per node
  64 GB RAM (~59 GB available)
  Cache per core: L1 = 32 KB, L2 = 256 KB
  Cache per socket: L3 = 20 MB
◦ 2 x Nvidia Tesla M2070Q GPUs (Caldera)
  ~5.2 GB RAM per device
  1 core drives 1 GPU
◦ 2 x Xeon Phi 5110P (Pronghorn)
  ~7.4 GB RAM per device

Page 15:

• Serial R: ~3 GFLOP/sec

• Theoretical peak performance:
◦ 16-core Xeon Sandy Bridge: ~333 GFLOP/sec
◦ 1 Nvidia Tesla M2070Q: ~512 GFLOP/sec
◦ 1 Xeon Phi 5110P: ~1,011 GFLOP/sec
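For context, the 16-core figure matches the usual peak-rate arithmetic, assuming the E5-2670's 2.6 GHz base clock and 8 double-precision FLOPs per cycle per core with AVX:

16 cores x 2.6 GHz x 8 FLOPs/cycle ≈ 333 GFLOP/sec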

[Figure: "Accelerated Hardware has Room for Improvement" — achieved GFLOP/sec (0-400) against # of observations (0-40,000); series: Plasma (16 cores), Magma 1 GPU, Magma 2 GPUs, Magma 1 MIC, Magma 2 MICs, CULA.]

Page 16:

[Figure: "All Parallel Cholesky Implementations are Faster than Serial R" — time (sec, log scale 0.01-1000) against # of observations (0-40,000); series: Serial R, Plasma (16 cores), CULA, Magma 1 GPU, Magma 2 GPUs, Magma 1 Xeon Phi, Magma 2 Xeon Phis.]

• >100 Times Speedup over serial R when # of Observations = 10k

Page 17:

• ~6 Times Speedup over serial R when # of Observations = 10k

[Figure: "Eigendecomposition also Faster on Accelerated Hardware" — time (sec, 0-300) against # of observations (0-10,000); series: Serial R, CULA, Magma 1 GPU, Magma 2 GPUs.]

Page 18:

• Both times taken using MAGMA w/ 2 GPUs

[Figure: "Can Run ~30 Cholesky Decompositions per Eigen Decomposition" — ratio of eigendecomposition time to Cholesky time (0-30) against # of observations (0-10,000).]

Page 19:

• If we want to do 16 Cholesky decompositions in parallel, we are guaranteed better performance when the speedup is >16 (see the sketch after the figure)

[Figure: "Parallel Cholesky Beats Parallel R for Moderate to Large Matrices" — speedup vs. parallel R (0-25) against # of observations (0-20,000); series: Plasma, Magma 2 GPUs.]
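A minimal sketch of the kind of "parallel R" baseline this comparison assumes: independent serial chol() calls spread over the 16 cores with parallel::mclapply. The exact baseline behind the figure is not spelled out here, so treat this as illustrative.

library(parallel)
n <- 5000
A <- crossprod(matrix(rnorm(n * n), n, n)) + n * diag(n)   # SPD test matrix
# 16 independent factorizations, one serial chol() running on each core:
Ls <- mclapply(1:16, function(i) chol(A), mc.cores = 16)

Since this baseline already keeps all 16 cores busy, replacing each serial chol() with a single PLASMA/MAGMA factorization only pays off once that factorization's speedup over one core exceeds 16.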

Page 20:

Using Caldera
◦ Single Cholesky Decomposition
◦ Matrix size < 20k: use PLASMA (16 cores w/ optimal block size)
◦ Matrix size 20k – 35k: use MAGMA w/ 2 GPUs
◦ Matrix size > 35k: use PLASMA (16 cores w/ optimal block size)
◦ (see the dispatch sketch below)

Dependent on computing resources available
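A hypothetical dispatch following this recommendation; plasmaChol() and magmaChol() are placeholder names standing in for the project's R wrappers, not its actual API (the stubs just fall back to base chol() so the sketch runs):

plasmaChol <- function(A, cores = 16) chol(A)   # stub for the PLASMA wrapper (illustrative)
magmaChol  <- function(A, ngpu = 2)  chol(A)    # stub for the MAGMA wrapper (illustrative)

cholDispatch <- function(A) {
  n <- nrow(A)
  if (n > 20000 && n <= 35000) {
    magmaChol(A, ngpu = 2)        # 20k-35k observations: MAGMA w/ 2 GPUs
  } else {
    plasmaChol(A, cores = 16)     # otherwise: PLASMA, 16 cores w/ optimal block size
  }
}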

Page 21:

Explored implementation on accelerated hardware
◦ GPUs
◦ Multicore (Shared Memory)
◦ Xeon Phis

Installed third-party linear algebra packages & programmed wrappers that call these packages from R
◦ Installation instructions and programs available through a Bitbucket repo; for access contact Srinath Vadlamani

Future Work
◦ Multicore Distributed Memory
◦ Single Precision

Page 22:

Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL: http://CRAN.R-project.org/package=fields. R package version 7.1.

Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.

Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du, and Jack Dongarra. A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators. Proc. of VECPAR'10, Berkeley, CA, June 22-25, 2010.

Jack Dongarra, Mark Gates, Azzam Haidar, Yulu Jia, Khairul Kabir, Piotr Luszczek, and Stanimire Tomov. Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi. PPAM 2013, Warsaw, Poland, September, 2013.

Page 23:

[Figure: task breakdown of a blocked Cholesky factorization into tile kernels — xPOTRF (diagonal block factorization), xTRSM (panel triangular solves), xSYRK (symmetric rank-k trailing updates), xGEMM (general trailing updates) — ending with the final factored matrix.]

• http://www.netlib.org/lapack/lawnspdf/lawn223.pdf