Unai Lopez Novoa 19 June 2015 Phd Dissertation Advisors: Jose Miguel-Alonso & Alexander Mendiburu Contributions to the Efficient Use of General Purpose Coprocessors: Kernel Density Estimation as Case Study


Page 1: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Unai Lopez Novoa

19 June 2015

PhD Dissertation

Advisors: Jose Miguel-Alonso & Alexander Mendiburu

Contributions to the Efficient Use of General Purpose Coprocessors: Kernel Density Estimation as Case Study

Page 2: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Outline

• Introduction

• Contributions

1) A Survey of Performance Modeling and Simulation Techniques

2) S-KDE: An Efficient Algorithm for Kernel Density Estimation

    • And its implementation for Multi and Many-cores

3) Implementation of S-KDE in General Purpose Coprocessors

4) A Methodology for Environmental Model Evaluation based on S-KDE

• Conclusions

Page 3: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Introduction


Page 4: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

High Performance Computing

• Branch of computer science related to the use of parallel architectures to solve complex computational problems

• Today’s fastest supercomputer: Tianhe-2 (China’s National University of Defense Technology, 33.86 PFLOP/s)

Page 5: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

HPC Environments

• Traditional HPC systems were homogeneous, built around single or multi-core CPUs

• But supercomputers are becoming heterogeneous

(Figure: evolution of the number of coprocessors in the Top500 list over time)

Page 6: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Compute platforms

Device | Features | Peak D.P. Performance
Multi-Core CPUs | Branch prediction, out-of-order execution (OoOE); “Versatile” | Up to 250 GFLOP/s
Graphics Processing Units | Hundreds of cores; handle thousands of threads | Up to 1.8 TFLOP/s
Many-Core Processors | Tens of x86 cores; HyperThreading | Up to 1 TFLOP/s

Page 7: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Motivation

• Examples of successful porting of applications to accelerators (compared against multi-core implementations):

    • SAXPY: 11.8x

    • Polynomial Equation Solver: 79x

    • Image Treatment (MRI): 263x

    • …

• … but this is not applicable to every HPC code

Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. ACM, 2008. (>700 cites on Google Scholar)

Page 8: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Difficulties using accelerators

• Suitable codes for accelerators should:

    • Expose high levels of parallelism

    • Have good spatial/temporal data locality

    • …

• Porting a code requires extensive program rewriting

• Development tools for accelerators are not as polished as those for CPUs

Effectively exploiting the performance of a coprocessor remains a challenging task

Page 9: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Structure of this thesis

Motivation: Discuss the issues to efficiently use general purpose coprocessors

    • A survey of performance modeling and simulation techniques

Case Study: Kernel Density Estimation applied to environmental model evaluation

    • Design of a novel algorithm for Kernel Density Estimation: S-KDE

    • S-KDE for Multi & Many-Cores

    • S-KDE for Accelerators

    • A methodology for environmental model evaluation based on S-KDE

Page 10: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

A Survey of Performance Modeling and Simulation Techniques


Page 11: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Developing for accelerators

• Approaches/aids:

    • Trial and error

    • Profilers / Debuggers / …

    • Performance Models

Page 12: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

A survey of models and simulators

• Accelerator & GPGPU trend began ~2005

• First performance models appeared ~2007

• Abundant literature

• No outstanding models or tools

Page 13: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Taxonomy

• Execution time estimation

• Bottleneck highlighting

• Power consumption estimation

• Simulators

Page 14: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Model analysis

• We analysed 29 relevant accelerator models

• For each of them we summarized and identified:

    • Modeling method (Analytical, Machine Learning,…)

    • Target platforms and test devices

    • Input preprocessing requirements

    • Limitations

    • Highlights over other models

Page 15: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

The MWP-CWP model

• Presented by Hong & Kim in 2009 (>360 cites in Google Scholar)

• Estimates the execution time of a GPU application

• Based on how Warps are scheduled in NVIDIA GPUs

Method: Analytical
Test platform: NVIDIA GPUs (8800GT,…)
Input requirements: Run µbenchmarks & parse PTX
Limitations: Branches not modeled
Highlights: Extendable to non-NVIDIA GPUs

Page 16: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

The Roofline model

• Presented by Williams et al. in 2009 (>450 cites in Google Scholar)

• Outstanding model for bottleneck highlighting

• Visual model (figure: Roofline plot)

Method: Analytical
Test platform: Multi-core CPUs & Accelerators
Input requirements: Run µbenchmarks & analyse application
Limitations: Depends on architecture
Highlights: Visual output to guide optimizations
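As a rough illustration of the Roofline idea, attainable performance is bounded by the lower of two "roofs": the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOP per byte moved). The sketch below assumes this basic two-roof form only; the function names and the device figures in the usage lines are hypothetical and do not come from the slides.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    memory_roof = bandwidth_gbs * arithmetic_intensity
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory bound."""
    return peak_gflops / bandwidth_gbs


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
low_intensity = roofline(1000.0, 200.0, 0.25)   # limited by the memory roof
high_intensity = roofline(1000.0, 200.0, 10.0)  # limited by the compute roof
```

A kernel whose intensity lies left of the ridge point is memory bound, so optimizations should target data movement first; to the right of it, compute throughput is the limit.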

Page 17: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Performance tools

• Some models require running performance tools (µbenchmarks, profilers,…)

• We have reviewed them as well

Page 18: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Conclusions

1) There is no accurate model valid for a wide set of architectures

2) Most models are tied to CUDA

3) There is a growing interest in analyzing power

4) It was impossible to make a comparison of the models (lack of details, codes, …)


Page 19: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

S-KDE: An Efficient Algorithm for Kernel Density Estimation

(and its implementation for Multi and Many-cores)


Page 20: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Case study

• Collaborative work: EOLO, UPV/EHU Climate and Meteorology Group

• Scenario: Environmental Model Evaluation

• Problem: Excessive execution times of KDE

Page 21: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Kernel Density Estimation

• Statistical technique used to estimate the Probability Density Function (PDF) of a random variable with unknown characteristics

• In the estimator's formula:

    • x_i are the samples from the random variable

    • K is the Kernel function

    • H is the bandwidth value
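The slide's own equation is not part of this transcript. For reference, a common multivariate form of the estimator that is consistent with the symbols listed above is given below. It is an assumption about the form used: n (the number of samples) and the scaled kernel K_H are notation introduced here, and H is treated as a bandwidth matrix.

```latex
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H(x - x_i),
\qquad
K_H(u) = |H|^{-1/2} \, K\!\left(H^{-1/2} u\right)
```

In one dimension with a scalar bandwidth h this reduces to the familiar form with a factor 1/(n h) and kernel argument (x - x_i)/h.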

Page 22: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Kernel function

• Symmetric function that integrates to one

• We classify them according to area of influence: Bounded or Unbounded
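To make the bounded/unbounded distinction concrete, the sketch below compares two standard one-dimensional kernels: Epanechnikov (zero outside |u| <= 1) and Gaussian (positive everywhere). The choice of these two as examples and the helper `integrate` are assumptions for illustration; the slide only shows the two classes.

```python
import math


def epanechnikov(u):
    """Bounded kernel: non-zero only for |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def gaussian(u):
    """Unbounded kernel: strictly positive for every u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def integrate(kernel, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of `kernel` over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * dx) for i in range(steps)) * dx


# Both kernels are symmetric and integrate to (approximately) one.
area_bounded = integrate(epanechnikov, -1.0, 1.0)
area_unbounded = integrate(gaussian, -10.0, 10.0)
```

A bounded kernel means each sample can only contribute to evaluation points within a finite region, which is the property a sample-centered algorithm can exploit.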

Page 23: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Bandwidth

• Parameter to control the smoothness of the estimation

• It must be carefully selected

• Common approaches for its selection:

    • Heuristic, as in Silverman, 1986

    • Iterative technique, e.g., bootstrapping
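As an example of the heuristic route, the sketch below implements the widely quoted one-dimensional rule of thumb from Silverman (1986), h = 0.9 · min(σ, IQR/1.34) · n^(-1/5). The slide does not say which variant the thesis uses, and this rule is derived for Gaussian kernels in one dimension, so treat it as an illustration only; the function name is hypothetical.

```python
import statistics


def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    sd = statistics.stdev(samples)
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    spread = min(sd, (q3 - q1) / 1.34)
    return 0.9 * spread * n ** (-0.2)


# Toy data: the bandwidth scales with the spread of the samples.
data = [0.1 * i for i in range(100)]
h = silverman_bandwidth(data)
h_scaled = silverman_bandwidth([2.0 * x for x in data])
```

An iterative technique such as bootstrapping instead evaluates the KDE many times with candidate bandwidths, which is where a fast KDE implementation pays off.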

Page 24: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Computing KDE

Naive approach: EP-KDE

for each eval_point e in E
    for each sample s in S
        d = distance(e, s)
        pdf[e] += density(d)

Complexity: O(|E|·|S|)

Our proposal: S-KDE

for each sample s in S
    B = findInfluenceArea(s)
    for each eval_point e in B
        d = distance(e, s)
        pdf[e] += density(d)

Complexity: O(|B|·|S|)
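The two loops can be compared with a minimal one-dimensional sketch. It assumes an Epanechnikov kernel, a scalar bandwidth `h`, and a regular evaluation grid defined by `x0`, `dx` and `m`; all of these names are illustrative and not taken from the thesis code. With a bounded kernel, the influence area of a sample is simply the set of grid points in [s - h, s + h].

```python
import math


def epanechnikov(u):
    """Bounded 1D kernel, zero outside |u| <= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def ep_kde(samples, grid, h):
    """Naive EP-KDE: every evaluation point visits every sample."""
    n = len(samples)
    return [sum(epanechnikov((e - s) / h) for s in samples) / (n * h)
            for e in grid]


def s_kde(samples, x0, dx, m, h):
    """S-KDE sketch: each sample updates only the grid points it can influence."""
    n = len(samples)
    pdf = [0.0] * m
    for s in samples:
        # Influence area of s: grid indices covering [s - h, s + h].
        lo = max(0, math.ceil((s - h - x0) / dx))
        hi = min(m - 1, math.floor((s + h - x0) / dx))
        for j in range(lo, hi + 1):
            e = x0 + j * dx
            pdf[j] += epanechnikov((e - s) / h) / (n * h)
    return pdf


# Toy usage: both approaches yield the same density on the same grid.
samples = [0.1, 0.5, 0.52, 0.9]
x0, dx, m, h = -1.0, 0.01, 301, 0.2
grid = [x0 + j * dx for j in range(m)]
naive = ep_kde(samples, grid, h)
fast = s_kde(samples, x0, dx, m, h)
```

Here the naive loop performs 301 kernel evaluations per sample, while the sample-centered loop performs about 41, matching the O(|E|·|S|) versus O(|B|·|S|) contrast.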

Page 25: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Delimiting the influence area

• Depends on the Kernel

• Our case: Epanechnikov kernel

• Technique based on a method in Fukunaga, 1990
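The slides do not spell out how the influence area is delimited. One standard geometric fact that fits the description is that, for a bounded kernel whose support is the ellipsoid (x - c)ᵀ H⁻¹ (x - c) <= 1, the tightest axis-aligned bounding box has half-widths sqrt(H_ii). The sketch below assumes exactly that setting; `bounding_box` and the sample matrix are hypothetical and this is not necessarily the method from Fukunaga, 1990.

```python
import math


def bounding_box(center, H):
    """Axis-aligned box enclosing {x : (x - c)^T H^{-1} (x - c) <= 1}.

    H is a symmetric positive-definite matrix given as nested lists.
    """
    half = [math.sqrt(H[i][i]) for i in range(len(center))]
    lo = [c - w for c, w in zip(center, half)]
    hi = [c + w for c, w in zip(center, half)]
    return lo, hi


# Toy 2D example with a correlated bandwidth matrix.
center = [1.0, -2.0]
H = [[4.0, 1.0], [1.0, 1.0]]
lo, hi = bounding_box(center, H)
```

Only evaluation points inside this box need to be visited for the sample at `center`; points inside the box but outside the ellipsoid still evaluate to zero density.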

Page 26: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Chop & Crop

• In spaces of dimensionality 3 and higher, the number of evaluation points outside the influence area increases

• We developed a technique to further reduce evaluations:

    Step 1: Chop the box into slices

    Step 2: Crop the slice
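One possible reading of the two steps, sketched for the simplest case: a spherical influence area of radius `r` on an integer 3D grid. The box is chopped into 2D slices along z, and each slice is cropped to the bounding square of the sphere's cross-section at that height. The spherical shape, the crop rule and all function names are assumptions made for illustration; the slides do not detail them.

```python
import math


def count_box(r):
    """Grid points in the bounding cube of a ball of radius r (grid step 1)."""
    side = 2 * r + 1
    return side ** 3


def count_chop_and_crop(r):
    """Chop the cube into z-slices, then crop each slice to its cross-section's box."""
    total = 0
    for z in range(-r, r + 1):                # Step 1: one 2D slice per z
        slice_radius = math.isqrt(r * r - z * z)
        side = 2 * slice_radius + 1           # Step 2: crop to the bounding square
        total += side * side
    return total


def count_ball(r):
    """Brute-force count of grid points actually inside the ball."""
    rng = range(-r, r + 1)
    return sum(1 for x in rng for y in rng for z in rng
               if x * x + y * y + z * z <= r * r)


box = count_box(10)
cropped = count_chop_and_crop(10)
inside = count_ball(10)
```

The cropped count sits between the number of points truly inside the ball and the full box count, so fewer useless distance-density evaluations are performed.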

Page 27: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Example numbers

• 3D dataset with 500k samples

• Evaluation space of 194M points

• EP-KDE: 9.74 × 10^13 distance-density evaluations

• S-KDE: 102461 evaluation points per bounding box (average), 5.12 × 10^10 evaluations

• S-KDE + C&C: 53511 evaluation points per bounding box (average), 2.67 × 10^10 evaluations
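The figures above follow from multiplying the number of samples by the number of evaluation points each sample touches. The short check below reproduces that arithmetic; it assumes exactly 500,000 samples (the slide rounds to "500k"), so the products match the slide's values only approximately.

```python
samples = 500_000            # "500k" on the slide; exact count is an assumption
ep_kde_evals = 9.74e13       # EP-KDE total, as stated on the slide
box_points = 102_461         # average evaluation points per bounding box
cropped_points = 53_511      # average after Chop & Crop

s_kde_evals = samples * box_points
s_kde_cc_evals = samples * cropped_points
implied_eval_space = ep_kde_evals / samples   # points in the full evaluation space

reduction_s_kde = ep_kde_evals / s_kde_evals
reduction_cc = ep_kde_evals / s_kde_cc_evals

print(f"S-KDE evaluations:       {s_kde_evals:.3e}")
print(f"S-KDE + C&C evaluations: {s_kde_cc_evals:.3e}")
print(f"Implied evaluation space: {implied_eval_space:.3e}")
print(f"Reduction factors: {reduction_s_kde:.0f}x, {reduction_cc:.0f}x")
```

Under these assumptions S-KDE performs on the order of a couple of thousand times fewer evaluations than EP-KDE, and Chop & Crop roughly halves the remainder.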

Page 28: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

S-KDE in OpenMP

1) Initialization

2) Distribute samples to threads

3) Fit bounding box

4) Chop into slices

5) Crop and compute density

6) Accumulate density to evaluation space

OpenMP directives used: #pragma omp for, #pragma simd, #pragma omp atomic

Page 29: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

S-KDE in OpenMP

• Targeting Multi and Many-core processors

• Tested platforms:

    • Intel Core i7 3820 CPU (4 cores @ 3.6 GHz)

    • Intel Xeon Phi 3120A (57 cores @ 1.1 GHz, native mode)

• Public KDE implementations used as yardsticks:

    • Ks-kde (R package)

    • GPUML

    • Several Python libraries

Page 30: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Execution time comparison


Page 31: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Conclusions

1) S-KDE + Chop & Crop reduces KDE complexity

2) Native, parallel implementation for Multi and Many-core processors

• OpenMP

3) We beat state-of-the-art alternatives


Page 32: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Implementation of S-KDE in General Purpose Coprocessors


Page 33: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

S-KDE in OpenCL

(1) Initialization

(2) Fit box & Chop

(3) Crop

(4) Offset calculation

(5) Density computation

(6) Density transfer

(7) Density accumulation

Legend: • Host code • Accelerator code

Page 34: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Execution time comparison


Page 35: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Conclusions

1) OpenCL version of S-KDE provides good overall performance

2) The consolidation stage is the main bottleneck

3) The code is close to the limits of the accelerators

4) Further performance gains using pipelined execution


Page 36: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

A Methodology for Environmental Model Evaluation based on S-KDE


Page 37: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Climate models

• Mathematical representations of a climate system, based on physical, chemical and biological principles

• They predict long-term trends

• Recently used to assess the impact of greenhouse gases

Page 38: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Climate model evaluation

• Models must be validated against actual observations

• There is no universally accepted validation strategy

• Popular approaches:

    • Averaged values per estimated variable

    • Evaluating the per-variable Probability Density Functions (PDFs)

Page 39: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

PDF-based model evaluation

• Current approaches:

1) Compute the PDF per estimated variable

2) Calculate a per-variable similarity score against observations

3) Combine the scores to get the global performance of the model

• Lack of a universally accepted way to combine the scores

• Our proposal:

    • An extension of the score by [1] to multiple dimensions

    • A methodology to evaluate multiple variables in a single step

[1]: Perkins, S. E., et al. "Evaluation of the AR4 climate models' simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions." Journal of climate 20.17 (2007): 4356-4376.
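The score in [1] measures the overlap between two discretized PDFs as the sum, over bins, of the smaller of the two probabilities: 1 means identical distributions and 0 means no overlap. The sketch below applies that definition to flat lists of bin probabilities. How the thesis extends it to multiple dimensions is not detailed on the slides; the assumption here is that the cells of a d-dimensional grid are simply flattened into one list. Function names are hypothetical.

```python
def normalize(counts):
    """Turn non-negative bin counts into probabilities that sum to one."""
    total = float(sum(counts))
    return [c / total for c in counts]


def overlap_score(p_model, p_obs):
    """Sum of bin-wise minima of two discrete PDFs defined on the same bins."""
    if len(p_model) != len(p_obs):
        raise ValueError("both PDFs must use the same bins")
    return sum(min(a, b) for a, b in zip(p_model, p_obs))


# Toy usage with four bins (a 2x2 grid flattened row by row).
model = normalize([2, 2, 0, 0])
observations = normalize([0, 2, 2, 0])
score = overlap_score(model, observations)
```

Because the multidimensional PDFs are compared cell by cell, a single score covers all variables at once instead of combining per-variable scores afterwards.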

Page 40: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Methodology

1) Estimate optimal bandwidth (iterative use of KDE)

    • Estimations (MIROCS3.2-MR Model): h = 0.6

    • Observations: h = 0.65

2) Compute PDF with opt. bandwidth (single use of KDE)

    • PDF (Estimations), h = 0.6

    • PDF (Observations), h = 0.65

3) Compute score: S = 0.74

Page 41: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Evaluation

• Models: 7 from the CMIP3 experiment (with different configurations)

• Dataset: 20C3M (1961 to 1998 on a daily basis)

• Variables:

    • Global average of surface temperature

    • Difference in temperature between N and S hemispheres

    • Difference in temperature between Equator and the poles

• Scores for the models:

Model | Score
NCEP | 0.82
MIROCS3.2MR-2 | 0.74
MIROCS3.2MR-3 | 0.73
HADGEM1 | 0.71
MIROCS3.2MR-1 | 0.70
MIROCS3.2-HR | 0.67
GFDL-CM2.1 | 0.62
GFDL-CM2.0 | 0.60
BCM2.0 | 0.51
ECHAM5 | 0.48
MRI-RUN03 | 0.30
MRI-RUN04 | 0.29
MRI-RUN01 | 0.29
MRI-RUN02 | 0.29
MRI-RUN05 | 0.28

Page 42: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Evaluation

(Figure: two panels; surface = observations, contour = model. C0: global average surface temperature. C1: difference in temperature between hemispheres.)

• MIROC3.2-MR-RUN02: Score = 0.74

• MRI-RUN01: Score = 0.29

Page 43: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Conclusions

1) We have presented a methodology based on the extension to multiple dimensions of the index by Perkins et al.

2) It allows evaluating multiple variables of an environmental model in a single step

3) It is feasible in time thanks to the use of a fast implementation of KDE: S-KDE

Page 44: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Conclusions


Page 45: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Summary of contributions

• We have conducted an extensive survey on performance models using a proposed taxonomy

• We have designed S-KDE, a technique that reduces the complexity of Kernel Density Estimation computations

• We have implemented S-KDE for Multi and Many-cores using OpenMP

    • Outperforming the state-of-the-art parallel codes for KDE

Page 46: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Summary of contributions

• We have presented an OpenCL implementation of S-KDE for general purpose coprocessors

    • It reaches the limits of the devices and acceptable performance, but requires further work

• We have designed a methodology for environmental model evaluation based on KDE that allows evaluating multiple variables from a model accurately and in a simple way

    • S-KDE is a key, enabling element

Page 47: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Future work

• We intend to develop a methodology for the performance evaluation of accelerator-based applications, based on the survey presented as first contribution

• We need to improve S-KDE in both multi-cores and coprocessors

    • In particular, the consolidation stage

• We intend to design a technique to analyse new climate data from the CMIP Project, with dimensionality up to ten

Page 48: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Publications


Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso. A survey of performance modeling and simulation techniques for accelerator-based computing. IEEE Transactions on Parallel and Distributed Systems, 26(1):272–281, Jan 2015

Unai Lopez-Novoa, Jon Saenz, Alexander Mendiburu, and Jose Miguel-Alonso. An efficient implementation of kernel density estimation for multi-core & many-core architectures. International Journal of High Performance Computing Applications, Accepted, 2015, DOI: 10.1177/1094342015576813

Page 49: Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Case Study [PhD Defense]

Publications


Unai Lopez-Novoa, Alexander Mendiburu, and Jose Miguel-Alonso. Kernel density estimation in accelerators: Implementation and performance evaluation. Parallel Computing. To be submitted.

Unai Lopez-Novoa, Jon Saenz, Alexander Mendiburu, Jose Miguel-Alonso, Inigo Errasti, Ganix Esnaola, Agustin Ezcurra, and Gabriel Ibarra-Berastegi. Multi-objective environmental model evaluation by means of multidimensional kernel density estimators: Efficient and multi-core implementations. Environmental Modelling & Software, 63:123 – 136, 2015
