Performance & Energy Optimization @

Md Abdullah Shahneous Bari, Abid M. Malik, Millad Ghane, Ahmad Qawasmeh, Barbara M. Chapman

11/28/15

Layout of the talk

• Overview
• Motivation
• Factors that affect performance & energy optimization
• Experimental Results
• Conclusion & Future Work

OpenMP

• De facto standard for shared-memory parallel programming
• Thread-based parallelism
• Mainly two kinds of parallelism
  • Regular parallelism (work-sharing constructs)
  • Irregular parallelism (task-based constructs)

Main Barrier Towards Exascale Computing…

• Power, power and power
• 20 MW power limit for exascale machines (DOE)
• Usually the processor vendors' concern
• But to reach exascale within that limit, the software stack has to chip in
• Any solution?

Power Constrained Computing (Overprovisioning)

• Usually, not all applications use the maximum node power all the time
• Capping the power at a lower limit
  • Allows extra nodes to be added within the same power budget

[Diagram labels: Extra Node, Extra Compute Power]
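The arithmetic behind overprovisioning can be sketched as follows. This is only an illustration: the system budget `BUDGET_W` is a hypothetical number, and the cap levels listed under Experimental Details are reused here as if they were per-node caps, which is an assumption (on the test machine they may apply per socket).

```python
def nodes_within_budget(system_budget_w: float, per_node_cap_w: float) -> int:
    """Number of nodes that fit in a fixed power budget when each node is capped."""
    if per_node_cap_w <= 0:
        raise ValueError("per-node cap must be positive")
    return int(system_budget_w // per_node_cap_w)


# Hypothetical system budget, chosen only to keep the arithmetic easy to follow.
BUDGET_W = 11_500.0
# Cap levels taken from the talk; treating them as per-node caps is an assumption.
CAPS_W = [115, 100, 85, 70, 55]


def overprovisioning_table(budget_w=BUDGET_W, caps_w=CAPS_W):
    """Return (cap, nodes, extra nodes vs. the highest cap) for each cap level."""
    baseline = nodes_within_budget(budget_w, max(caps_w))
    rows = []
    for cap in caps_w:
        nodes = nodes_within_budget(budget_w, cap)
        rows.append((cap, nodes, nodes - baseline))
    return rows


if __name__ == "__main__":
    for cap, nodes, extra in overprovisioning_table():
        print(f"cap {cap:>3} W -> {nodes:>3} nodes ({extra:+d} vs. highest cap)")
```

Lower caps admit more nodes, but each capped node may run slower, so the extra nodes pay off only if per-node performance under the cap holds up.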

Power Constrained Computing (Contd.)

• More focus on overall system-level performance
• Some related work
  • Sarood et al. [1]
  • Patki et al. [2]
  • Rountree et al. [3]

1.  Sarood, Osman, et al. "Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems." Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013.

2.  Patki, Tapasya, et al. "Exploring hardware overprovisioning in power-constrained, high performance computing." Proceedings of the 27th international ACM conference on International conference on supercomputing. ACM, 2013.

3.  Rountree, Barry, et al. "Beyond DVFS: A first look at performance under a hardware-enforced power bound." Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.

Why OpenMP?

• Current issue: less focus on per-node performance
• Challenge: to reach peak throughput, per-node performance must be improved
• OpenMP is the most popular choice for intra-node parallelism

Factors That Impact Work Sharing Parallelism…

• How many workers are working? ~ Threads
• How is the work scheduled? ~ Scheduling Policy
• How much work are they given at one time? ~ Chunk Size
• How is the data laid out for the workers? ~ Thread Affinity
• What do the workers do during their break? ~ Wait Policy
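To make the interplay of thread count and chunk size concrete, the following sketch simulates how a static schedule with an explicit chunk size hands out loop iterations: chunks are assigned to threads round-robin in thread order. The function name `static_schedule` is made up for this illustration; it models the schedule in plain Python and does not call any OpenMP runtime.

```python
from collections import defaultdict


def static_schedule(n_iterations: int, n_threads: int, chunk_size: int) -> dict:
    """Map each thread id to the loop iterations it would execute.

    Models a static schedule with a chunk size: the iteration space is cut
    into chunks of `chunk_size` and chunks are dealt out round-robin.
    """
    if n_threads < 1 or chunk_size < 1:
        raise ValueError("n_threads and chunk_size must be at least 1")
    assignment = defaultdict(list)
    for chunk_index, start in enumerate(range(0, n_iterations, chunk_size)):
        thread = chunk_index % n_threads  # round-robin over threads
        end = min(start + chunk_size, n_iterations)
        assignment[thread].extend(range(start, end))
    # Include threads that received no work, so idle workers are visible.
    return {t: assignment[t] for t in range(n_threads)}


if __name__ == "__main__":
    for thread, iterations in static_schedule(10, 2, 2).items():
        print(f"thread {thread}: {iterations}")
```

With 10 iterations, 2 threads and a chunk size of 2, thread 0 receives three chunks and thread 1 two, which shows how chunk size alone can introduce load imbalance.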

Experimental Details

• Selected parameters
  • No. of threads (2, 4, 8, 16, 24, 32)
  • Scheduling policy (STATIC, DYNAMIC, GUIDED)
  • Chunk size (1, 8, 32, 64, 128, 256, 512)
  • Wait policy (active, passive)
  • Thread affinity (OMP_PLACES + OMP_PROC_BIND)
• Power cap levels
  • 55, 70, 85, 100, 115 W
• Technology used
  • Intel RAPL (for power capping & energy measurement)
  • OMPT for kernel-level measurement
• Benchmark ~ NPB
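The size of this search space can be sketched by enumerating the listed values and mapping each combination to the standard OpenMP environment variables. This is a sketch under assumptions: the slides do not say how the parameters were actually set, the helper names are hypothetical, and thread affinity is left out because no OMP_PLACES or OMP_PROC_BIND values are given. Note that OMP_SCHEDULE only affects loops declared with `schedule(runtime)`.

```python
from itertools import product

# Parameter values as listed on the slide.
THREADS = [2, 4, 8, 16, 24, 32]
SCHEDULES = ["static", "dynamic", "guided"]
CHUNKS = [1, 8, 32, 64, 128, 256, 512]
WAIT_POLICIES = ["active", "passive"]
POWER_CAPS_W = [55, 70, 85, 100, 115]


def openmp_env(threads: int, schedule: str, chunk: int, wait_policy: str) -> dict:
    """Build the environment variables for one configuration (hypothetical helper)."""
    return {
        "OMP_NUM_THREADS": str(threads),
        "OMP_SCHEDULE": f"{schedule},{chunk}",  # honored by schedule(runtime) loops
        "OMP_WAIT_POLICY": wait_policy,
    }


def all_configurations():
    """Yield (power cap in W, environment dict) for every combination."""
    for cap, threads, schedule, chunk, wait in product(
        POWER_CAPS_W, THREADS, SCHEDULES, CHUNKS, WAIT_POLICIES
    ):
        yield cap, openmp_env(threads, schedule, chunk, wait)


if __name__ == "__main__":
    configs = list(all_configurations())
    per_cap = len(configs) // len(POWER_CAPS_W)
    print(f"{per_cap} configurations per power cap, {len(configs)} in total")
```

Even without affinity settings, the product is large enough that exhaustive search per kernel is expensive, which is one reason to look at modeling or dynamic adaptation.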

[Figure: grouped bar chart. Y-axis: % Performance Improvement (0 to 100). X-axis: Kernels (CG_conj_grad_1, CG_main_1, CG_main_2, CG_main_3, CG_main_4, CG_main_5, CG_main_6, EP_main_3, FT_c_s1_1, FT_c_s3_1, FT_c_ts2_1, *FT_c_i_1, **FT_c_i_c_1, FT_evolve_1, FT_init_ui_1, IS_alloc_key_buff_1, IS_create_seq_1, LU_erhs_1, LU_setbv_1, LU_setiv_1, MG_zero3_1, MG_zran3_1, MG_zran3_2, MG_zran3_3, SP_add_1, SP_compute_rhs_1, SP_error_norm_1, SP_exact_rhs_1, SP_initialize_1, SP_ninvr_1, SP_pinvr_1, SP_rhs_norm_1, SP_txinvr_1, SP_tzetar_1, SP_x_solve_1, SP_y_solve_1, SP_z_solve_1, UA_geom1_2, UA_mortar_3, UA_move_1). Series: power caps of 55 W, 70 W, 85 W, 100 W, 115 W.]

Performance improvement using the best configuration compared to default across all kernels

[Figure: grouped bar chart. Y-axis: % Energy Improvement (-20 to 100). X-axis: Kernels (CG_conj_grad_1, CG_main_1, CG_main_2, CG_main_3, CG_main_4, CG_main_5, CG_main_6, EP_main_3, FT_c_s1_1, FT_c_s3_1, FT_c_ts2_1, *FT_c_i_1, **FT_c_i_c_1, FT_evolve_1, FT_init_ui_1, IS_alloc_key_buff_1, IS_create_seq_1, LU_erhs_1, LU_setbv_1, LU_setiv_1, MG_zero3_1, MG_zran3_1, MG_zran3_2, MG_zran3_3, SP_add_1, SP_compute_rhs_1, SP_error_norm_1, SP_exact_rhs_1, SP_initialize_1, SP_ninvr_1, SP_pinvr_1, SP_rhs_norm_1, SP_txinvr_1, SP_tzetar_1, SP_x_solve_1, SP_y_solve_1, SP_z_solve_1, UA_geom1_2, UA_mortar_3, UA_move_1). Series: power caps of 55 W, 70 W, 85 W, 100 W, 115 W.]

Energy consumption improvement using the best configuration compared to default across all kernels
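The slides do not define how the improvement percentages are computed. A common definition, assumed here, is the reduction relative to the default configuration; the function name and the sample numbers below are illustrative only and are not measurements from the talk.

```python
def improvement_percent(default_value: float, tuned_value: float) -> float:
    """Relative improvement of a tuned run over the default, in percent.

    Works for any lower-is-better metric such as execution time (s) or
    energy (J). A negative result means the tuned run was worse.
    This formula is an assumed definition, not taken from the slides.
    """
    if default_value <= 0:
        raise ValueError("default_value must be positive")
    return 100.0 * (default_value - tuned_value) / default_value


if __name__ == "__main__":
    # Illustrative numbers only.
    print(improvement_percent(4.0, 1.0))    # tuned run uses a quarter of the default
    print(improvement_percent(10.0, 12.0))  # tuned run is worse than the default
```

Under this definition a configuration can improve execution time while giving a negative energy improvement, which would be consistent with the energy chart's axis extending below zero.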

[Figure: grouped bar chart. Y-axis: Execution Time (Sec), 0 to 0.06. X-axis: Different Power Cap Levels (55W, 70W, 85W, 100W, 115W). Series: Best Configuration, Default Configuration, Default Configuration Without Power Cap. Bar labels, in order of appearance: Best Configuration: 24,GUIDED,8; 24,DYNAMIC,8; 24,GUIDED,8; 24,GUIDED,8; 32,DYNAMIC,8. Default Configuration: 32,STATIC,1 (all five). Default Configuration Without Power Cap: 115W,32,STATIC,1 (all five).]

Execution time comparison among different configurations (an LU kernel)

OpenMP ICVs on DRAM Power

• Developing a model for the power consumption of OpenMP applications

[Figure: three panels, each plotting Power (W) against Data Size. Data Size X means the array size for the STREAM benchmark is 19,200,000*X. These results are based on the STREAM benchmark. Courtesy: Millad Ghane]

Impact of threads & scheduling policy in task-based parallelism

[Figure: two panels, UTS and FloorPlan. Courtesy: Ahmad Qawasmeh]

Ongoing Work

• Dynamic adaptation (APEX)
  • Active Harmony
• Modeling
• Across different software stacks (OpenMP runtimes)
  • OpenUH
  • GCC
  • Intel
• Across different hardware architectures
  • Intel Sandy Bridge
  • IBM POWER8

Future Work

• More concrete configuration selection
• DRAM capping
• Fine-grained (core-level) control
• Other energy-efficient techniques
  • DVFS, frequency modulation, etc.
• Combining it with an inter-node (MPI) programming model for hybrid applications

Summary

• Overview
• Motivation
• Factors that affect performance & energy optimization
• Experimental Results
• Conclusion & Future Work