“OpenMP Does Not Scale”

Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
Santa Clara, CA, USA


Agenda

• The Myth
• Deep Trouble
• Get Real
• The Wrapping

The Myth

“A myth, in popular use, is something that is widely believed but false.” ......

(source: www.reference.com)

"OpenMP Does Not Scale": A Common Myth

A Programming Model Can Not "Not Scale"

What Can Not Scale:
• The Implementation
• The System Versus The Resource Requirements
• Or ..... You

Hmmm .... What Does That Really Mean?

Some Questions I Could Ask

"Do you mean you wrote a parallel program, using OpenMP, and it doesn't perform?"

"I see. Did you make sure the program was fairly well optimized in sequential mode?"

Some Questions I Could Ask

"Oh. You just think it should, and you used all the cores. Have you estimated the speedup using Amdahl's Law?"

"Oh. You didn't. By the way, why do you expect the program to scale?"

"No, this law is not a new EU financial bailout plan. It is something else."
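For reference, Amdahl's Law is the back-of-the-envelope check being asked for here: with a parallel fraction p of the run time and N threads,

   S(N) = \frac{1}{(1 - p) + p/N}

so, for example, p = 0.9 on 16 threads gives S ≈ 6.4, and no number of threads can push the speedup past 1/(1 - p) = 10.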

Some Questions I Could Ask

"I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?"

"Oh. You didn't. Did you minimize the number of parallel regions then?"

"Oh. You didn't. It just worked fine the way it was."

"Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?"
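To make the last question concrete, here is a minimal sketch (the function and array names are made up for illustration): parallelizing the outer loop of a nest creates one parallel region and hands each thread whole rows, whereas parallelizing the innermost loop forks and joins a thread team on every single outer iteration.

   #include <omp.h>

   /* Illustrative sketch only; m, n and the arrays are assumed to be set up elsewhere. */
   void update_bad(int m, int n, double **a, double **b, double *c)
   {
      for (int i = 0; i < m; i++) {        /* outer loop stays serial ...           */
         #pragma omp parallel for          /* ... so a parallel region is forked    */
         for (int j = 0; j < n; j++)       /*     and joined m times                */
            a[i][j] = b[i][j] + c[j];
      }
   }

   void update_good(int m, int n, double **a, double **b, double *c)
   {
      #pragma omp parallel for             /* one fork/join; threads get whole rows */
      for (int i = 0; i < m; i++)
         for (int j = 0; j < n; j++)
            a[i][j] = b[i][j] + c[j];
   }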

More Questions I Could Ask

"Did you at least use the nowait clause to minimize the use of barriers?"

"Oh. You've never heard of a barrier. Might be worth reading up on."

"Do all threads roughly perform the same amount of work?"

"You don't know, but think it is okay. I hope you're right."
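A sketch of what the nowait question is about (the function and arrays are illustrative): every omp for ends in an implicit barrier, and nowait removes it when the loops work on independent data.

   #include <omp.h>

   /* Sketch: two independent loops inside one parallel region. */
   void scale_two_arrays(int n, double *a, double *b)
   {
      #pragma omp parallel
      {
         #pragma omp for nowait      /* no barrier here: a[] and b[] are independent */
         for (int i = 0; i < n; i++)
            a[i] *= 2.0;

         #pragma omp for             /* implicit barrier at the end of this loop     */
         for (int i = 0; i < n; i++)
            b[i] *= 2.0;
      }
   }

For the load-balance question on the same slide, a schedule(dynamic) or schedule(guided) clause on the work-sharing loop is the usual first thing to try when iterations vary in cost.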

I Don't Give Up That Easily

"Did you make optimal use of private data, or did you share most of it?"

"Oh. You didn't. Sharing is just easier. I see."
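A sketch of the private-data point (names are illustrative): a per-thread temporary declared private lives in each thread's own storage instead of being a single shared variable that all threads fight over.

   #include <omp.h>

   /* Sketch: keep per-thread temporaries private instead of shared. */
   void sum_rows(int m, int n, double **b, double *row_sum)
   {
      double tmp;                               /* would be shared by default   */
      #pragma omp parallel for private(tmp)     /* each thread gets its own tmp */
      for (int i = 0; i < m; i++) {
         tmp = 0.0;
         for (int j = 0; j < n; j++)
            tmp += b[i][j];
         row_sum[i] = tmp;
      }
   }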

I Don't Give Up That Easily

"You seem to be using a cc-NUMA system. Did you take that into account?"

"You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?"

"Oh. Never heard of that either. May come in handy to learn a little more about both."

The Grass Is Always Greener ...

"So, what did you do next to address the performance?"

"Switched to MPI. I see. Does that perform any better then?"

"Oh. You don't know. You're still debugging the code."

Going Into Pedantic Mode

"While you're waiting for your MPI debug run to finish (are you sure it doesn't hang, by the way?), please allow me to talk a little more about OpenMP and Performance."

Deep Trouble

OpenMP And Performance /1

• The transparency and ease of use of OpenMP are a mixed blessing
  → Makes things pretty easy
  → May mask performance bottlenecks
• In the ideal world, an OpenMP application "just performs well"
• Unfortunately, this is not always the case

OpenMP And Performance /2

• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing (a sketch of the latter follows below)
• Neither of these is restricted to OpenMP
  → They come with shared memory programming on modern cache based systems
  → But they might show up because you used OpenMP
  → In any case they are important enough to cover here
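False sharing in a nutshell: threads update different variables that happen to live in the same cache line, so the line bounces between caches as if the data were truly shared. A minimal sketch, assuming a 64-byte cache line (the structure and names are illustrative, not taken from the slides):

   #include <omp.h>

   #define MAX_THREADS 64
   #define CACHE_LINE  64          /* assumed cache line size in bytes */

   /* Prone to false sharing: neighbouring counters share a cache line. */
   long counter[MAX_THREADS];

   /* Remedy: pad each counter so it owns a full cache line. */
   struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
   };
   struct padded_counter padded[MAX_THREADS];

   void count_events(long nevents)
   {
      #pragma omp parallel
      {
         int me = omp_get_thread_num();    /* assumes at most MAX_THREADS threads     */
         for (long i = 0; i < nevents; i++)
            padded[me].value++;            /* counter[me]++ here would ping-pong the line */
      }
   }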

Considerations for cc-NUMA

A Generic cc-NUMA Architecture

[Diagram: two nodes, each a Processor with its own Memory, connected by a Cache Coherent Interconnect. Local access is fast; remote access is slower.]

Main Issue: How To Distribute The Data?

About Data Distribution

• Important aspect on cc-NUMA systems
  → If not optimal, longer memory access times and hotspots
• OpenMP 4.0 does provide support for cc-NUMA
  → Placement under control of the Operating System (OS)
  → User control through OMP_PLACES
• Windows, Linux and Solaris all use the "First Touch" placement policy by default
  → May be possible to override the default (check the docs)

Example First Touch Placement /1

   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: two processors and their memories on a cache coherent interconnect; a[0] : a[9999] all reside in one processor's memory.]

First Touch: all array elements are in the memory of the processor executing this thread.

Example First Touch Placement /2

   #pragma omp parallel for num_threads(2)
   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: same system; a[0] : a[4999] reside in the first processor's memory, a[5000] : a[9999] in the second.]

First Touch: both memories now have "their own half" of the array.

Get Real

"Don't Try This At Home (yet)"

The Initial Performance (35 GB)

[Chart: SPARC T5-2 Performance (SCALE 26); Billion Traversed Edges per Second (GTEPS) versus number of threads (0 to 48).]

Game over beyond 16 threads

That doesn't scale very well ...

Let's use a bigger machine!

Initial Performance (35 GB)

[Chart: SPARC T5-2 and T5-8 Performance (SCALE 26); GTEPS versus number of threads (0 to 48) for both systems.]

Oops! That can't be true ...

Let's run a larger graph!

Initial Performance (280 GB)

[Chart: SPARC T5-2 and T5-8 Performance (SCALE 29); GTEPS versus number of threads (0 to 48) for both systems.]

Let's Get Technical

Total CPU Time Distribution

[Chart: Total CPU Time percentage distribution (Baseline, SCALE 26) for 1, 2, 4, 8, 16 and 20 threads, broken down into: atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, other, make_bfs_tree and verify_bfs_tree.]


0"

10"

20"

30"

40"

50"

60"

70"

80"

0" 750" 1500" 2250" 3000" 3750" 4500" 5250" 6000" 6750" 7500" 8250" 9000"

Band

width"(G

B/s)"

Elapsed"Time"(seconds)"

SPARC"T5F2"Measured"Bandwidth"(BASE,"SCALE"28,"16"threads)""

Total" Read"(Socket"0)" Read"(Socket"1)" Write"(Socket"0)" Write"(Socket"1)"

BandwidthOfTheOriginalCode

39

Lessthanhalfofthememorybandwidthisused

40

Summary Original Version

• Communication costs are too high
  → They increase as threads are added
  → This seriously limits the number of threads used
  → This in turn affects memory access on larger graphs
• The bandwidth is not balanced
• Fixes:
  → Find and fix many OpenMP inefficiencies
  → Use some efficient atomic functions (see the sketch below)
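One way to read the "efficient atomic functions" fix, as a hedged sketch (the slides do not show the actual code, and the names below are invented): replace a critical section around a simple update with an OpenMP atomic, which protects just that one memory operation instead of serializing a whole block.

   #include <omp.h>

   /* Sketch: lightweight atomic update instead of a critical section. */
   void tally_owners(long n, const int *owner, long *visited_count)
   {
      #pragma omp parallel for
      for (long i = 0; i < n; i++) {
         /* A "#pragma omp critical" here would serialize all threads.   */
         #pragma omp atomic                  /* protects only this update */
         visited_count[owner[i]]++;
      }
   }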

Methodology

If The Code Does Not Scale Well:

• Use A Profiling Tool
• Use The Checklist To Identify Bottlenecks
• Tackle Them One By One

This Is An Incremental Approach, But Very Rewarding

Secret Sauce (BO)

Comparison Of The Two Versions

[Timeline comparison of the baseline and modified versions, showing atomic/critical section wait, implicit barrier, and idle state.]

Note the much shorter run time for the modified version

Performance Comparison

[Chart: SPARC T5-2 Performance (SCALE 29); GTEPS versus number of threads (0 to 256) for the T5-2 Baseline and T5-2 OPT1 versions.]

Peak performance is 13.7x higher

More Secret Sauce (MO)

Observations

• The Code Does Not Exploit Large Pages, But Needs It .... (see the sketch below)
• First Touch Placement Is Not Used
• Used A Smarter Memory Allocator
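A hedged sketch of the first two observations combined. The slides are about Solaris on SPARC; the example below uses the Linux mmap() flag MAP_HUGETLB purely to illustrate the idea of backing a large array with large pages, and then touches the array in parallel so that first-touch placement spreads it across the memories of the threads that will use it.

   #define _GNU_SOURCE
   #include <stddef.h>
   #include <sys/mman.h>
   #include <omp.h>

   /* Illustration only: Linux-specific large pages plus parallel first touch. */
   double *alloc_and_first_touch(size_t n)
   {
      size_t bytes = n * sizeof(double);
      double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (a == MAP_FAILED)
         return NULL;                      /* e.g. no huge pages configured */

      /* First touch in parallel: each thread touches (and thereby places)
       * the part of the array it will later work on.                      */
      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
         a[i] = 0.0;

      return a;
   }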

0"

25"

50"

75"

100"

125"

150"

0" 400" 800" 1200" 1600" 2000" 2400" 2800" 3200" 3600" 4000"

Band

width"(G

B/s)"

Elasped"Time"(seconds)"

SPARC"T5E2"Measured"Bandwidth"(OPT2,"SCALE"29,"224"threads)""

Total"BW"(GB/s)" Read"(GBs/s)" Write"(GB/s)"

BandwidthOfTheNewCode

47

MaximumReadBandwidth135GB/s

48

The Result

[Bar chart: GTEPS for SCALE 29 (282 GB), SCALE 30 (580 GB) and SCALE 31 (1150 GB), comparing T5-8 OPT0 (BO), OPT1 (BO) and OPT2 (MO). OPT0 stays around 0.03 to 0.04 GTEPS, while OPT2 reaches 1.47 to 1.73 GTEPS.]

39-52x improvement over the original code

0"

10"

20"

30"

40"

50"

60"

70"

80"

0"

100"

200"

300"

400"

500"

600"

700"

800"

1" 2" 4" 8" 16" 32" 64" 128" 256" 384" 512"

Elap

sed"Time"(m

inutes)"

Number"of"threads"

SPARC"T5E8"Graph"Search"Performance""(SCALE"30"E"Memory"Requirements:"580"GB)"

Search"Time"(minutes)" Speed"Up"

BiggerIsDefinitelyBe[er!

49

72xspeedup!

SearchDmereducedfrom12hoursto10

minutes

50

A 2.3 TB Sized Problem

[Chart: SPARC T5-8 (SCALE 32, OPT2); GTEPS versus number of threads (0 to 1024).]

896 threads!

Even starting at 32 threads, the speedup is still 11x

0"5"

10"15"20"25"30"35"40"45"50"55"

26" 27" 28" 29" 30" 31"

Speed"Up"Over"T

he"OPT

0"Ve

rsion"

SCALE"Value"

SPARC"T5D8"Speed"Up"Over"OPT0"

OPT1" OPT2"

TuningBenefitBreakdown

51

Somewhatdiminishingreturn

Biggerisbe[er

5252

Different Secret Sauce (MOBO)

A Simple OpenMP Change

[Bar chart: GTEPS for SCALE 29 (282 GB), SCALE 30 (580 GB) and SCALE 31 (1150 GB), comparing T5-8 OPT0 (BO), OPT1 (BO), OPT2 (MO) and OPT2 (MOBO). The MOBO version reaches 2.15 to 2.43 GTEPS.]

57-75x improvement

"I Value My Personal Space"

My Favorite Simple Algorithm

   void mxv(int m, int n, double *a, double *b[], double *c)
   {
      for (int i=0; i<m; i++)          /* parallel loop */
      {
         double sum = 0.0;
         for (int j=0; j<n; j++)
            sum += b[i][j]*c[j];
         a[i] = sum;
      }
   }

[Diagram: a = B * c; row index i, column index j.]

The OpenMP Source

   #pragma omp parallel for default(none) \
           shared(m,n,a,b,c)
   for (int i=0; i<m; i++)
   {
      double sum = 0.0;
      for (int j=0; j<n; j++)
         sum += b[i][j]*c[j];
      a[i] = sum;
   }

[Diagram: a = B * c; row index i, column index j.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"

PerformanceOnIntelNehalem

Maxspeedupis~1.6xonly

Waitaminute!Thisalgorithmishighly

parallel....

NotaDon:Numberofcoresxnumberofthreadswithincore

System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

58

?


Let's Get Technical

A Two Socket Nehalem System

[Diagram: two sockets connected by the QPI interconnect; each socket has four cores with two hardware threads per core, per-core caches, a shared cache, and its own memory.]

Data Initialization Revisited

   #pragma omp parallel default(none) \
           shared(m,n,a,b,c) private(i,j)
   {
      #pragma omp for
      for (j=0; j<n; j++)
         c[j] = 1.0;

      #pragma omp for
      for (i=0; i<m; i++)
      {
         a[i] = -1957.0;
         for (j=0; j<n; j++)
            b[i][j] = i;
      } /*-- End of omp for --*/
   } /*-- End of parallel region --*/

[Diagram: a = B * c; row index i, column index j.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"

DataPlacementMa[ers!

Theonlychangeisthewaythedataisdistributed

Maxspeedupis~3.3xnow

NotaDon:Numberofcoresxnumberofthreadswithincore

System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

63

A Two Socket SPARC T4-2 System

[Diagram: two sockets connected by the system interconnect; each socket has eight cores with eight hardware threads per core, per-core caches, a shared cache, and its own memory.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"

PerformanceOnSPARCT4-2

Maxspeedupis~5.8x

Scalingonlargermatricesisaffectedbycc-NUMAeffects(similarasonNehalem)

Notethattherearenoidlecyclestofillhere

NotaDon:Numberofcoresxnumberofthreadswithincore

System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

65

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"

DataPlacementMa[ers!

Maxspeedupis~11.3x

Theonlychangeisthewaythedataisdistributed

NotaDon:Numberofcoresxnumberofthreadswithincore

System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

66

Summary Matrix Times Vector

[Chart: performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale), comparing Nehalem OMP and OMP-FT (first touch) with 16 threads, and T4-2 OMP and OMP-FT with 16 threads; the first-touch versions gain roughly 1.9x and 2.1x on large footprints.]

The Wrapping

Wrapping Things Up

"While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful."

"Yes, it is overwhelming. I know."

"And OpenMP is somewhat obscure in certain areas. I know that as well."

Wrapping Things Up

"I understand. You're not a Computer Scientist and just need to get your scientific research done."

"I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there."

It Never Ends

"Oh, your MPI job just finished! Great."

"Your program does not write a file called 'core' and it wasn't there when you started the program?"

"You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first."

"WAIT! What did you just say?"

It Really Never Ends

"Somebody told you WHAT??"

"You think GPUs and OpenCL will solve all your problems?"

"Let's make that an XL Triple Espresso. I'll buy."

Thank You And ..... Stay Tuned ! ruud.vanderpas@oracle.com

Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
Santa Clara, CA, USA

DTrace: Why It Can Be Good For You

Motivation

• DTrace is a Dynamic Tracing Tool
  → Supported on Solaris, Mac OS X and some Linux flavours
• Monitors the Operating System (OS)
• Through "probes", the user can see what the OS is doing
• Main target: OS related performance issues
• Surprisingly, it can also greatly help to find out what your application is doing though *

*) A regular profiling tool should be used first

How DTrace Works

• A DTrace probe is written by the user
• When the probe "fires", the code in the probe is executed
• The probes are based on "providers"
• The providers are pre-defined
  → Example: "sched" provider to get info from the scheduler
  → You can also instrument your own code to have DTrace probes, but there is little need for that

   provider:module:function:name

Example – Thread Affinity /1

   sched:::off-cpu
   / pid == $target && self->on_cpu == 1 /     /* the "predicate" */
   {
      self->time_delta = (timestamp - ts_base)/1000;

      @thread_off_cpu [tid-1]     = count();
      @total_thr_state[probename] = count();

      printf("Event %8u %4u %6u %6u %-16s %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id,
             curcpu->cpu_lgrp, probename, probefunc);

      self->on_cpu     = 0;
      self->time_delta = 0;
   }

Example – Thread Affinity /2

   pid$target::processor_bind:entry
   / (processorid_t) arg2 >= 0 /
   {
      self->time_delta       = (timestamp - ts_base)/1000;
      self->target_processor = (processorid_t) arg2;

      @proc_bind_info[tid-1, curcpu->cpu_id, self->target_processor] = count();

      printf("Event %8u %4u %6u %6u %9s/%-6d %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id,
             curcpu->cpu_lgrp, "proc_bind",
             self->target_processor, probename);

      self->time_delta       = 0;
      self->target_processor = 0;
   }

Example – Example Code

   #pragma omp parallel
   {
      #pragma omp single
      {
         printf("Single region executed by thread %d\n",
                omp_get_thread_num());
      } // End of single region
   }    // End of parallel region

   export OMP_NUM_THREADS=4
   export OMP_PLACES=cores
   export OMP_PROC_BIND=close
   ./affinity.d -c ./omp-par.exe

Example – Result

   ===========================================================
                      Affinity Statistics
   ===========================================================

   Thread   On HW Thread   Lgroup   Created Thread   Count
        0            787        7                1       1
        0            787        7                2       1
        0            787        7                3       1

   Thread   Running on HW Thread   Bound to HW Thread
        0                    787                  784
        1                    771                  792
        2                    848                  800
        3                    813                  808

Thank You And ..... Stay Tuned ! ruud.vanderpas@oracle.com
