“OpenMP Does Not Scale”

Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
Santa Clara, CA, USA


Agenda

• The Myth
• Deep Trouble
• Get Real
• The Wrapping

The Myth

“A myth, in popular use, is something that is widely believed but false.” ......

(source: www.reference.com)

"OpenMP Does Not Scale": A Common Myth

A Programming Model Can Not "Not Scale"

What Can Not Scale:
• The Implementation
• The System Versus The Resource Requirements
• Or ..... You

Hmmm .... What Does That Really Mean?

Some Questions I Could Ask

"Do you mean you wrote a parallel program, using OpenMP, and it doesn't perform?"

"I see. Did you make sure the program was fairly well optimized in sequential mode?"

Some Questions I Could Ask

"Oh. You just think it should, and you used all the cores. Have you estimated the speedup using Amdahl's Law?"

"Oh. You didn't. By the way, why do you expect the program to scale?"

"No, this law is not a new EU financial bailout plan. It is something else."
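For reference, Amdahl's Law is the back-of-the-envelope check being asked for here: with a parallel fraction p of the run time and N threads,

   S(N) = \frac{1}{(1 - p) + p/N}

so, for example, p = 0.9 on 16 threads gives S ≈ 6.4, and no number of threads can push the speedup past 1/(1 - p) = 10.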

Some Questions I Could Ask

"I understand. You can't know everything. Have you at least used a tool to identify the most time consuming parts in your program?"

"Oh. You didn't. Did you minimize the number of parallel regions then?"

"Oh. You didn't. It just worked fine the way it was."

"Oh. You didn't. You just parallelized all loops in the program. Did you try to avoid parallelizing innermost loops in a loop nest?"
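To make the last question concrete, here is a minimal sketch (the function and array names are made up for illustration): parallelizing the outer loop of a nest creates one parallel region and hands each thread whole rows, whereas parallelizing the innermost loop forks and joins a thread team on every single outer iteration.

   #include <omp.h>

   /* Illustrative sketch only; m, n and the arrays are assumed to be set up elsewhere. */
   void update_bad(int m, int n, double **a, double **b, double *c)
   {
      for (int i = 0; i < m; i++) {        /* outer loop stays serial ...           */
         #pragma omp parallel for          /* ... so a parallel region is forked    */
         for (int j = 0; j < n; j++)       /*     and joined m times                */
            a[i][j] = b[i][j] + c[j];
      }
   }

   void update_good(int m, int n, double **a, double **b, double *c)
   {
      #pragma omp parallel for             /* one fork/join; threads get whole rows */
      for (int i = 0; i < m; i++)
         for (int j = 0; j < n; j++)
            a[i][j] = b[i][j] + c[j];
   }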

More Questions I Could Ask

"Did you at least use the nowait clause to minimize the use of barriers?"

"Oh. You've never heard of a barrier. Might be worth reading up on."

"Do all threads roughly perform the same amount of work?"

"You don't know, but think it is okay. I hope you're right."
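A sketch of what the nowait question is about (the function and arrays are illustrative): every omp for ends in an implicit barrier, and nowait removes it when the loops work on independent data.

   #include <omp.h>

   /* Sketch: two independent loops inside one parallel region. */
   void scale_two_arrays(int n, double *a, double *b)
   {
      #pragma omp parallel
      {
         #pragma omp for nowait      /* no barrier here: a[] and b[] are independent */
         for (int i = 0; i < n; i++)
            a[i] *= 2.0;

         #pragma omp for             /* implicit barrier at the end of this loop     */
         for (int i = 0; i < n; i++)
            b[i] *= 2.0;
      }
   }

For the load-balance question on the same slide, a schedule(dynamic) or schedule(guided) clause on the work-sharing loop is the usual first thing to try when iterations vary in cost.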

I Don't Give Up That Easily

"Did you make optimal use of private data, or did you share most of it?"

"Oh. You didn't. Sharing is just easier. I see."
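A sketch of the private-data point (names are illustrative): a per-thread temporary declared private lives in each thread's own storage instead of being a single shared variable that all threads fight over.

   #include <omp.h>

   /* Sketch: keep per-thread temporaries private instead of shared. */
   void sum_rows(int m, int n, double **b, double *row_sum)
   {
      double tmp;                               /* would be shared by default   */
      #pragma omp parallel for private(tmp)     /* each thread gets its own tmp */
      for (int i = 0; i < m; i++) {
         tmp = 0.0;
         for (int j = 0; j < n; j++)
            tmp += b[i][j];
         row_sum[i] = tmp;
      }
   }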

I Don't Give Up That Easily

"You seem to be using a cc-NUMA system. Did you take that into account?"

"You've never heard of that either. How unfortunate. Could there perhaps be any false sharing affecting performance?"

"Oh. Never heard of that either. May come in handy to learn a little more about both."

The Grass Is Always Greener ...

"So, what did you do next to address the performance?"

"Switched to MPI. I see. Does that perform any better then?"

"Oh. You don't know. You're still debugging the code."

Going Into Pedantic Mode

"While you're waiting for your MPI debug run to finish (are you sure it doesn't hang, by the way?), please allow me to talk a little more about OpenMP and Performance."

Deep Trouble

OpenMP And Performance /1

• The transparency and ease of use of OpenMP are a mixed blessing
  → Makes things pretty easy
  → May mask performance bottlenecks
• In the ideal world, an OpenMP application "just performs well"
• Unfortunately, this is not always the case

OpenMP And Performance /2

• Two of the more obscure things that can negatively impact performance are cc-NUMA effects and False Sharing (a sketch of the latter follows below)
• Neither of these is restricted to OpenMP
  → They come with shared memory programming on modern cache based systems
  → But they might show up because you used OpenMP
  → In any case they are important enough to cover here
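False sharing in a nutshell: threads update different variables that happen to live in the same cache line, so the line bounces between caches as if the data were truly shared. A minimal sketch, assuming a 64-byte cache line (the structure and names are illustrative, not taken from the slides):

   #include <omp.h>

   #define MAX_THREADS 64
   #define CACHE_LINE  64          /* assumed cache line size in bytes */

   /* Prone to false sharing: neighbouring counters share a cache line. */
   long counter[MAX_THREADS];

   /* Remedy: pad each counter so it owns a full cache line. */
   struct padded_counter {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
   };
   struct padded_counter padded[MAX_THREADS];

   void count_events(long nevents)
   {
      #pragma omp parallel
      {
         int me = omp_get_thread_num();    /* assumes at most MAX_THREADS threads     */
         for (long i = 0; i < nevents; i++)
            padded[me].value++;            /* counter[me]++ here would ping-pong the line */
      }
   }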

Considerations for cc-NUMA

A Generic cc-NUMA Architecture

[Diagram: two nodes, each a Processor with its own Memory, connected by a Cache Coherent Interconnect. Local access is fast; remote access is slower.]

Main Issue: How To Distribute The Data?

About Data Distribution

• Important aspect on cc-NUMA systems
  → If not optimal, longer memory access times and hotspots
• OpenMP 4.0 does provide support for cc-NUMA
  → Placement under control of the Operating System (OS)
  → User control through OMP_PLACES
• Windows, Linux and Solaris all use the "First Touch" placement policy by default
  → May be possible to override the default (check the docs)

Example First Touch Placement /1

   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: two processors and their memories on a cache coherent interconnect; a[0] : a[9999] all reside in one processor's memory.]

First Touch: all array elements are in the memory of the processor executing this thread.

Example First Touch Placement /2

   #pragma omp parallel for num_threads(2)
   for (i=0; i<10000; i++)
      a[i] = 0;

[Diagram: same system; a[0] : a[4999] reside in the first processor's memory, a[5000] : a[9999] in the second.]

First Touch: both memories now have "their own half" of the array.

Get Real

"Don't Try This At Home (yet)"

The Initial Performance (35 GB)

[Chart: SPARC T5-2 Performance (SCALE 26); Billion Traversed Edges per Second (GTEPS) versus number of threads (0 to 48).]

Game over beyond 16 threads

That doesn't scale very well ...

Let's use a bigger machine!

Initial Performance (35 GB)

[Chart: SPARC T5-2 and T5-8 Performance (SCALE 26); GTEPS versus number of threads (0 to 48) for both systems.]

Oops! That can't be true ...

Let's run a larger graph!

Initial Performance (280 GB)

[Chart: SPARC T5-2 and T5-8 Performance (SCALE 29); GTEPS versus number of threads (0 to 48) for both systems.]

Let's Get Technical

Total CPU Time Distribution

[Chart: Total CPU Time percentage distribution (Baseline, SCALE 26) for 1, 2, 4, 8, 16 and 20 threads, broken down into: atomic operations, OMP-atomic_wait, OMP-critical_section_wait, OMP-implicit_barrier, other, make_bfs_tree and verify_bfs_tree.]


0"

10"

20"

30"

40"

50"

60"

70"

80"

0" 750" 1500" 2250" 3000" 3750" 4500" 5250" 6000" 6750" 7500" 8250" 9000"

Band

width"(G

B/s)"

Elapsed"Time"(seconds)"

SPARC"T5F2"Measured"Bandwidth"(BASE,"SCALE"28,"16"threads)""

Total" Read"(Socket"0)" Read"(Socket"1)" Write"(Socket"0)" Write"(Socket"1)"

BandwidthOfTheOriginalCode

39

Lessthanhalfofthememorybandwidthisused

40

Summary Original Version

• Communication costs are too high
  → They increase as threads are added
  → This seriously limits the number of threads used
  → This in turn affects memory access on larger graphs
• The bandwidth is not balanced
• Fixes:
  → Find and fix many OpenMP inefficiencies
  → Use some efficient atomic functions (see the sketch below)
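One way to read the "efficient atomic functions" fix, as a hedged sketch (the slides do not show the actual code, and the names below are invented): replace a critical section around a simple update with an OpenMP atomic, which protects just that one memory operation instead of serializing a whole block.

   #include <omp.h>

   /* Sketch: lightweight atomic update instead of a critical section. */
   void tally_owners(long n, const int *owner, long *visited_count)
   {
      #pragma omp parallel for
      for (long i = 0; i < n; i++) {
         /* A "#pragma omp critical" here would serialize all threads.   */
         #pragma omp atomic                  /* protects only this update */
         visited_count[owner[i]]++;
      }
   }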

Methodology

If The Code Does Not Scale Well:

• Use A Profiling Tool
• Use The Checklist To Identify Bottlenecks
• Tackle Them One By One

This Is An Incremental Approach, But Very Rewarding

Secret Sauce (BO)

Comparison Of The Two Versions

[Timeline comparison of the baseline and modified versions, showing atomic/critical section wait, implicit barrier, and idle state.]

Note the much shorter run time for the modified version

Performance Comparison

[Chart: SPARC T5-2 Performance (SCALE 29); GTEPS versus number of threads (0 to 256) for the T5-2 Baseline and T5-2 OPT1 versions.]

Peak performance is 13.7x higher

More Secret Sauce (MO)

Observations

• The Code Does Not Exploit Large Pages, But Needs It .... (see the sketch below)
• First Touch Placement Is Not Used
• Used A Smarter Memory Allocator
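A hedged sketch of the first two observations combined. The slides are about Solaris on SPARC; the example below uses the Linux mmap() flag MAP_HUGETLB purely to illustrate the idea of backing a large array with large pages, and then touches the array in parallel so that first-touch placement spreads it across the memories of the threads that will use it.

   #define _GNU_SOURCE
   #include <stddef.h>
   #include <sys/mman.h>
   #include <omp.h>

   /* Illustration only: Linux-specific large pages plus parallel first touch. */
   double *alloc_and_first_touch(size_t n)
   {
      size_t bytes = n * sizeof(double);
      double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (a == MAP_FAILED)
         return NULL;                      /* e.g. no huge pages configured */

      /* First touch in parallel: each thread touches (and thereby places)
       * the part of the array it will later work on.                      */
      #pragma omp parallel for schedule(static)
      for (size_t i = 0; i < n; i++)
         a[i] = 0.0;

      return a;
   }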

0"

25"

50"

75"

100"

125"

150"

0" 400" 800" 1200" 1600" 2000" 2400" 2800" 3200" 3600" 4000"

Band

width"(G

B/s)"

Elasped"Time"(seconds)"

SPARC"T5E2"Measured"Bandwidth"(OPT2,"SCALE"29,"224"threads)""

Total"BW"(GB/s)" Read"(GBs/s)" Write"(GB/s)"

BandwidthOfTheNewCode

47

MaximumReadBandwidth135GB/s

48

The Result

[Bar chart: GTEPS for SCALE 29 (282 GB), SCALE 30 (580 GB) and SCALE 31 (1150 GB), comparing T5-8 OPT0 (BO), OPT1 (BO) and OPT2 (MO). OPT0 stays around 0.03 to 0.04 GTEPS, while OPT2 reaches 1.47 to 1.73 GTEPS.]

39-52x improvement over the original code

0"

10"

20"

30"

40"

50"

60"

70"

80"

0"

100"

200"

300"

400"

500"

600"

700"

800"

1" 2" 4" 8" 16" 32" 64" 128" 256" 384" 512"

Elap

sed"Time"(m

inutes)"

Number"of"threads"

SPARC"T5E8"Graph"Search"Performance""(SCALE"30"E"Memory"Requirements:"580"GB)"

Search"Time"(minutes)" Speed"Up"

BiggerIsDefinitelyBe[er!

49

72xspeedup!

SearchDmereducedfrom12hoursto10

minutes

50

A 2.3 TB Sized Problem

[Chart: SPARC T5-8 (SCALE 32, OPT2); GTEPS versus number of threads (0 to 1024).]

896 threads!

Even starting at 32 threads, the speedup is still 11x

0"5"

10"15"20"25"30"35"40"45"50"55"

26" 27" 28" 29" 30" 31"

Speed"Up"Over"T

he"OPT

0"Ve

rsion"

SCALE"Value"

SPARC"T5D8"Speed"Up"Over"OPT0"

OPT1" OPT2"

TuningBenefitBreakdown

51

Somewhatdiminishingreturn

Biggerisbe[er

5252

Different Secret Sauce (MOBO)

A Simple OpenMP Change

[Bar chart: GTEPS for SCALE 29 (282 GB), SCALE 30 (580 GB) and SCALE 31 (1150 GB), comparing T5-8 OPT0 (BO), OPT1 (BO), OPT2 (MO) and OPT2 (MOBO). The MOBO version reaches 2.15 to 2.43 GTEPS.]

57-75x improvement

"I Value My Personal Space"

My Favorite Simple Algorithm

   void mxv(int m, int n, double *a, double *b[], double *c)
   {
      for (int i=0; i<m; i++)          /* parallel loop */
      {
         double sum = 0.0;
         for (int j=0; j<n; j++)
            sum += b[i][j]*c[j];
         a[i] = sum;
      }
   }

[Diagram: a = B * c; row index i, column index j.]

The OpenMP Source

   #pragma omp parallel for default(none) \
           shared(m,n,a,b,c)
   for (int i=0; i<m; i++)
   {
      double sum = 0.0;
      for (int j=0; j<n; j++)
         sum += b[i][j]*c[j];
      a[i] = sum;
   }

[Diagram: a = B * c; row index i, column index j.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"

PerformanceOnIntelNehalem

Maxspeedupis~1.6xonly

Waitaminute!Thisalgorithmishighly

parallel....

NotaDon:Numberofcoresxnumberofthreadswithincore

System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

58

?


Let's Get Technical

A Two Socket Nehalem System

[Diagram: two sockets connected by the QPI interconnect; each socket has four cores with two hardware threads per core, per-core caches, a shared cache, and its own memory.]

Data Initialization Revisited

   #pragma omp parallel default(none) \
           shared(m,n,a,b,c) private(i,j)
   {
      #pragma omp for
      for (j=0; j<n; j++)
         c[j] = 1.0;

      #pragma omp for
      for (i=0; i<m; i++)
      {
         a[i] = -1957.0;
         for (j=0; j<n; j++)
            b[i][j] = i;
      } /*-- End of omp for --*/
   } /*-- End of parallel region --*/

[Diagram: a = B * c; row index i, column index j.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"thread" 2x1"threads" 4x1"threads" 8x1"threads" 8x2"threads"

DataPlacementMa[ers!

Theonlychangeisthewaythedataisdistributed

Maxspeedupis~3.3xnow

NotaDon:Numberofcoresxnumberofthreadswithincore

System: Intel X5570 with 2 sockets, 8 cores, 16 threads at 2.93 GHz

63

A Two Socket SPARC T4-2 System

[Diagram: two sockets connected by the system interconnect; each socket has eight cores with eight hardware threads per core, per-core caches, a shared cache, and its own memory.]

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"

PerformanceOnSPARCT4-2

Maxspeedupis~5.8x

Scalingonlargermatricesisaffectedbycc-NUMAeffects(similarasonNehalem)

Notethattherearenoidlecyclestofillhere

NotaDon:Numberofcoresxnumberofthreadswithincore

System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

65

0"

5,000"

10,000"

15,000"

20,000"

25,000"

30,000"

35,000"

0.25" 1" 4" 16" 64" 256" 1024" 4096" 16384" 65536" 262144"

Performan

ce"in"M

flop/s"

Memory"Footprint"(KByte,"logD2"scale)"

1x1"Thread" 8x1"Threads" 16x1"Threads" 16x2"Threads"

DataPlacementMa[ers!

Maxspeedupis~11.3x

Theonlychangeisthewaythedataisdistributed

NotaDon:Numberofcoresxnumberofthreadswithincore

System: SPARC T4 with 2 sockets, 16 cores, 128 threads at 2.85 GHz

66

Summary Matrix Times Vector

[Chart: performance in Mflop/s (0 to 35,000) versus memory footprint (KByte, log-2 scale), comparing Nehalem OMP and OMP-FT (first touch) with 16 threads, and T4-2 OMP and OMP-FT with 16 threads; the first-touch versions gain roughly 1.9x and 2.1x on large footprints.]

The Wrapping

Wrapping Things Up

"While we're still waiting for your MPI debug run to finish, I want to ask you whether you found my information useful."

"Yes, it is overwhelming. I know."

"And OpenMP is somewhat obscure in certain areas. I know that as well."

Wrapping Things Up

"I understand. You're not a Computer Scientist and just need to get your scientific research done."

"I agree this is not a good situation, but it is all about Darwin, you know. I'm sorry, it is a tough world out there."

It Never Ends

"Oh, your MPI job just finished! Great."

"Your program does not write a file called 'core' and it wasn't there when you started the program?"

"You wonder where such a file comes from? Let's talk, but I need to get a big and strong coffee first."

"WAIT! What did you just say?"

It Really Never Ends

"Somebody told you WHAT??"

"You think GPUs and OpenCL will solve all your problems?"

"Let's make that an XL Triple Espresso. I'll buy."

Thank You And ..... Stay Tuned ! ruud.vanderpas@oracle.com

Ruud van der Pas, Distinguished Engineer
Architecture and Performance, SPARC Microelectronics, Oracle
Santa Clara, CA, USA

DTrace: Why It Can Be Good For You

Motivation

• DTrace is a Dynamic Tracing Tool
  → Supported on Solaris, Mac OS X and some Linux flavours
• Monitors the Operating System (OS)
• Through "probes", the user can see what the OS is doing
• Main target: OS related performance issues
• Surprisingly, it can also greatly help to find out what your application is doing though *

*) A regular profiling tool should be used first

How DTrace Works

• A DTrace probe is written by the user
• When the probe "fires", the code in the probe is executed
• The probes are based on "providers"
• The providers are pre-defined
  → Example: "sched" provider to get info from the scheduler
  → You can also instrument your own code to have DTrace probes, but there is little need for that

   provider:module:function:name

Example – Thread Affinity /1

   sched:::off-cpu
   / pid == $target && self->on_cpu == 1 /     /* the "predicate" */
   {
      self->time_delta = (timestamp - ts_base)/1000;

      @thread_off_cpu [tid-1]     = count();
      @total_thr_state[probename] = count();

      printf("Event %8u %4u %6u %6u %-16s %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id,
             curcpu->cpu_lgrp, probename, probefunc);

      self->on_cpu     = 0;
      self->time_delta = 0;
   }

Example – Thread Affinity /2

   pid$target::processor_bind:entry
   / (processorid_t) arg2 >= 0 /
   {
      self->time_delta       = (timestamp - ts_base)/1000;
      self->target_processor = (processorid_t) arg2;

      @proc_bind_info[tid-1, curcpu->cpu_id, self->target_processor] = count();

      printf("Event %8u %4u %6u %6u %9s/%-6d %8s\n",
             self->time_delta, tid-1, curcpu->cpu_id,
             curcpu->cpu_lgrp, "proc_bind",
             self->target_processor, probename);

      self->time_delta       = 0;
      self->target_processor = 0;
   }

Example – Example Code

   #pragma omp parallel
   {
      #pragma omp single
      {
         printf("Single region executed by thread %d\n",
                omp_get_thread_num());
      } // End of single region
   }    // End of parallel region

   export OMP_NUM_THREADS=4
   export OMP_PLACES=cores
   export OMP_PROC_BIND=close
   ./affinity.d -c ./omp-par.exe

Example – Result

   ===========================================================
                      Affinity Statistics
   ===========================================================

   Thread   On HW Thread   Lgroup   Created Thread   Count
        0            787        7                1       1
        0            787        7                2       1
        0            787        7                3       1

   Thread   Running on HW Thread   Bound to HW Thread
        0                    787                  784
        1                    771                  792
        2                    848                  800
        3                    813                  808

Thank You And ..... Stay Tuned ! ruud.vanderpas@oracle.com
