Argobots and its Application to Charm++ · 1.0 1.5 2.0 2.5 3.0 3.5 1 3 5 7 9 11 13 15 17 19 21 23...

Preview:

Citation preview

SangminSeo

AssistantComputerScientistArgonneNationalLaboratory

sseo@anl.gov

April19,2016

Argobots and its Application to Charm++

Charm++Workshop2016

Argo Concurrency Team

• ArgonneNationalLaboratory(ANL)– Pavan Balaji (co-lead)– SangminSeo– AbdelhalimAmer– MarcSnir– PeteBeckman(PI)

• UniversityofIllinoisatUrbana-Champaign(UIUC)– Laxmikant Kale(co-lead)– Prateek Jindal– JonathanLifflander

• UniversityofTennessee,Knoxville(UTK)– GeorgeBosilca– ThomasHerault– DamienGenet

• PacificNorthwestNationalLaboratory(PNNL)– Sriram Krishnamoorthy

PastTeamMembers:• CyrilBordage (UIUC)• EstebanMeneses

(UniversityofPittsburgh)• Huiwei Lu(ANL)• Yanhua Sun (UIUC)

Charm++Workshop2016 2

Massive On-node Parallelism

• Thenumberofcoresisincreasing• Massiveon-nodeparallelismisinevitable• Existingsolutionsdonoteffectivelydealwithsuchparallelismwith

respecttoon-nodethreading/taskingsystemsorwithrespecttooff-nodecommunicationinthepresenceofsuchtasks/threads

• Howtoexploit?

core

Core-levelParallelism3Charm++Workshop2016

Shortcomings today? Pthreads (1/2)

Nesting

int in[1000][1000], out[1000][1000];

#pragma omp parallel for

for (i = 0; i < 1000; i++) {

petsc_voodoo(i);

}

petsc_voodoo(int x)

{

#pragma omp parallel for

for (j = 0; j < 1000; j++)

out[x][j] = cosine(in[x][j]);

}

Executiontimefor36threadsintheouterloop

WhyistraditionalOpenMP’s performancesobad?Thecompilercannotanalyzepetsc_voodoo toknowwhetherthefunctionmighteverblockoryield,so ithastoassumethatitmight.Thereforeastackisneededtofacilitateit.CreatingadditionalPthreads foreachnestingisthesimplestwaytoachievethis.

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35

Time(s)

#OMPThreads|ArgobotsULTs/tasks(innerloop)

GCC/pthreads GCC/ArgobotsULTs GCC/Argobotstasks

Lowerisbetter

Charm++Workshop2016 4

Shortcomings today? Pthreads (2/2)

TasksofapplicationmappedtoagroupofPthreads

Needlightweightmechanismstoswitchtasks!computation

communication

C

Pthreads

C C C

map&schedule

Howaboutthesecommunications?Waitorcontextswitch?

Workunitsintermixedwithblockingcalls (suchascommunication calls)cancauseidlecores

Charm++Workshop2016 5

Outline

• Background• Argobots• Charm++withArgobots• OtherProgrammingModels• Summary

Charm++Workshop2016 6

User-Level Threads (ULTs)

• Whatisuser-levelthread(ULT)?– Providesthreadsemanticsinuserspace– Executionmodel:cooperativetimesharing

• MorethanoneULTcanbemappedtoasinglekernelthread

• ULTsonthesameOSthreaddonotexecuteconcurrently• Wheretouse?

– Tobetteroverlapcomputationandcommunication/IO– Toexploitfine-grainedtaskparallelism

timeline

Context switch

Context switch

ULT1

ULT2

Core Core Core Core Core Core Core Core

ULTs :

Kernel threads :

Charm++Workshop2016 7

Pthreads vs. ULTs

• Averagetimeforcreatingandjoiningonethread• pthread:6.6us- 21.2us(avg.34,953cycles)• ULT(Argobots):78ns- 130ns(avg.191cycles)• ULTis64x- 233xfaster thanPthread

– HowfastisULT?• L1$access:1.112ns,L2$access:5.648ns,memory access:18.4ns• Context switch (2processes):1.64us

1

10

100

1000

10000

100000

1 2 4 8 16 32 64 128 256 512 1024 2048

Avg.Create&JoinTime/thread

(ns)

NumberofThreads

pthread ULT(Argobots)

*measuredusingLMbench3

Charm++Workshop2016 8

Growing Interests in ULTs

• ULTandtasklibraries– Conversethreads,Qthreads,MassiveThreads,Nanos++,

Maestro,GnuPth,StackThreads/MP,Protothreads,Capriccio,StateThreads,TiNy-threads,etc.

• OSsupports– Windowsfibers,Solaristhreads

• Languageandprogrammingmodels– Cilk,OpenMP task,C++11task,C++17coroutineproposal,

Stackless Python,Gocoroutines,etc.

• Pros– EasytousewithPthreads-likeinterface

• Cons– Runtimetriestodosomethingsmart(e.g.,work-stealing)– Thismayconflictwiththecharacteristicsanddemandsof

applications

Charm++Workshop2016 9

Argobots

Overview• Separationofmechanismsandpolicies• Massiveparallelism

– Exec.Streams guaranteeprogress– WorkUnits executetocompletion

• User-levelthreads(ULTs)vs.Tasklet• Clearlydefinedmemorysemantics

– Consistencydomains• ProvideEventualConsistency

– Softwarecanmanageconsistency

Argobots Innovations• Enablingtechnology,butnotapolicymaker

– High-levellanguages/librariessuchasOpenMP,Charm++havemoreinformationabouttheuserapplication(datalocality,dependencies)

• Explicitmodel:– Enablesdynamism,butalwaysmanaged

byhigh-levelsystems

Argobots

coreProcessor

Programming Models(MPI, OpenMP, Charm++, PaRSEC, …)

U User-Level Thread T TaskletLightweightWork Units

Exe

cutio

n S

tream

Private pool Private poolShared pool

U U

U T

TTU TU

Exe

cutio

n S

tream

Exe

cutio

n S

tream

Alow-levellightweightthreadingandtaskingframework(http://collab.cels.anl.gov/display/argobots/)

*Teammembers:Sangmin Seo,Abdelhalim Amer,Pavan Balaji (ANL),Laxmikant Kale,Prateek Jindal(UIUC)

Charm++Workshop2016 10

Argobots Execution Model

• ExecutionStreams(ES)– Sequentialinstructionstream

• Canconsistofoneormoreworkunits– Mappedefficientlytoahardware

resource– Implicitlymanagedprogresssemantics

• OneblockedEScannotblockotherESs

• User-levelThreads(ULTs)– Independentexecutionunitsinuser

space– AssociatedwithanESwhenrunning– Yieldableandmigratable– Canmakeblockingcalls

• Tasklets– Atomicunitsofwork– Asynchronouscompletionvia

notifications– Notyieldable,migratablebefore

execution– Cannotmakeblockingcalls

S

Scheduler Pool

U

ULT

T

Tasklet

E

Event

ES1 Sched

U

U

E

E

E

E

U

S

S

T

T

T

T

T

Argobots Execution Model

...

ESn

• Scheduler– Stackableschedulerwithpluggable

strategies• Synchronizationprimitives

– Mutex,conditionvariable,barrier, future• Events

– Communicationtriggers

Charm++Workshop2016 11

Explicit Mapping ULT/Tasklet to ES

• TheuserneedstomapworkunitstoESs• Nosmartscheduling,nowork-stealingunlesstheuserwants

touse

ES1

U0

U1

T1

T2

U2

U3

ES2

U4

U5

• Benefits– Allowlocalityoptimization

• ExecuteworkunitsonthesameES

– NoexpensivelockisneededbetweenULTsonthesameES

• Theydonotrunconcurrently• Aflagisenough

Charm++Workshop2016 12

Stackable Scheduler with Pluggable Strategies

• AssociatedwithanES• CanhandleULTsandtasklets• Canhandleschedulers

– Allowstostackschedulershierarchically• Canhandleasynchronousevents• Userscanwriteschedulers

– Providesmechanisms,notpolicies– Replacethedefaultscheduler

• E.g.,FIFO,LIFO,PriorityQueue,etc.• ULTcanexplicitlyyieldto anotherULT

– Avoidscheduleroverhead

Sched

U

U

E

E

E

E

U

S

S

T

T

T

T

T

U S U U U

yield() yield_to(target)

Charm++Workshop2016 13

Performance: Create/Join Time

• Idealscalability– IftheULTruntimeisperfectlyscalable,thetimeshouldbethesame

regardlessofthenumberofESs

10

100

1000

10000

1 2 4 8 16 24 32 36 40 48 56 64 72

Create/Jo

inTimeperU

LT(cycles)

NumberofExecutionStreams(Workers)

Qthreads MassiveThreads (H) MassiveThreads (W)

Argobots(ULT) Argobots(Tasklet)

Charm++Workshop2016 14

JonathanLifflander,Prateek Jindal,Yanhua SunLaxmikant Kale

UniversityofIllinoisatUrbana-Champaign(UIUC)

Charm++ with Argobots

15Charm++Workshop2016

Charm++ with Argobots

• Goals– TestthecompletenessandperformanceofArgobotswith

Charm++programmingmodel– TakeadvantageofArgobots features(tasklets,stackable

schedulers,etc.)withoutmodifyingapplicationcodes– ForCharm++applications,interoperatewithapplications

writteninothermodels(MPI,Cilk,etc.)

16

Mini-apps and real world applications

Charm++ model

Converse runtime(threading, messaging, scheduler)

Communication libraries (MPI, uGNI, PAMI, Verbs)

Intelligent runtime

Argobots(ULTs, Tasks, scheduling, etc.)

Charm++ infrastructure Charm++ with Argobots *Teammembers:LaxmikantKale,JonathanLifflander, PrateekJindal (UIUC)

Charm++Workshop2016

Replacing the Converse Runtime with Argobots

• Converse– TheactivemessaginglayerinCharm++

• Approaches– EachCharm++Pthread insideanode(includingthecommunication

thread)isimplementedasanArgobots ES• CreateanESforeveryConverseinstance

– AcustomArgobots scheduleriscreatedinsteadofusingtheConversescheduler

– Conversemessagesareenqueued intoArgobots poolsastasklets– Conversethreads(CthThread)areimplementedontopofArgobots

ULTs,withconditionalvariablestoimplementsuspend/resume

• Only180linesofcodehadtobechanged!

Charm++Workshop2016 17

Converse runtime(threading, messaging, scheduler)

Argobots(ULTs, Tasks, scheduling, etc.)

LeanMD Performance: Runtime Comparison

Charm++Workshop2016 18

• Evaluationmachine– 2xIntelXeon E5-2699v3(2.30GHz):36cores (72threads)

• LeanMD simulation– Atotalof20stepsonacellarrayofdimensions7x7x7– 1-awayXYZconfigurationand1000atomspercell

Achievedcomparableperformancealthough itisaverysimpleimplementation

LeanMD Performance: Manual Implementation

Charm++Workshop2016 19

• Manualimplementation ofLeanMD usingtheArgobots– ExploitedbothULTsandtasklets

• AULTformanagingacellandatasklet formanagingtheinteractionbetweencells– Usedfuturesforthewaitingmechanism– Workstealingbetweenpools

• Betterperformanceofourmanualimplementation implies thatCharm++withArgobots couldbeimproved

Argobots: Interfaces for Shrink/Expand Events

20

ES0

Sched

ES1

Sched

ES2

Sched NRM

ESn-1

Sched...

E

E

E

Argobots

socket

Charm++ CilkBot PaRSECMPI+Argobots

programmingmodelruntimesandapplications

callbackfunctions

1. [Argobots]ConnecttoNRMusingasocketonABT_init()2. [Runtimes/applications]Registercallbackfunctionsforshrink/expandevents3. [Runtimes/applications]Deregistercallbackfunctionswhentheyterminate4. [Argobots]DisconnectfromNRMonABT_finalize()

ABT_ENV_POWER_EVENT_HOSTNAMEABT_ENV_POWER_EVENT_PORT

ABT_event_add_callback()ABT_event_del_callback()

Charm++Workshop2016

Argobots: Shrink/Expand Event Handling

• Shrinking

21

ES0

Sched

ES1

Sched

ES2

Sched

E E E

prog.model runtime/application

1. ES1 picksanevent,whichrequestsES2 tobestopped2. AsktheruntimeusingcallbackswhetherES2 canbestopped3. IfOK,markES2 toneedtostopsowhenthescheduleronES2 checksevents,

itcanbestopped4. NotifytheruntimethatES2 willbestopped5. CreateaULTonES06. WhenthescheduleronES2 stops,ES2 isterminated7. AfterES2 isterminated,theULTfreesES2 andsendsaresponsetoNRM

- Anyscheduler onanyEScancheckandhandle events- ES0 cannotbestopped

U

12

3

4

5NRM

7

6

Charm++Workshop2016

Argobots: Shrink/Expand Event Handling

• Expanding

22

ES0

Sched

ES1

Sched

ES2

Sched

E E E

prog.model runtime/application

1. ES0 picksanevent,whichrequeststocreateanES22. AsktheruntimeusingcallbackswhetheritcancreateES23. IfOK,invokeacallbackfunctionsotheruntimecreatesES24. CreateES25. SendaresponsetoNRM

1 2 3

5NRM

4

Charm++Workshop2016

Charm++ with Argobots: Implementation

• Shrink/ExpandImplementation– Charm++maintainsasetofpoolsforeachscheduler,mappedtoanES

– RanksinCharm++arevirtualizedbysavingthemappingofpooltoanES

– WhenanESisremoved,theassociatedpoolsareputintoagloballist

• TomaintaincorrectnessinCharm++,therankofanytasks/threadsinthegloballistarederivedfromthepool(ranksarevirtualized)

– Shrink• OtherESsexecuteworkunitsfromorphanedpoolswithsomeaddedsynchronization

– Expand• AnewESiscreatedandtakesoverasetoforphanedpools

23Charm++Workshop2016

LeanMD Results of Shrinking/Expanding ESs

Charm++Workshop2016 24

Argobots Ecosystem

ES1 Sched

U

U

E

E

E

E

U

S

S

T

T

T

T

T

Argobots

...

ESn

MPI+Argobots

ULT

ES

ULT

ES

MPI

Argobots runtime

Communication libraries

Charm++

Applications

Charm++Cilk “Worker”

ArgobotsES

RWSULT

FusedULT1

FusedULT2

FusedULTN

…CilkBots

PO

GE

TR

SYTR TR

PO GE GE

TR TR

SYSY GE

PO

TR

SY

PO

SY SY

PaRSEC

OpenMP MercuryRPC

Origin

Target

RPCproc

RPCproc

OmpSs

GridFTP,Kokkos,RAJA,ROSE,TASCEL,XMP,etc.ExternalConnections

Charm++Workshop2016 25

Summary

• Massiveon-nodeparallelismisinevitable– Needruntimesystemsutilizingsuchparallelism

• Argobots– Alightweightlow-levelthreading/taskingframework– Providesefficientmechanisms,notpolicies,tousers(library

developersorcompilers)• Theycanbuildtheirownsolutions

• Charm++withArgobots– ImplementedbyreplacingtheConverseruntimewithArgobots– AchievedcomparableperformanceonLeanMD– Incorporatedtheshrinking/expandingofESsinArgobots in

ordertorespondtotheexternalevents(e.g.,powercapping)

Charm++Workshop2016 26

Try Argobots

• git repository• http://git.mcs.anl.gov/argo/argobots.git/

• Documentation– Wiki

• https://collab.cels.anl.gov/display/ARGOBOTS/

– Doxygen• http://www.mcs.anl.gov/~sseo/public/argobots/

Charm++Workshop2016 27

Funding AcknowledgmentsFundingGrantProviders

InfrastructureProviders

Charm++Workshop2016

Programming Models and Runtime Systems GroupGroupLead– PavanBalaji(computerscientistandgroup

lead)

CurrentStaffMembers– Abdelhalim Amer (postdoc)– Yanfei Guo (postdoc)– RobLatham(developer)– LenaOden(postdoc)– KenRaffenetti (developer)– Sangmin Seo (assistantcomputerscientist)– MinSi(postdoc)– MinTian(visitingscholar)

PastStaffMembers– AntonioPena(postdoc)– WesleyBland(postdoc)– DariusT.Buntinas (developer)– JamesS.Dinan (postdoc)– DavidJ.Goodell (developer)– Huiwei Lu(postdoc)– YanjieWei(visitingscholar)– Yuqing Xiong (visitingscholar)

– JianYu(visitingscholar)

– Junchao Zhang(postdoc)

– Xiaomin Zhu(visitingscholar)

– AshwinAji (Ph.D.)– Abdelhalim Amer (Ph.D.)– Md.Humayun Arafat(Ph.D.)– AlexBrooks(Ph.D.)– AdrianCastello (Ph.D.)– Dazhao Cheng(Ph.D.)– JamesS.Dinan (Ph.D.)– Piotr Fidkowski (Ph.D.)– PriyankaGhosh(Ph.D.)– SayanGhosh (Ph.D.)– RalfGunter(B.S.)– Jichi Guo (Ph.D.)– Yanfei Guo (Ph.D.)– MariusHorga (M.S.)– John Jenkins(Ph.D.)– Feng Ji (Ph.D.)– PingLai(Ph.D.)– Palden Lama(Ph.D.)– YanLi(Ph.D.)– Huiwei Lu(Ph.D.)– JintaoMeng (Ph.D.)– GaneshNarayanaswamy (M.S.)– Qingpeng Niu (Ph.D.)– Ziaul Haque Olive(Ph.D.)– DavidOzog (Ph.D.)

– Renbo Pang(Ph.D.)– Sreeram Potluri (Ph.D.)– LiRao (M.S.)– Gopal Santhanaraman (Ph.D.)– ThomasScogland (Ph.D.)– MinSi(Ph.D.)– BrianSkjerven (Ph.D.)– RajeshSudarsan (Ph.D.)– LukaszWesolowski (Ph.D.)– Shucai Xiao(Ph.D.)– Chaoran Yang(Ph.D.)– Boyu Zhang(Ph.D.)– Xiuxia Zhang(Ph.D.)– XinZhao(Ph.D.)

Advisory Board– PeteBeckman(seniorscientist)– RustyLusk (retired,STA)– MarcSnir (divisiondirector)– RajeevThakur (deputydirector)

Current andRecentStudents

Charm++Workshop2016

Q&A

§ Thankyouforyourattention!

Questions?

Charm++Workshop2016

Recommended