PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications

Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé

Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

[email protected]

June 10, 2014

Yanhua Sun Parallel Programming Laboratory, UIUC 1/25

Motivation

1 Modern parallel computer systems are becoming extremely complex due to network topologies, hierarchical storage systems, heterogeneous processing units, etc.

2 Obtaining the best performance is challenging.

3 Moreover, the same application can be run with multiple configurations.

[Figure: time (us) versus number of messages (1 to 64) used to send data; series: 1M (f:0.03125)]

Introspection and Adaptivity

General Observation

Configurations of tunable parameters in the runtime system and applications significantly affect the performance.

Top Ten Exascale Research Challenges in DOE Report

"Introspection and automatic adaptation is listed as significant research topic to achieve the performance goal on exascale computers."

Statement

This work addresses the problem of how to improve both parallel programming productivity and performance by letting applications and the runtime expose tunable parameters and letting the control system figure out the optimal configurations of these parameters.

Related work

Autotuning frameworks: generate multiple implementations (FFTW)

Autopilot [Ribler et al. (1998)]: fuzzy logic rules, grid applications, resource management

MATE [Morajko 2006]: fully automatic tuning, performance model

Active Harmony [Chung and Hollingsworth (2006)]: heuristic algorithms

SEEC: A General and Extensible Framework for Self-Aware Computing [Henry Hoffmann (2010, 2011, 2013)]

Our Approach

Targets HPC applications at large scale

Does not rely on performance models

Richer set of tunable parameters, thanks to the powerful, intelligent runtime system

Not only application configurations are tuned, but also the runtime system itself

Automatic performance analysis accelerates steering

Outline

Overview of PICS framework

Control points in the runtime system and applications

Automatic performance analysis to accelerate steering

APIs implemented in Charm++

Results of benchmarks and applications

Overview of PICS framework

[Diagram: components of the PICS framework. Box labels:]

Performance instrumentation

Automatic performance analysis

Runtime control points

Runtime reconfiguration

Controller

Mini apps, real-world applications

Application control points

Application reconfiguration

PICS

Adaptive runtime system

Applications

Performance data

Expert knowledge rules

Control Points

Control points

Control points are tunable parameters through which the application and runtime interact with the control system. First proposed in Dooley's research.

1 Name, Values: default, min, max

2 Movement unit: +1, ×2

3 Effects, directions

  Degree of parallelism
  Grain size
  Priority
  Memory usage
  GPU load
  Message size
  Number of messages
  Other effects
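
A minimal sketch of how a movement unit could be applied to a control point value. Only the default/min/max range and the two movement styles (+1 and ×2) come from the slide; the names `MoveOp`, `TunableValue`, and `step` are hypothetical and are not part of PICS.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical illustration, not the PICS implementation.
enum class MoveOp { Add, Multiply };  // "+1" style vs "x2" style movement

struct TunableValue {
    double current;   // starts at the default value
    double min;       // lower bound of the legal range
    double max;       // upper bound of the legal range
    double moveUnit;  // 1 for "+1", 2 for "x2"
    MoveOp op;
};

// Move the value one unit up (direction > 0) or down (direction <= 0),
// then clamp the result to [min, max].
double step(const TunableValue& v, int direction) {
    double next = v.current;
    if (v.op == MoveOp::Add) {
        next += (direction > 0 ? v.moveUnit : -v.moveUnit);
    } else {
        next = (direction > 0) ? next * v.moveUnit : next / v.moveUnit;
    }
    return std::max(v.min, std::min(v.max, next));
}
```

A multiplicative unit suits parameters that span orders of magnitude, such as the number of messages (1 to 64) in the motivation plot, while an additive unit suits small integer ranges.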

Application and Runtime Control Points

Application

1 Application specific control points provided by users

2 Applications should be able to reconfigure to use new values

Runtime

1 Traditionally, configurations for the runtime system do not change

2 Configurations for the runtime system itself should be tunable

1 Registered by runtime itself

2 Requires no change from applications

3 Affect all applications

Observe Program Behaviors

Record all events

Events: begin idle, end idle

Functions: name, begin execution, end execution

Communication: message creation, size, source/destination

Hardware counters

Module link, no source code modification

Performance summary data

Automatically Analyze the Performance

Many control points are registered. How to reduce the search space?

Performance Analysis - Identify Program Problems

Decomposition

Mapping

Scheduling

Decomposition Characteristics

Decomposition problem?

Symptoms: high cache miss rate; (1) entry method too big; (2) single object too big; (3) too much critical path; (4) too few objects per PE; bytes per message low; too much communication on one object

Solutions: decrease grain size; increase grain size; replicate the objects

Mapping Characteristics

Mapping problem?

Symptoms: load imbalance; too much communication on one PE; communication time >> LogP model time; too much external communication

Solutions: load balancer; remap; topology-aware mapping
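
One way to read the test "communication time >> LogP model time": under the LogP model, a single short message is predicted to take the network latency L plus the send and receive overheads o. The factor α below is an assumption for illustration, since the slide does not say how much larger counts as "much larger".

```latex
T_{\text{LogP}} = L + 2o,
\qquad
\text{suspect a mapping problem when } T_{\text{measured}} > \alpha \, T_{\text{LogP}}, \quad \alpha \gg 1
```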

Scheduling Characteristics

Scheduling problem?

Symptom: critical tasks are delayed

Solution: prioritize the tasks

Other Characteristics

Other problems?

Symptoms: bytes per message low; reduction/broadcast; long latency

Solutions: aggregate messages; collectives; compress messages

Correlate Performance with Control Points

[Figure: decision tree that maps the performance summary to solutions. Node labels in reading order:]

Performance summary: CPU utilization > 90%; overhead > 10%; idle > 10%

Sequential performance? Cache miss > 10%. Decrease grain size.

Small entry methods; small bytes per message. Increase grain size.

Decomposition problem? Longer entry method; larger single object; long critical path; few objects per PE; large communication on one object. Decrease grain size.

Mapping problem? Load imbalance; large communication on one PE; communication time >> model time; large external communication. Load balancer; remap; compress message.

Scheduling problem? Critical tasks are delayed. Prioritize the tasks.

Others? Large bytes per message; long reduction/broadcast; long latency for big msgs. Increase aggregation threshold; decrease aggregation threshold.

Further solutions: collectives; replicate objects; topology-aware mapping.

One box can have multiple children

One egg can have multiple parents

Correlate Performance with Control Points

Traverse the tree using the performance summary results

performance results ⇒ solutions

solution ⇒ effect of control points

What control points to tune, in which direction!

How much?

grainsize: MaxObjLoad / AvgLoad

Feed results into the control points database
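
The traversal described above can be sketched as a few threshold rules over the performance summary. This is a simplified, assumed subset of the tree, not the PICS rule set. The thresholds (90% utilization, 10% overhead, 10% idle, 10% cache miss) and the MaxObjLoad / AvgLoad ratio come from the slides. The pairing of each symptom with a direction follows the reading order of the tree and is an assumption, and all type and function names are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical illustration, not the PICS implementation.
struct PerfSummary {
    double cpuUtilization;  // fraction of time doing useful work, 0..1
    double overhead;        // fraction of time in runtime overhead, 0..1
    double idle;            // fraction of time idle, 0..1
    double cacheMissRate;   // 0..1
    double maxObjLoad;      // load of the heaviest object
    double avgLoad;         // average load
};

enum class Direction { Decrease, Increase };

struct Suggestion {
    std::string effect;   // which effect to tune, e.g. "grainsize"
    Direction direction;  // which way to move control points with that effect
};

// Walk a small, assumed subset of the rule tree and collect suggestions.
std::vector<Suggestion> diagnose(const PerfSummary& s) {
    std::vector<Suggestion> out;
    // Busy processors with poor cache behaviour: try smaller grains.
    if (s.cpuUtilization > 0.90 && s.cacheMissRate > 0.10) {
        out.push_back({"grainsize", Direction::Decrease});
    }
    // High runtime overhead: try larger grains.
    if (s.overhead > 0.10) {
        out.push_back({"grainsize", Direction::Increase});
    }
    // Idle processors while one object is heavier than average: split it.
    if (s.idle > 0.10 && s.avgLoad > 0.0 && s.maxObjLoad > s.avgLoad) {
        out.push_back({"grainsize", Direction::Decrease});
    }
    return out;
}

// "How much?": the ratio the slide gives for grain size.
double grainsizeRatio(const PerfSummary& s) {
    return s.avgLoad > 0.0 ? s.maxObjLoad / s.avgLoad : 1.0;
}
```

Each suggestion names an effect and a direction rather than a specific parameter, so any registered control point that declares that effect becomes a candidate for tuning. This is how the analysis narrows the search space.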

Control System APIs

Implemented in Charm++, an over-decomposed, asynchronous, message-driven programming model (http://charm.cs.uiuc.edu/).

    typedef struct ControlPoint_t {
        char name[30];
        enum TP_DATATYPE datatype;
        double defaultValue;
        double currentValue;
        double minValue;
        double maxValue;
        double bestValue;
        double moveUnit;      /* size of one movement, e.g. 1 or 2 */
        int moveOP;           /* movement operation, e.g. add or multiply */
        int effect;           /* effect of this control point */
        int effectDirection;  /* direction of that effect */
        int strategy;
        int entryEP;
        int objectID;
    } ControlPoint;

APIs for applications

    void registerControlPoint(ControlPoint *tp);

    void startStep();
    void endStep();

    double getTunedParameter(const char *name, bool *valid);
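
A sketch of how an application might use these four calls. The signatures and struct fields are taken from the slides as best they can be read. The function bodies are stand-in stubs so the example compiles without Charm++ and are not the PICS implementation. The `TP_DATATYPE` values, the control point name `pipeline_messages`, and the order of calls inside a step are assumptions.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Enum values are hypothetical; the slides do not list them.
enum TP_DATATYPE { TP_INT, TP_DOUBLE };

typedef struct ControlPoint_t {
    char name[30];
    enum TP_DATATYPE datatype;
    double defaultValue, currentValue, minValue, maxValue, bestValue;
    double moveUnit;
    int moveOP, effect, effectDirection, strategy, entryEP, objectID;
} ControlPoint;

// --- Stand-in stubs so the sketch compiles without Charm++ ---
static std::map<std::string, ControlPoint> g_points;
static int g_stepsCompleted = 0;

void registerControlPoint(ControlPoint* tp) { g_points[tp->name] = *tp; }
void startStep() {}
void endStep() { ++g_stepsCompleted; }
double getTunedParameter(const char* name, bool* valid) {
    auto it = g_points.find(name);
    *valid = (it != g_points.end());
    return *valid ? it->second.currentValue : 0.0;
}

// --- Application side ---
int runSteps(int steps) {
    ControlPoint cp;
    std::memset(&cp, 0, sizeof(cp));
    std::strncpy(cp.name, "pipeline_messages", sizeof(cp.name) - 1);
    cp.datatype = TP_INT;
    cp.defaultValue = cp.currentValue = 4;
    cp.minValue = 1;
    cp.maxValue = 64;
    cp.moveUnit = 2;  // intended as a "x2" movement
    registerControlPoint(&cp);

    int numMessages = 4;
    for (int i = 0; i < steps; ++i) {
        startStep();
        bool valid = false;
        double tuned = getTunedParameter("pipeline_messages", &valid);
        if (valid) numMessages = static_cast<int>(tuned);
        // ... the application's work for this step would use numMessages ...
        endStep();
    }
    return numMessages;
}
```

Bracketing each iteration with `startStep()` and `endStep()` gives the control system a unit over which to measure performance. Checking `valid` before using the returned value lets the application keep its own default when no tuned value is available.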

Experimental Results of Benchmarks and Applications

1 Control points

2 Performance problems

3 Blue Gene/Q machine, Cray XE6 machine

Tuning Message Pipeline

Control point: number of pipeline messages

[Plot: timestep (ms/step) and number of pipeline messages versus step; series: timestep (less work), timestep (more work), pipeline (less work), pipeline (more work)]

Figure: Tuning the number of pipeline messages

Communication Bottleneck in ChaNGa

Control points: number of mirrors

[Plot: time (s/step) versus steps; series: tune mirrors with PICS, no mirrors]

Figure: Time cost of calculating gravity for various mirrors and no mirror on 16k cores on Blue Gene/Q

Message Compression

Control points: compression algorithm for each message type

Runtime control points

[Plot: timestep (ms/step) versus step; series: r1=0.1, r2=1.0]

Figure: Steering the compression algorithm for all-to-all benchmark

Jacobi3d Performance Steering

Control points: sub-block size in each dimension

Three control points

High cache miss rate and high idle time suggest decreasing the sub-block size

Overhead

[Plot: timestep (ms/step) versus step; series: total time, idle time, cpu time, runtime overhead]

Figure: Jacobi3d performance steering on 64 cores for a problem of size 1024*1024*1024

Conclusion

An introspective control system is required to improve productivity and performance

Automatic performance analysis helps guide performance steering

Steering both the runtime system and applications is important

Implemented the system based on the Charm++ programming model

Acknowledgment

This work was supported in part by NIH Grant 9P41GM104601, Center for Macromolecular Modeling and Bioinformatics. It was also supported in part by DOE DE-AC02-06CH11357 Argo Project. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory.
