Active Harmony and the Chapel HPC Language
Ray Chen, UMD
Jeff Hollingsworth, UMD
Michael P. Ferguson, LTS
Harmony Overview
• Harmony system based on feedback loop
[Figure: feedback loop in which the Harmony Server sends parameter values to the Application and the Application returns measured performance]
Simplex Algorithms
• Nelder-Mead
• Parallel Rank Ordering
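To make the feedback loop concrete, here is a minimal Python sketch of a simplified Nelder-Mead search. It is not Harmony's implementation: the names `nelder_mead` and `measure` are hypothetical, parameters are treated as real numbers (real tuning parameters are often integers), and `measure` stands in for "run the application with these values and report the measured cost".

```python
from typing import Callable, List, Sequence


def nelder_mead(measure: Callable[[Sequence[float]], float],
                start: Sequence[float],
                step: float = 1.0,
                max_evals: int = 200) -> List[float]:
    """Minimize measure(point) with a basic Nelder-Mead simplex search."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += step
        simplex.append(p)
    scores = [measure(p) for p in simplex]
    evals = n + 1

    while evals < max_evals:
        # Order vertices from best (lowest cost) to worst.
        order = sorted(range(n + 1), key=lambda k: scores[k])
        simplex = [simplex[k] for k in order]
        scores = [scores[k] for k in order]
        worst = simplex[-1]
        # Centroid of every vertex except the worst.
        centroid = [sum(p[d] for p in simplex[:-1]) / n for d in range(n)]

        def along(t: float) -> List[float]:
            # t = 1 reflects the worst vertex through the centroid,
            # t = 2 expands further, t = -0.5 contracts toward the centroid.
            return [centroid[d] + t * (centroid[d] - worst[d]) for d in range(n)]

        reflected = along(1.0)
        r_score = measure(reflected)
        evals += 1
        if r_score < scores[0]:
            # New best point: try going further in the same direction.
            expanded = along(2.0)
            e_score = measure(expanded)
            evals += 1
            if e_score < r_score:
                simplex[-1], scores[-1] = expanded, e_score
            else:
                simplex[-1], scores[-1] = reflected, r_score
        elif r_score < scores[-2]:
            simplex[-1], scores[-1] = reflected, r_score
        else:
            contracted = along(-0.5)
            c_score = measure(contracted)
            evals += 1
            if c_score < scores[-1]:
                simplex[-1], scores[-1] = contracted, c_score
            else:
                # Shrink every vertex halfway toward the best one.
                best = simplex[0]
                for k in range(1, n + 1):
                    simplex[k] = [best[d] + 0.5 * (simplex[k][d] - best[d])
                                  for d in range(n)]
                    scores[k] = measure(simplex[k])
                evals += n

    best_index = min(range(n + 1), key=lambda k: scores[k])
    return simplex[best_index]
```

Each call to `measure` corresponds to one trip around the loop: parameter values go out, a performance measurement comes back. Nelder-Mead evaluates one candidate at a time, which is why a rank-ordering variant that evaluates several candidates concurrently is attractive on parallel machines.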
Tuning Granularity
• Initial Parameter Tuning
  o Application treated as a black box
  o Test parameters delivered during application launch
  o Application executes once per test configuration
• Internal Application Tuning
  o Specific internal functions or loops tuned
  o Possibly multiple locations within application
  o Multiple executions required to test configurations
• Run-time Tuning
  o Application modified to communicate with server mid-run
  o Only one run of the application needed
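The coarsest granularity, initial parameter tuning, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Harmony's client: `command_line`, `timed_run`, and `tune_at_launch` are hypothetical names, parameters are passed as `--name=value` flags, and an exhaustive sweep is swapped in for the simplex search to keep the example short.

```python
import itertools
import subprocess
import time


def command_line(app, config):
    """Build an argv list that passes each parameter as --name=value."""
    return list(app) + [f"--{name}={value}" for name, value in config.items()]


def timed_run(cmd):
    """Launch the black-box application once and return its wall-clock time."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def tune_at_launch(app, space, run=timed_run):
    """Run the application once per configuration and keep the fastest one.

    `space` maps each parameter name to the list of values to try.
    `run` takes an argv list and returns a cost (lower is better).
    """
    best = None
    for values in itertools.product(*space.values()):
        config = dict(zip(space.keys(), values))
        elapsed = run(command_line(app, config))
        if best is None or elapsed < best[1]:
            best = (config, elapsed)
    return best
```

A call such as `tune_at_launch(["./a.out"], {"numLocales": [1, 2, 4]})` would launch the program three times. Run-time tuning avoids those repeated launches by moving the fetch and report steps inside a single execution.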
Example Application
• SMG2000
  o 6-dimensional space
    o 3 tiling factors
    o 2 unrolling factors
    o 1 compiler choice
  o 20 search steps
• Performance gain
  o 2.37x for residual computation
  o 1.27x for the full application
The Irony of Auto-Tuning
• Intensely manual process
  o High cost of adoption
• Requires application specific knowledge
  o Tunable variable identification
  o Value range determination
  o Hotspot identification
  o Critical section modification at safe points
• Can auto-tuning be more automatic?
Towards Automatic Auto-tuning
• Reducing the burden on the end-user
• Three questions must be answered
  o What parameters are candidates for auto-tuning?
  o Where are the best code regions for auto-tuning?
  o When should we apply auto-tuning?
Our Goals
• Maximize return from minimal investment
  o Use profiling feature as a model
  o Should be enabled with a runtime flag
  o Aim to provide auto-tuning benefits within one execution
• Minimize language extension
  o Applications should be used as originally written
• Non-trivial goals with C/C++/Fortran
  o Are there any alternatives?
Chapel Overview
• Parallel programming language
  o Led by Cray Inc.
  o “Chapel strives to vastly improve the programmability of large-scale parallel computers while matching or beating the performance and portability of current programming models like MPI.”
Chapel Methodology
Type of HW Parallelism | Programming Model | Unit of Parallelism
Inter-node | MPI | executable
Intra-node/multi-core | OpenMP/pthreads | iteration/task
Instruction-level vectors/threads | pragmas | iteration
GPU/accelerator | CUDA/OpenCL/OpenACC | SIMD function/task
Content courtesy of Cray Inc.
Chapel Data Parallelism
• Only domains and forall loops required
  o Forall loop used with arrays to distribute work
  o Domains used to control distribution
  o A generalization of ZPL’s region concept
Chapel Task Parallelism
• Three constructs used to express control-based parallelism
  o begin – “fire and forget”
  o cobegin – heterogeneous tasks
  o coforall – homogeneous tasks

begin writeln("hello world");
writeln("good bye");

cobegin {
  consumer(1);
  consumer(2);
  producer();
} // wait here for all three tasks to complete

begin producer();
coforall i in 1..numConsumers {
  consumer(i);
} // wait here for all consumers to return
Chapel Locales
• MPI (SPMD) Functionality

writeln("start on locale 0");
on Locales(1) do
  writeln("now on locale 1");
writeln("on locale 0 again");

proc main() {
  coforall loc in Locales do
    on loc do
      MySPMDProgram(loc.id, Locales.numElements);
}

proc MySPMDProgram(me, p) {
  writeln("Hello from node ", me);
}
Chapel Config Variables
config const numLocales: int;
const LocaleSpace: domain(1) = [0..numLocales-1];
const Locales: [LocaleSpace] locale;

% a.out --numLocales=4
Hello from node 3
Hello from node 0
Hello from node 1
Hello from node 2
Leveraging Chapel
• Helpful design goals
  o Expressing parallelism and locality is the user’s responsibility
  o Not the compiler’s
• Chapel source effectively pre-annotated
  o Config variables help to locate candidate tuning parameters
  o Parallel looping constructs help to locate hotspots
Current Progress
• Harmony Client API ported to Chapel
  o Uses Chapel’s foreign function interface
  o Chapel client module to be added to next Harmony release
• Achieves the current state of auto-tuning
  o What to tune
    o Parameters must be determined by a domain expert
    o Manually register each parameter and value range
  o Where to tune
    o Critical loop must be determined by a domain expert
    o Manually fetch and report performance at safe points
  o When to tune
    o Tuning enabled once manual changes are complete
Improving the “What”
• Leverage Chapel’s “config” variable type
    config const someArg = 5;
  o A slight syntax extension would help everybody (proposed form, not current Chapel):
    config const someArg = 5 in 1..100 by 2;
• Not a silver bullet
  o False positives and false negatives definitely exist
  o Goes a long way towards reducing candidate variables
  o Chapel built-in candidate variables
    dataParTasksPerLocale
    dataParIgnoreRunningTasks
    dataParMinGranularity
    numLocales
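As a rough illustration of how `config` declarations could be harvested as candidate tuning parameters, the Python sketch below scans Chapel source text with a regular expression. It is an assumption-laden toy, not a Chapel parser: `CONFIG_DECL` and `find_config_candidates` are hypothetical names, only simple one-line declarations are recognized, and the `in lo..hi by step` suffix is the proposed extension rather than existing Chapel syntax.

```python
import re

# Matches simple declarations such as:
#   config const numLocales: int;
#   config const someArg = 5 in 1..100 by 2;   (proposed range extension)
CONFIG_DECL = re.compile(
    r"^\s*config\s+(const|var|param)\s+(\w+)"              # kind and name
    r"(?:\s*:\s*([\w()]+))?"                               # optional type
    r"(?:\s*=\s*([^;]+?))?"                                # optional default
    r"(?:\s+in\s+(-?\d+)\.\.(-?\d+)(?:\s+by\s+(\d+))?)?"   # optional range
    r"\s*;",
    re.MULTILINE,
)


def find_config_candidates(source):
    """Return one dict per config declaration found in the source text."""
    found = []
    for match in CONFIG_DECL.finditer(source):
        kind, name, type_name, default, lo, hi, step = match.groups()
        entry = {
            "name": name,
            "kind": kind,
            "type": type_name,
            "default": default.strip() if default else None,
            "values": None,
        }
        if lo is not None:
            # Chapel ranges include both bounds, so add 1 to the upper bound.
            entry["values"] = range(int(lo), int(hi) + 1, int(step or 1))
        found.append(entry)
    return found
```

Declarations without a range still show up as candidates, but a person (or a default heuristic) would have to supply the value range, which is exactly the gap the syntax extension is meant to close.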
Improving the “Where”
• Naïve approach
  o Modify all parallel loop constructs
    o Fetch new config values at loop head
    o Report performance at loop tail
  o Use PRO to efficiently search parameter space in parallel
• Poses open questions
  o How to know if config values are safe to modify mid-execution?
  o How to handle nested parallel loops?
  o How to prevent overhead explosion?
• Solutions outside the scope of this project
  o But we’ve got some ideas...
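The naïve approach of fetching at the loop head and reporting at the loop tail can be sketched in Python with a context manager. Everything here is hypothetical: `LocalTuner` is an in-process stand-in for a tuning server, it simply tries each candidate once and then reuses the fastest (no PRO search), and the injectable `clock` exists only so the example can be checked deterministically.

```python
import time
from contextlib import contextmanager


class LocalTuner:
    """Hypothetical in-process stand-in for a tuning server."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.timings = {}   # candidate value -> best elapsed time seen
        self._next = 0

    def fetch(self):
        # Hand out each candidate once, then keep returning the fastest one.
        if self._next < len(self.candidates):
            value = self.candidates[self._next]
            self._next += 1
            return value
        return min(self.timings, key=self.timings.get)

    def report(self, value, elapsed):
        previous = self.timings.get(value, elapsed)
        self.timings[value] = min(previous, elapsed)


@contextmanager
def tuned_region(tuner, clock=time.perf_counter):
    value = tuner.fetch()                          # loop head: fetch a value
    start = clock()
    try:
        yield value
    finally:
        tuner.report(value, clock() - start)       # loop tail: report time
```

Usage would look like `with tuned_region(tuner) as chunk: process(data, chunk)`, where `process` is whatever loop body is being tuned. The sketch sidesteps the open questions listed above: it assumes the value is safe to change between iterations and that regions are not nested.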
What’s Possible?
• Target pre-run optimization instead
  o Run small snippet of code pre-main
  o Determine optimal values to be used prior to execution
• Example: Cache optimization
  o Explore element size and stride
    o Pad array elements to fit size
    o Define domains
  o Automatically optimize for cache size and eviction strategy
  o Further increase performance portability
• Generate library of performance unit-tests
  o Bundle with Chapel for distribution
Improving the “When”
• Auto-tuning should be simple to enable
  o Use profiling as a model (just add -pg to the compiler flags)
• System should be self-reliant
  o Local server must be launched with application
Open Questions
• Automatic hotspot detection
  o Time spent in loop
  o Variables manipulated in loop
  o How to determine correctness-safe modification points
    o Static analysis?
• Moving to other languages
  o C/Fortran lacking needed annotations
  o More static analysis?
• Why avoid language extension?
  o Is it really so bad?