Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting
Bernd Mohr, Felix Wolf
Forschungszentrum Jülich
John von Neumann - Institut für Computing
Zentralinstitut für Angewandte Mathematik
52425 Jülich
{b.mohr,f.wolf}@fz-juelich.de
Allen Malony, Sameer Shende
University of Oregon
Department of Computer and Information Science
Eugene, Oregon 97403
{malony,sameer}@cs.uoregon.edu
© 2001 Forschungszentrum Jülich, University of Oregon [2]
Outline
• Introduction
• Proposed OpenMP Performance Tool Interface
• Prototype Implementation
• Examples
• Future Work
Introduction
• Motivation
  • “Standard” OpenMP performance tools interface, similar in spirit to the MPI profiling interface (PMPI)
• Goals
  • Expose OpenMP parallel execution to the performance measurement system
  • Define it at the abstraction level of the OpenMP programming model
  • Make the performance measurement interface portable
    – across different platforms
    – across all OpenMP supported languages
    – across different performance tools
  • Allow flexibility in how the interface is applied
Proposed OpenMP Performance Tool Interface
• POMP
  • OpenMP Directive Instrumentation
  • OpenMP Runtime Library Routine Instrumentation
  • Performance Monitoring Library Control
  • User Code Instrumentation
  • Context Descriptors
  • Conditional Compilation
  • Conditional / Selective Transformations
• Remarks
  • C/C++ OpenMP Pragma Instrumentation
  • Implementation Issues
  • Open Issues
OpenMP Directive Instrumentation
• Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
  • NAME: name of the OpenMP construct
  • TYPE:
    – fork, join: mark change in parallelism grade
    – enter, exit: flag entering/exiting OpenMP construct
    – begin, end: mark start/end of body of construct
  • d: context descriptor
• Observation of implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs
  • Add NOWAIT to construct
  • Make barrier explicit
Example: !$OMP PARALLEL DO Instrumentation
Original:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

Instrumented:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)
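The same rewriting can be sketched in C for a combined parallel for. This is only an illustration under stated assumptions: the pomp_* functions below are stand-in stubs that count events (they are not the real POMP library), the descriptor type Descr is a reduced, hypothetical one, and fill_doubled is an invented example routine.

```c
#include <assert.h>

/* Hypothetical, reduced context descriptor: just a construct name. */
typedef struct { const char *name; } Descr;

/* Event counters; the stubs below stand in for a real POMP library. */
static int n_fork, n_join, n_begin, n_end;
static int n_for_enter, n_for_exit, n_bar_enter, n_bar_exit;

/* Thread-safe increment (the pragma is ignored in a serial build). */
static void bump(int *counter) {
#pragma omp atomic
    (*counter)++;
}

void pomp_parallel_fork(Descr *d)  { (void)d; bump(&n_fork); }
void pomp_parallel_join(Descr *d)  { (void)d; bump(&n_join); }
void pomp_parallel_begin(Descr *d) { (void)d; bump(&n_begin); }
void pomp_parallel_end(Descr *d)   { (void)d; bump(&n_end); }
void pomp_for_enter(Descr *d)      { (void)d; bump(&n_for_enter); }
void pomp_for_exit(Descr *d)       { (void)d; bump(&n_for_exit); }
void pomp_barrier_enter(Descr *d)  { (void)d; bump(&n_bar_enter); }
void pomp_barrier_exit(Descr *d)   { (void)d; bump(&n_bar_exit); }

/* Instrumented form of:
 *   #pragma omp parallel for
 *   for (i = 0; i < n; i++) a[i] = 2 * i;
 */
void fill_doubled(int *a, int n) {
    static Descr d = { "parallel for" };

    pomp_parallel_fork(&d);            /* master thread, before the fork */
#pragma omp parallel
    {
        pomp_parallel_begin(&d);       /* every thread of the team */
        pomp_for_enter(&d);
#pragma omp for nowait                 /* implicit barrier removed ... */
        for (int i = 0; i < n; i++)
            a[i] = 2 * i;
        pomp_barrier_enter(&d);
#pragma omp barrier                    /* ... and made explicit */
        pomp_barrier_exit(&d);
        pomp_for_exit(&d);
        pomp_parallel_end(&d);
    }
    pomp_parallel_join(&d);            /* master thread, after the join */

    /* Every thread that began the region must also have ended it. */
    assert(n_begin == n_end);
}
```

Splitting the combined construct is what makes the time spent waiting in the barrier visible as its own enter/exit pair, separate from the loop body.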
OpenMP Runtime Library Routine Instrumentation
• Transform
  • omp_###_lock() → pomp_###_lock()
  • omp_###_nest_lock() → pomp_###_nest_lock()
  [ ### = init | destroy | set | unset | test ]
• POMP version
  • Calls omp version internally
  • Can do extra stuff before and after call
• Transformations of other OpenMP API functions necessary?
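A wrapper of this kind might look like the following sketch. The wrapper signatures are an assumption (they simply mirror the omp_* lock routines), the lock_acquisitions counter is a made-up measurement, and the serial stand-ins in the #else branch exist only so the example also builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Hypothetical measurement: how often a lock was acquired. */
static long lock_acquisitions = 0;

void pomp_init_lock(omp_lock_t *l) {
    assert(l != NULL);
    omp_init_lock(l);
}

void pomp_set_lock(omp_lock_t *l) {
    assert(l != NULL);
    /* "Extra stuff" before the call, e.g. start a wait-time timer. */
    omp_set_lock(l);                 /* the omp version does the real work */
    /* "Extra stuff" after the call: record the acquisition. */
#pragma omp atomic
    lock_acquisitions++;
}

void pomp_unset_lock(omp_lock_t *l) {
    assert(l != NULL);
    omp_unset_lock(l);
}

void pomp_destroy_lock(omp_lock_t *l) {
    assert(l != NULL);
    omp_destroy_lock(l);
}
```

The instrumented program then calls pomp_set_lock(&lock) wherever the original called omp_set_lock(&lock); the test and nest_lock variants would follow the same pattern.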
Performance Monitoring Library Control
• Give programmer control over performance monitoring at runtime
  • !$OMP INST [ INIT | FINALIZE | ON | OFF ]
• Translated into
  • pomp_init(), pomp_finalize()
  • pomp_on(), pomp_off()
• Ignored in “normal” OpenMP compilation mode
• Alternatives
  • !$POMP?
  • Use conditional compilation with explicit POMP calls
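One way a monitoring library could implement these four calls is a pair of flags checked by every event hook. This is a sketch only: the slides do not say whether pomp_init() switches monitoring on by default (assumed here), and pomp_record() is a hypothetical stand-in for any event hook.

```c
#include <assert.h>
#include <stdbool.h>

static bool pomp_initialized = false;
static bool pomp_tracing     = false;
static int  recorded_events  = 0;

/* Assumption: initialization also switches monitoring on. */
void pomp_init(void) { pomp_initialized = true; pomp_tracing = true; }

void pomp_finalize(void) {
    assert(pomp_initialized);        /* finalize without init is a usage error */
    pomp_tracing = false;
    pomp_initialized = false;
}

void pomp_on(void)  { if (pomp_initialized) pomp_tracing = true; }
void pomp_off(void) { pomp_tracing = false; }

/* Hypothetical event hook: records only while monitoring is switched on. */
void pomp_record(void) {
    if (pomp_tracing)
        recorded_events++;
}
```

With this layout, code between pomp_off() and pomp_on() still executes its hooks, but they return after a single flag test.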
User Code Instrumentation
• Compiler / transformation tool should insert
  • pomp_begin(d)
  • pomp_end(d)
  calls at beginning and end of each(?) user function
• Allow user-specified arbitrary (non-function) code regions
  • !$OMP INST BEGIN ( <region name> )
      arbitrary user code
    !$OMP INST END ( <region name> )
• Alternatives
  • !$POMP?
  • Use conditional compilation with explicit POMP calls
descriptor?
Context Descriptors
• Describe execution contexts through context descriptor
    typedef struct ompregdescr {
      char name[];                   /* construct */
      char sub_name[];               /* region name */
      int num_sections;
      char filename[];               /* src filename */
      int begin_line1, begin_lineN;  /* begin line # */
      int end_line1, end_lineN;      /* end line # */
      WORD data[4];                  /* perf. data */
      struct ompregdescr* next;
    } OMPRegDescr;
• Generate context descriptors in global static memory:
    OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };
• Pass address to POMP functions
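A compilable variant of the descriptor is sketched below. Two details are assumptions rather than part of the slide: the string members are declared as const char* (standard C does not allow several unsized char arrays in one struct), and WORD is taken to be a pointer-sized integer so that a tool can park a pointer in data[]. The helper region_line_count is hypothetical and only shows how a POMP function would read the descriptor through its address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t WORD;   /* assumption: pointer-sized scratch word */

typedef struct ompregdescr {
    const char *name;                 /* construct */
    const char *sub_name;             /* region name */
    int num_sections;
    const char *filename;             /* src filename */
    int begin_line1, begin_lineN;     /* begin line # */
    int end_line1, end_lineN;         /* end line # */
    WORD data[4];                     /* perf. data */
    struct ompregdescr *next;
} OMPRegDescr;

/* Static storage: initialized at compile time, and the members not
 * listed (data[] and next) start out as zero / NULL. */
static OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

/* Hypothetical helper: number of source lines the region spans. */
int region_line_count(const OMPRegDescr *r) {
    assert(r != NULL);
    return r->end_lineN - r->begin_line1 + 1;
}
```

A call such as pomp_critical_enter(&rd42675) then hands the library the construct name, the source position, and four words of tool-private storage in a single pointer argument.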
Conditional Compilation
• C, C++, [Fortran, if supported]
  • #ifdef _POMP
      arbitrary user code
    #endif
• Fortran Free Form
  • !P$ arbitrary user code
• Fortran Fixed Form
  • CP$ arbitrary
    *P$ user
    !P$ code
• Usual restrictions apply
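In C, the conditional-compilation form could be used as in this small sketch. It assumes that the symbol _POMP is defined only in an instrumented build; the function sum_below and the counter instrumented_calls are invented for the example.

```c
#include <assert.h>

static int instrumented_calls = 0;

/* Sum of the integers 0 .. n-1. */
int sum_below(int n) {
    assert(n >= 0);
#ifdef _POMP
    /* Compiled only in an instrumented build; vanishes otherwise. */
    instrumented_calls++;
#endif
    int total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}
```

In a normal build the guarded statement is removed by the preprocessor, so the uninstrumented program pays nothing for it.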
Conditional / Selective Transformations
• (Temporarily) disable / re-enable POMP instrumentation at compile time
  • !$OMP NOINSTRUMENT
  • !$OMP INSTRUMENT
• Alternative:
  • !$POMP?
C/C++ OpenMP Pragma Instrumentation
• No END pragmas
  • instrumentation for “closing” part follows structured block
  • adding nowait has to be done in the “opening part”
• #pragma omp XXX
  {
    pomp_###_begin(d);
    structured block;
    pomp_###_end(d);
  }
• Simple differences in language
  • no “call” keyword
  • “;”
  • !$OMP → #pragma omp
Example: #pragma omp sections Instrumentation
Original:
#pragma omp sections
{
  #pragma omp section
    structured block;
  #pragma omp section
    structured block;
}

Instrumented:
pomp_sections_enter(d);
#pragma omp sections nowait
{
  #pragma omp section
  { pomp_section_begin(d);
    structured block;
    pomp_section_end(d); }
  #pragma omp section
  { pomp_section_begin(d);
    structured block;
    pomp_section_end(d); }
}
pomp_barrier_enter(d);
#pragma omp barrier
pomp_barrier_exit(d);
pomp_sections_exit(d);
Implementation Issues
• pomp_NAME_TYPE(d) more efficient / simpler than pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...)
• Inlining of POMP calls possible
• Context descriptors
  • Full context information available, incl. source reference
  • But minimal runtime overhead
    – just one argument needs to be passed
    – no need to dynamically allocate memory for data!!
    – context data initialization at compile time
  • Context data is kept together with executable
  • Allows for separate compilation
• Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls
  • --pomp-disable=construct-list
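The "one argument, no dynamic allocation" point can be illustrated with a tool library that caches its own per-region record in the descriptor's data[] field on first use. Everything here is a hypothetical sketch: Descr is a reduced descriptor, ToolRegion and the fixed-size pool are invented, and the lazy initialization is not thread-safe as written (a real library would have to guard it).

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t WORD;                 /* assumption: pointer-sized word */

/* Reduced descriptor: construct name plus tool-private storage. */
typedef struct { const char *name; WORD data[4]; } Descr;

/* Hypothetical tool-side record, taken from a static pool (no malloc). */
typedef struct { int id; long visits; } ToolRegion;

enum { POOL_SIZE = 16 };
static ToolRegion pool[POOL_SIZE];
static int pool_used = 0;

/* Return the tool record for d, creating it on the first call. */
static ToolRegion *tool_region_for(Descr *d) {
    ToolRegion *t = (ToolRegion *)d->data[0];
    if (!t) {
        assert(pool_used < POOL_SIZE);
        t = &pool[pool_used];
        t->id = pool_used++;
        d->data[0] = (WORD)t;          /* cache it in the descriptor */
    }
    return t;
}

/* Event hook: one pointer argument carries the whole context. */
void pomp_critical_enter(Descr *d) {
    tool_region_for(d)->visits++;
}
```

After the first event, each later call costs one pointer load and a null test before the tool-specific work begins.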
Open Issues
• ORDERED?
• FLUSH?
• Instrumentation of PARALLEL DO / FOR loop iterations
  • Potentially allows measurement of influence of loop scheduling policies
  • Overhead??
• Allow passing additional user information to POMP library
  • Conditional compilation
  • Extra parameter to !$OMP INST BEGIN/END
  • ...
• Specification of extent of user code instrumentation
  • Additional pragmas/directives?
  • Separate (outside source code) specification?
• OpenMP Runtime Instrumentation necessary?
Prototype Implementation: OPARI
• OpenMP Pragma And Region Instrumentor (OPARI)
• Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions
• Supports
  • Fortran77 and Fortran90, OpenMP 2.0
  • C and C++, OpenMP 1.0
  • Runtime Library Control (init, finalize, on, off)
  • (Manual) User Code Instrumentation (begin, end)
  • Conditional Compilation (#ifdef _POMP, !P$)
  • Conditional / Selective Transformation ([no]instrument)
• Preserves source code information (#line line file)
• ~ 2000 lines of C++ code
OPARI
• Limitations
  • Fortran:
    – END DO and END PARALLEL DO directives required
    – atomic expression on line by itself
  • C/C++:
    – structured blocks: simple expression statement or block (compound statement)
    – Exception: for statement after parallel for
  • Could be fixed by enhancing OPARI’s parsing capabilities
• Source code and documentation available at http://www.fz-juelich.de/zam/kojak/opari/
Prototype Implementation: POMP Library
• EXtensible PERformance Tool (EXPERT)
  • Automatic event trace analyzer
  • http://www.fz-juelich.de/zam/kojak/expert/
• Tuning and Analysis Utilities (TAU)
  • Performance analysis framework
  • http://www.acl.lanl.gov/tau/
• Required ~ 1 day to implement tool specific POMP libraries
Prototype Implementation: EXPERT POMP Library
void pomp_for_enter(OMPRegDescr* r) {
  /* Get EPILOG region descriptor stored in r */
  ElgRegion* e = (ElgRegion*)(r->data[0]);
  /* If not yet there, initialize and store it */
  if (! e) e = ElgRegion_Init(r);
  /* Record enter event */
  elg_enter(e->rid);
}

void pomp_for_exit(OMPRegDescr* r) {
  /* Record collective exit event */
  elg_omp_collexit();
}
Prototype Implementation: TAU POMP Library
TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

void pomp_for_enter(OMPRegDescr* r) {
#ifdef TAU_AGGREGATE_OPENMP_TIMINGS
  TAU_GLOBAL_TIMER_START(tfor);
#endif
#ifdef TAU_OPENMP_REGION_VIEW
  TauStartOpenMPRegionTimer();
#endif
}

void pomp_for_exit(OMPRegDescr* r) {
  ...
}
Examples
• EXPERT
  • REMO: Weather Forecast
  • DKRZ Germany
  • MPI + OpenMP (experimental)
• TAU
  • Stommel: Ocean Circulation Simulation
  • SDSC
  • MPI + OpenMP
  • event trace based Vampir
  • profile based RACY
Future Work
• Measure typical POMP calling overhead
  • EPCC OpenMP Microbenchmarks?
• Investigate “formal” standardization with OpenMP forum [OpenMP Supplemental Standard?]
• OpenMP programmers
  – What do you expect from an OpenMP performance tool?
• Tool developers:
  – Download and try out OPARI
  – Implement POMP interface for your tool
  – Tell us about problems, comments, enhancements
• OpenMP ARB members
  – What do we need to do next?
Conclusion
• POMP OpenMP Performance Tool Interface
  • Portable
  • Flexible
  • Efficient
  • Defined at the abstraction level of the OpenMP programming model
  • Standard?
• Prototype Software
  • OpenMP Pragma And Region Instrumentor (OPARI)
    http://www.fz-juelich.de/zam/kojak/opari/
  • Tuning and Analysis Utilities (TAU)
    http://www.acl.lanl.gov/tau/
!$OMP PARALLEL Instrumentation
      call pomp_parallel_fork(d)
!$OMP PARALLEL
      call pomp_parallel_begin(d)
      structured block
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)
!$OMP DO Instrumentation
      call pomp_do_enter(d)
!$OMP DO
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
!$OMP WORKSHARE Instrumentation
      call pomp_workshare_enter(d)
!$OMP WORKSHARE
      structured block
!$OMP END WORKSHARE NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_workshare_exit(d)
!$OMP SECTIONS Instrumentation
      call pomp_sections_enter(d)
!$OMP SECTIONS
!$OMP SECTION
      call pomp_section_begin(d)
      structured block
      call pomp_section_end(d)
!$OMP SECTION
      call pomp_section_begin(d)
      structured block
      call pomp_section_end(d)
!$OMP END SECTIONS NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_sections_exit(d)
Synchronization Constructs Instrumentation 1
      call pomp_single_enter(d)
!$OMP SINGLE
      call pomp_single_begin(d)
      structured block
      call pomp_single_end(d)
!$OMP END SINGLE NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_single_exit(d)

!$OMP MASTER
      call pomp_master_begin(d)
      structured block
      call pomp_master_end(d)
!$OMP END MASTER
Synchronization Constructs Instrumentation 2
      call pomp_critical_enter(d)
!$OMP CRITICAL
      call pomp_critical_begin(d)
      structured block
      call pomp_critical_end(d)
!$OMP END CRITICAL
      call pomp_critical_exit(d)

      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)

      call pomp_atomic_enter(d)
!$OMP ATOMIC
      atomic expression
      call pomp_atomic_exit(d)
Automatic Analysis
• EXtensible PERformance Tool (EXPERT)
  • programmable, extensible, flexible performance property specification
  • based on event patterns
  • analyzes along three hierarchical dimensions
    – performance properties (general → specific)
    – dynamic call tree position
    – location (machine → node → process → thread)
• Done: fully functional demonstration prototype
• Work in Progress:
  – optimization / generalization
  – more performance properties
  – source code and time line displays
Expert Result Presentation
• Interconnected weighted tree browser
  • scalable, still accurate
• Each node has weight
  • Percentage of CPU allocation time
  • i.e. time spent in subtree of call tree
• Displayed weight depends on state of node
  • Collapsed (including weight of descendants)
  • Expanded (without weight of descendants)
• Displayed using
  • Color: makes hot spots (bottlenecks) easy to identify
  • Numerical value: detailed comparison
[Figure: example tree with node weights: 100 main, 60 bar, 10 main, 30 foo]
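The collapsed/expanded rule can be written down in a few lines of C. This sketch rests on one reading of the example numbers on this slide, namely that main itself accounts for 10 and its children bar and foo for 60 and 30, so that a collapsed main shows 100. The Node type and both functions are hypothetical and are not taken from EXPERT.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_CHILDREN = 4 };             /* small fixed fan-out for the sketch */

typedef struct Node {
    const char *name;
    int own;                           /* weight of the node itself (percent) */
    struct Node *child[MAX_CHILDREN];  /* unused slots are NULL */
    int expanded;                      /* 0 = collapsed, 1 = expanded */
} Node;

/* Weight of the node including all of its descendants. */
int inclusive_weight(const Node *n) {
    assert(n != NULL);
    int w = n->own;
    for (size_t i = 0; i < MAX_CHILDREN && n->child[i] != NULL; i++)
        w += inclusive_weight(n->child[i]);
    return w;
}

/* Collapsed nodes show the subtree total, expanded nodes only their own share. */
int displayed_weight(const Node *n) {
    assert(n != NULL);
    return n->expanded ? n->own : inclusive_weight(n);
}
```

Under that reading, expanding main moves 90 of its 100 points onto the two child rows, which is how the browser keeps the visible weights adding up to the same total.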
Performance Properties View
Main Problem: Idle Threads
Fine: User code
Fine: OpenMP + MPI
Dynamic Call Tree View
1st Optimization Opportunity
2nd Optimization Opportunity
3rd Optimization Opportunity