PAPI 3.0.8.1 on Blue Gene L Using network performance counters to layout tasks for improved performance

Presentation overview

Project objectives
PAPI explanation
Blue Gene L explanation
Current state of research

Project objectives

Upgrade PAPI on BG/L
Provide interface for network counters
Allow Lawrence Livermore National Lab users to also have access to PAPI
Use network counters to place tasks optimally on BG/L

PAPI – Intro

Courtesy of http://icl.cs.utk.edu/papi/

PAPI – Intro

PAPI is useful for profiling your own programs
Many tools are based on PAPI
  PapiEx – Command line measurement tool
  PerfSuite – Aggregate measurement and statistical profiling package and API
  HPCToolkit – Statistical profiling package
  Many more!

PAPI – Supported platforms

IBM – POWER3, 604, 604e, POWER4
Cray T3E, Cray X1
AMD – Athlon, Opteron
Intel – P1 to P4, Itanium I and II
UltraSparc I, II & III
MIPS R10K, R12K, R14K
Alpha

PAPI – Generic Interface

Call sequence for generic interface
  PAPI_library_init – Initialize memory for PAPI’s data structures
  PAPI_create_eventset – Create an empty list of events
  PAPI_add_event – Add events to be counted
  PAPI_start – Begin counting all events within the specified eventset
  PAPI_stop – Stop all counters and read their current values

PAPI – Events: Presets

Presets – list of predefined events implemented on all systems where they can be supported
Not all presets are available on every architecture (e.g. BG/L has no cache lower than L3 – thus L1 cache hit preset not applicable)
Native events form the basic building blocks for PAPI presets

PAPI – Events: Presets

Courtesy of http://icl.cs.utk.edu/papi/

PAPI – Events: Native

In addition to the predefined PAPI preset events, the PAPI library also exposes a majority of the events native to each platform
Native events can be added to eventsets in the same manner as presets

PAPI – Events: Native

PAPI – Internals

The array of eventsets is the main portion of PAPI's internals

PAPI – Other features

Multiplexing – If there are not enough hardware counters
Thread safe – Profiling is thread safe
Overflow detection – Hardware counters have limited space
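
Multiplexing is commonly done by time-slicing: groups of events take turns on the hardware counters, and each raw count is scaled up by the fraction of time its group was live. The sketch below shows only that scaling step. The function name and the slice numbers are made up for illustration, and this is not PAPI's internal code.

```python
def estimate_full_count(raw_count, active_slices, total_slices):
    """Scale a count seen during some time slices up to the whole run.

    Assumes the event rate is roughly constant over the run, which is
    the usual caveat of multiplexed measurements.
    """
    if active_slices <= 0 or active_slices > total_slices:
        raise ValueError("active_slices must be in 1..total_slices")
    return raw_count * total_slices / active_slices


# Hypothetical run: 3 event groups share the counters round-robin over
# 12 time slices, so each group is live for 4 of them.
estimate = estimate_full_count(raw_count=1000, active_slices=4, total_slices=12)
print(estimate)  # 3000.0
```

The estimate is only as good as the constant-rate assumption; short or bursty code regions can be badly misjudged.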

PAPI – PAPI2 vs PAPI3

PAPI 3 significantly reduced overheads for starting, stopping and reading the counters

Courtesy of http://icl.cs.utk.edu/papi/

PAPI – PAPI2 vs PAPI3

Better native event support in PAPI3
Better thread support in PAPI3
Overflow and Profiling enhancements in PAPI3
Myriad bug fixes and code cleanup in PAPI3

PAPI – PAPI2 vs PAPI3

Overlapping eventsets supported in PAPI2
Minor changes in the API – mostly dereferencing variables

Blue Gene L – Intro

65,536 nodes connected in 64 x 32 x 32 3D torus
Nodes made up of PowerPC 440 embedded processors
Smaller than most supercomputers
Consumes less power
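
To make the torus dimensions concrete, the sketch below checks the node count and maps a linear rank to torus coordinates. The x-fastest rank ordering and all function names are assumptions made for this illustration; the actual BG/L mapping may differ.

```python
DIMS = (64, 32, 32)  # torus extents from the slide: 64 x 32 x 32


def node_count(dims=DIMS):
    """Total number of nodes in the torus."""
    total = 1
    for extent in dims:
        total *= extent
    return total


def rank_to_coord(rank, dims=DIMS):
    """Map a linear rank to (x, y, z), with x varying fastest.

    The x-fastest ordering is an assumption made for this sketch.
    """
    x_dim, y_dim, z_dim = dims
    if not 0 <= rank < x_dim * y_dim * z_dim:
        raise ValueError("rank out of range")
    x = rank % x_dim
    y = (rank // x_dim) % y_dim
    z = rank // (x_dim * y_dim)
    return (x, y, z)


def torus_diameter(dims=DIMS):
    """Largest hop count between two nodes: half of each extent, summed."""
    return sum(extent // 2 for extent in dims)


print(node_count())          # 65536
print(rank_to_coord(65535))  # (63, 31, 31)
print(torus_diameter())      # 64
```

Because every dimension wraps around, no pair of nodes is more than 64 hops apart, even though the longest dimension has 64 nodes.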

Blue Gene L

Blue Gene L – Networks

3D torus network (node to node)
Tree network (broadcasts)

Blue Gene L – HW counters

48 universal performance counters
4 floating point unit counters
Counters are 32 bit – must use virtual counters to prevent overflow
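
One common way to build a virtual counter on top of a 32-bit hardware counter is to accumulate wrap-safe deltas into a wider total. The sketch below assumes the raw counter is read at least once per wrap period; the class name is hypothetical, and this is not necessarily how the BG/L software does it.

```python
WRAP = 1 << 32  # a 32-bit hardware counter wraps at 2**32


class VirtualCounter:
    """Extend a wrapping 32-bit counter into an unbounded running total."""

    def __init__(self, first_raw=0):
        self.last_raw = first_raw
        self.total = 0

    def update(self, raw):
        # Modular subtraction gives the right delta across a single wrap.
        delta = (raw - self.last_raw) % WRAP
        self.total += delta
        self.last_raw = raw
        return self.total


vc = VirtualCounter(first_raw=4_294_967_000)
vc.update(4_294_967_290)  # +290, no wrap yet
vc.update(10)             # hardware wrapped: +16 more
print(vc.total)           # 306
```

If two wraps happen between reads, one full period of 2**32 counts is silently lost, so the polling interval has to be shorter than the fastest possible wrap time.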

Blue Gene L – HW counters

Research – Overall goals

Network hardware counters are new
Use network counters to determine traffic between tasks
Try to optimize placement of tasks to minimize communication latency
Given counts and distances: cost = counts * distance. Minimize over all nodes
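
The cost model above can be sketched as a sum of counts * distance over all communicating task pairs, with distance measured in torus hops. Everything below is illustrative: the task names, the traffic counts and the exhaustive search are assumptions for a toy case, not the project's method, and brute force is only feasible for a handful of tasks.

```python
from itertools import permutations


def torus_distance(a, b, dims):
    """Hop count between coordinates a and b on a torus with wraparound."""
    return sum(min(abs(p - q), d - abs(p - q)) for p, q, d in zip(a, b, dims))


def layout_cost(traffic, placement, dims):
    """Sum of counts * distance over all communicating task pairs.

    traffic maps (task_i, task_j) -> message count;
    placement maps task -> node coordinate.
    """
    return sum(
        count * torus_distance(placement[i], placement[j], dims)
        for (i, j), count in traffic.items()
    )


def best_layout(traffic, tasks, nodes, dims):
    """Exhaustive search over placements; toy sizes only."""
    best = None
    for perm in permutations(nodes, len(tasks)):
        placement = dict(zip(tasks, perm))
        cost = layout_cost(traffic, placement, dims)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


# Hypothetical example: 4 tasks on a 4-node ring (a 4 x 1 x 1 torus).
dims = (4, 1, 1)
nodes = [(x, 0, 0) for x in range(4)]
tasks = ["A", "B", "C", "D"]
traffic = {("A", "C"): 100, ("B", "D"): 100, ("A", "B"): 1}

naive = dict(zip(tasks, nodes))  # tasks placed in rank order
naive_cost = layout_cost(traffic, naive, dims)
best_cost, best_placement = best_layout(traffic, tasks, nodes, dims)
print(naive_cost, best_cost)  # 401 201
```

In this toy case, putting the heavily communicating pairs on neighbouring nodes roughly halves the cost. On a full machine a heuristic search would be needed in place of the permutation loop.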

Research – Counting

First goal is to determine what is being counted

Research – Networks

For each MPI call – determine which network counters are being used
Tree is supposed to be for broadcasts
Torus is supposed to be for point to point communication
Ambiguities in the specification
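
One simple way to see which counters an MPI call touches is to snapshot all counters before and after the call and keep the ones that changed. The sketch below shows only the comparison step. The counter names and values are hypothetical, and how the snapshots are read on BG/L is left out.

```python
def changed_counters(before, after):
    """Return {name: delta} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if after[name] != before[name]
    }


# Hypothetical snapshots taken around a single point-to-point send.
before = {"torus_x_plus_packets": 120, "torus_y_plus_packets": 40, "tree_packets": 7}
after = {"torus_x_plus_packets": 184, "torus_y_plus_packets": 40, "tree_packets": 7}

delta = changed_counters(before, after)
print(delta)  # {'torus_x_plus_packets': 64}
```

Repeating this for each MPI call of interest would show whether broadcasts really use the tree and point to point traffic really uses the torus.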

Research – Future decisions

How to profile a target application
  Manually insert PAPI instrumentation: a lot of work
  Instrument binaries with counting code
What information to store
  All counts on each node: a lot of data
  Sample of all nodes: not as accurate (what if the tasks behave / communicate differently?)

Research – Future decisions

How to use collected information
  Profile an application to obtain counter feedback to determine optimized static task layout
  Dynamically migrate tasks in response to counters