
John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis



Page 1: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

John Curreri

Seth Koehler

Rafael Garcia

Project F2: Application Performance Analysis

Page 2: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Outline

Introduction
  Application mappers
  Historical background
  Performance analysis today
HLL runtime performance analysis tool
  Motivation
  Instrumentation
  Framework
  Visualization
Case study
  Molecular Dynamics
Conclusions & References

Page 3: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Application Mappers

Translates C code to HDL
  Higher level of abstraction
  Usually a subset of ANSI C: no pointers, no standard C libraries for FPGA
  HDL is generated as a project file for Xilinx or Altera tools
Built-in communication
  Separate C source files are made for the CPU & FPGA
  Similar communication function calls between CPU & FPGA (sketched below)
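To make the CPU/FPGA symmetry concrete, the sketch below pairs a CPU-side and an FPGA-side process using Impulse C-style stream calls (the same co_stream API that appears in the MD kernel later in the case study). It is only a sketch under assumptions: the process names, the squaring computation, the co.h header, and the exact open/read return conventions are illustrative, not prescribed by any particular mapper.

/* Sketch only: paired CPU and FPGA processes using Impulse C-style streams.
   Header name, types, and exact signatures are assumptions; consult the tool's API. */
#include "co.h"

#define N 1024

/* CPU-side process: sends inputs and collects results */
void producer_cpu(co_stream to_fpga, co_stream from_fpga)
{
    int32 value, result;

    co_stream_open(to_fpga, O_WRONLY, INT_TYPE(32));
    for (value = 0; value < N; value++)
        co_stream_write(to_fpga, &value, sizeof(value));    /* send work to the FPGA */
    co_stream_close(to_fpga);                               /* closing signals end of stream */

    co_stream_open(from_fpga, O_RDONLY, INT_TYPE(32));
    while (co_stream_read(from_fpga, &result, sizeof(result)) == co_err_none) {
        /* consume results computed in hardware */
    }
    co_stream_close(from_fpga);
}

/* FPGA-side process: the same style of calls, but mapped to HDL by the application mapper */
void worker_fpga(co_stream from_cpu, co_stream to_cpu)
{
    int32 x, y;

    co_stream_open(from_cpu, O_RDONLY, INT_TYPE(32));
    co_stream_open(to_cpu, O_WRONLY, INT_TYPE(32));
    while (co_stream_read(from_cpu, &x, sizeof(x)) == co_err_none) {
        y = x * x;                                          /* placeholder computation */
        co_stream_write(to_cpu, &y, sizeof(y));
    }
    co_stream_close(from_cpu);
    co_stream_close(to_cpu);
}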

Page 4: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Application Mappers (continued)

Computational parallelism
  Pipelining of loops: for(), while(), etc. (see the sketch below)
  Use of library functions: HDL-coded functions called from the HLL (FFT, floating-point operations)
  Replication of functions defined in hardware
Types of communication
  DMA transfers: efficient transfer of large chunks of data
  Stream transfers: steady flow of data, buffered to absorb transfer-rate changes
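As an illustration of loop pipelining at the HLL level, the sketch below requests a pipelined inner loop with the same #pragma CO PIPELINE directive that appears in the case-study kernel; the accumulation work, names, and stream setup are hypothetical.

/* Sketch only: an FPGA process whose inner loop is pipelined.
   Only the pragma placement mirrors the tool's usage; the rest is placeholder. */
void accumulate_fpga(co_stream in, co_stream out)
{
    int32 i, sample, sum = 0;

    co_stream_open(in, O_RDONLY, INT_TYPE(32));
    co_stream_open(out, O_WRONLY, INT_TYPE(32));
    for (i = 0; i < 1024; i++) {
#pragma CO PIPELINE
        /* once the pipeline fills, a new iteration can start each cycle */
        co_stream_read(in, &sample, sizeof(sample));
        sum += sample;
    }
    co_stream_write(out, &sum, sizeof(sum));
    co_stream_close(in);
    co_stream_close(out);
}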

Page 5: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Introduction to the F2 Project

Goals for performance analysis in RC
  Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
Motivations
  Complex systems are difficult to analyze by hand
  Manual instrumentation is unwieldy
  Difficult to make sense of large volumes of raw data
  Tools can help quickly locate performance problems
    Collect and view performance data with little effort
    Analyze performance data to indicate potential bottlenecks
    A staple in HPC, limited in HPEC, and virtually non-existent in RC
Challenges
  How do we expand the notion of software performance analysis into the software-hardware realm of RC?
  What are common bottlenecks for dual-paradigm applications?
  What techniques are necessary to detect performance bottlenecks?
  How do we analyze and present these bottlenecks to a user?

Page 6: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Historical Background

gettimeofday() and printf()
  Very cumbersome, repetitive, manual, not optimized for speed (a sketch of this manual approach appears at the end of this page)
Profilers date back to the 1970s with "prof" (gprof, 1982)
  Provide the user with information about application behavior
    Percentage of time spent in a function
    How often a function calls another function
Simulators / Emulators
  Too slow or too inaccurate
  Require significant development time
PAPI (Performance Application Programming Interface)
  Portable interface to hardware performance counters on modern CPUs
  Provides information about caches, CPU functional units, main memory, and more

Processor        HW counters
UltraSparc II    2
Pentium 3        2
AMD Athlon       4
IA-64            4
POWER4           8
Pentium 4        18*

* Source: Wikipedia
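To show why the gettimeofday()/printf() approach mentioned above was so cumbersome, here is a standard-C sketch of manual instrumentation; compute_kernel() is a hypothetical stand-in for the region being timed, and every such region had to be wrapped this way by hand.

/* Manual timing sketch: standard C, no tool support. */
#include <stdio.h>
#include <sys/time.h>

static void compute_kernel(void)
{
    /* hypothetical work being timed */
}

int main(void)
{
    struct timeval start, stop;
    double elapsed;

    gettimeofday(&start, NULL);
    compute_kernel();
    gettimeofday(&stop, NULL);

    elapsed = (stop.tv_sec - start.tv_sec)
            + (stop.tv_usec - start.tv_usec) / 1e6;
    printf("compute_kernel: %.6f s\n", elapsed);   /* repeated for every region of interest */
    return 0;
}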

Page 7: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Performance Analysis Today

[Figure: performance analysis workflow — the original application is instrumented, the instrumented application is executed in the execution environment and measured, the measured data file is analyzed (automatically or manually) and presented as visualizations and potential bottlenecks, and the application is optimized, yielding a modified and eventually optimized application]

What does performance analysis look like today?

Goals
  Low impact on application behavior
  High-fidelity performance data
  Flexible
  Portable
  Automated
  Concise visualization
Techniques
  Event-based, sample-based
  Profile, trace

Above all, we want to understand application behavior in order to locate performance problems!

Page 8: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Related Research and Tools: Parallel Performance Wizard (PPW)

Open-source tool developed by the UPC Group at the University of Florida
Performance analysis and optimization (PGAS* systems and MPI support)
Performance data can be analyzed for bottlenecks
Offers several ways of exploring performance data
  Graphs and charts to quickly view high-level performance information at a glance [right, top]
  In-depth execution statistics for identifying communication and computational bottlenecks
  Interacts with popular trace viewers (e.g. Jumpshot [right, bottom]) for detailed analysis of trace data
  Comprehensive support for correlating performance back to original source code

* Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.

Page 9: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Motivation for RC Performance Analysis

Dual-paradigm applications gaining more traction in HPC and HPEC
  Design flexibility allows best use of FPGAs and traditional processors
  Drawback: more challenging to design applications for dual-paradigm systems
  Parallel application tuning and FPGA core debugging are hard enough!

[Figure: debug and performance-analysis difficulty increases from sequential to parallel to dual-paradigm applications]

No existing holistic solutions for analyzing dual-paradigm applications
  Software-only views leave out low-level details
  Hardware-only views provide incomplete performance information
  Need complete system view for effective tuning of the entire application

Page 10: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Motivation for RC Performance Analysis

Q: Is my runtime load-balancing strategy working? A: ???

[Figure: ChipScope waveform]

Page 11: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Motivation for RC Performance Analysis

Q: How well is my core's pipelining strategy working?

A: ???

gprof output (×N, one for each node!)

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
51.52      2.55     2.55        5   510.04   510.04  USURP_Reg_poll
29.41      4.01     1.46       34    42.82    42.82  USURP_DMA_write
11.97      4.60     0.59       14    42.31    42.31  USURP_DMA_read
 4.06      4.80     0.20        1   200.80   200.80  USURP_Finalize
 2.23      4.91     0.11        5    22.09    22.09  localp
 1.22      4.97     0.06        5    12.05    12.05  USURP_Load
 0.00      4.97     0.00       10     0.00     0.00  USURP_Reg_write
 0.00      4.97     0.00        5     0.00     0.00  USURP_Set_clk
 0.00      4.97     0.00        5     0.00   931.73  rcwork
 0.00      4.97     0.00        1     0.00     0.00  USURP_Init

Page 12: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Instrumentation Level

[Table: disadvantages by instrumentation level (high-level language, hardware description language, FPGA bitstream) — requires time-consuming place and route; requires FPGA-specific instrumentation; application mapping disturbed by instrumentation; constrained to HLL timing functions (if any)]

High-level language (HLL)
  Requires HLL timing functions
  Application mapping disturbed by instrumentation
Hardware Description Language (HDL)
  Portable between HLL types and FPGA families
  Selected level for instrumentation
FPGA bit stream
  Requires targeting a specific FPGA family
  Instrument in minutes

Page 13: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Instrumentation Selection

Automated - Computation
  State machines
    Used for preserving execution order in C functions
    Used to control the state of pipelines
  Control and status signals
    Used by library functions
Automated - Communication
  Control and status signals
    Used for streaming communication
    Used for DMA transfers
Application specific
  Monitoring variables for meaningful values

Page 14: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Measurement Techniques

Profiling
  Counters record the number of occurrences of an event
  Low overhead
  Normally uses registers; block RAM can be used for state machines
Tracing
  Timestamps indicate when an event occurred
  Data is associated with each event
  Greater overhead: uses memory to store timestamps and data
  Greater fidelity: allows reconstruction of the sequence of events (a software analogue is sketched below)

[Figure: trace timeline — numbered events on CPU-0 plotted over time, as visualized by Jumpshot*]

* Zaki, O., Lusk, E., Gropp, W., and Swider, D. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl., 13(3):277-288, Aug. 1999.
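To make the profiling-versus-tracing trade-off concrete, here is a software analogue in plain C (not the hardware measurement module itself); the buffer depth, the event hook, and all names are hypothetical.

/* Conceptual sketch: a profile counter versus a trace buffer for the same event. */
#include <stdint.h>

#define TRACE_DEPTH 1024

static uint32_t profile_count;              /* profiling: a single counter (a register in hardware) */

typedef struct {
    uint64_t timestamp;                     /* when the event occurred (cycle count in hardware) */
    uint32_t data;                          /* value associated with the event */
} trace_record;

static trace_record trace_buf[TRACE_DEPTH]; /* tracing: memory (block RAM in hardware) */
static uint32_t trace_len;

void on_event(uint64_t now, uint32_t data)
{
    profile_count++;                        /* low overhead, but ordering information is lost */

    if (trace_len < TRACE_DEPTH) {          /* higher overhead and storage, but the event */
        trace_buf[trace_len].timestamp = now;   /* sequence can be reconstructed afterwards */
        trace_buf[trace_len].data = data;
        trace_len++;
    }
}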

Page 15: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Hardware Measurement Module

[Figure: hardware measurement module — a signal analysis module compares instrumented signal values against trigger conditions; triggers drive profile counters (0 through P-1) and trace data buffers stamped by a cycle counter; module statistics and module control service performance-data requests, and sample control moves data from on-chip storage to onboard memory and then offboard]

Page 16: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Adding Instrumentation & Measurement

Instrumentation steps: uninstrumented project → instrumentation added to C source → C source for FPGA mapped to HDL → instrumentation added to HDL → implement hardware

[Figure: instrumented HLL tool flow — on the CPU(s), the application C source gains an HLL API wrapper, instrumentation, and measurement extraction in each process/thread; on the FPGA(s), the application HDL gains an HLL hardware wrapper, instrumented signals, and the hardware measurement module, with loopback C source and HDL connecting the two; the software-hardware mapping, software compilation, and hardware implementation then produce the finished design]

Page 17: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Reverse Mapping & Analysis

Mapping of HDL data back to HLL
  Variable name-matching
  Observing scope and other patterns
Bottleneck detection
  Load-balancing of replicated functions (see the sketch below)
  Monitoring for pipeline stalls
  Detecting streaming communication stalls
  Finding shared-memory contention
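As one example of the kind of analysis involved, the sketch below flags load imbalance among replicated hardware functions from per-replica work counters; it is our simplified illustration, not the tool's detection algorithm, and the threshold is an arbitrary parameter.

/* Illustrative load-balance check over per-replica work counters. */
#include <stdio.h>
#include <stdint.h>

void check_load_balance(const uint64_t *work_count, int replicas, double threshold)
{
    uint64_t max = 0, total = 0;
    int i;

    for (i = 0; i < replicas; i++) {
        total += work_count[i];
        if (work_count[i] > max)
            max = work_count[i];
    }

    if (replicas > 0 && total > 0) {
        double avg = (double)total / replicas;
        /* one replica doing far more work than average suggests an unbalanced distributor */
        if ((double)max / avg > threshold)
            printf("possible load imbalance: max/avg = %.2f\n", (double)max / avg);
    }
}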

Page 18: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Example RC Visualization

Need unified visualizations that accentuate important statistics
Must be scalable to many nodes

[Figure: example system-wide visualization — CPUs 0-5, FPGAs 0-2, the network, and the CPU interconnect annotated with measured throughputs and utilization percentages (ranging from 0 MB/s at 0% to 2.50 GB/s at 100%), alongside a throughput-vs-time chart, a time breakdown (idle 75%, phase 1 9%, phase 2 16%), and highlighted potential bottlenecks]

Page 19: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Molecular Dynamics Simulation

Simulates interactions between atoms and molecules at discrete time intervals
Models forces
  Newtonian physics
  Van der Waals forces
  Other interactions
Tracks each molecule's position and velocity
  X, Y and Z directions (a generic time-step sketch follows)

http://en.wikipedia.org/wiki/Molecular_dynamics
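For orientation only, here is a generic sketch of one simulation time step in plain C (simple Euler integration); it is not the case-study kernel, and the force computation is assumed to have been done elsewhere.

/* Generic MD time step: advance velocity and position in X, Y, Z from the net force. */
typedef struct { double x, y, z; } vec3;

void md_step(vec3 *pos, vec3 *vel, const vec3 *force,
             const double *mass, int n, double dt)
{
    int i;
    for (i = 0; i < n; i++) {
        /* a = F/m, then v += a*dt and r += v*dt */
        vel[i].x += force[i].x / mass[i] * dt;
        vel[i].y += force[i].y / mass[i] * dt;
        vel[i].z += force[i].z / mass[i] * dt;
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}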

Page 20: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Case Study Setup

Impulse C v2.2
XD1000 platform
  Opteron 2.2 GHz
  XD1000 module with Altera Stratix-II EP2S180 FPGA in the second processor socket
MD communication architecture
  Chunks of MD data are read from SRAM
  Data is streamed to multiple MD kernels that are pipelined
  Results are stored back to SRAM

[Figure: MD communication architecture — SRAM storage feeds an input memory access block and distributor, which streams data over input streams 1-16 to MD kernels 1-16; output streams 1-16 feed a collector and output memory access block that writes results back to SRAM]

Page 21: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Impulse-C Profile Percentages

Output stream of Molecular Dynamics kernel is a bottleneck.

void MD_kernel(co_stream in, co_stream out)
{
    …
    for (t = 0; t < 16384; t++) {
        co_stream_read(in, &x, …);
        co_stream_read(in, &y, …);
        co_stream_read(in, &z, …);
        …
        for (i = 0; i < 1024; i++) {  // Perform MD calculations
#pragma CO PIPELINE
            …
        }
        co_stream_write(out, &x, …);
        co_stream_write(out, &y, …);
        co_stream_write(out, &z, …);
    }
    …
}

[Figure annotation: measured percent runtime per code region — 51.67%, 0.09%, 0.26%, and 47.98% (remaining regions not shown)]

Page 22: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Stream buffer size was increased by 32 times allowing application speedup to increase from 6.2 to 7.8 vs. serial baseline.

[Figure: MD kernel runtime — FPGA runtime (seconds, 0 to 0.8) versus stream buffer size (128, 256, 512, 1024, 2048, 4096 bytes), broken down into pipeline, output stream, and other time]
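In Impulse C the stream buffer depth is fixed where the stream is created in the configuration code, so the fix took roughly the form sketched below; the names and the base depth are assumptions, and only the 32x increase comes from the measured results above.

/* Sketch only: enlarging the output stream buffer at stream-creation time. */
#define BASE_DEPTH 32                       /* hypothetical original depth, in stream elements */

void configure_streams(void)
{
    co_stream output_stream;

    /* before: output_stream = co_stream_create("output_stream", INT_TYPE(32), BASE_DEPTH); */
    /* after: a 32x deeper buffer lets the pipelined kernel keep producing while results drain */
    output_stream = co_stream_create("output_stream", INT_TYPE(32), 32 * BASE_DEPTH);

    (void)output_stream;                    /* real configuration code also wires the stream to processes */
}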

Page 23: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Performance Analysis Overhead

EP2S180                        Original             Modified             Difference
Logic used (of 143520)         126252 (87.97%)      131851 (91.87%)      +5599 (+3.90%)
Comb. ALUTs (of 143520)        100344 (69.92%)      104262 (72.65%)      +3918 (+2.73%)
Registers (of 143520)          104882 (73.08%)      110188 (76.78%)      +5306 (+3.70%)
Block RAM (of 9383040 bits)    3437568 (36.64%)     3557376 (37.91%)     +119808 (+1.27%)
Frequency (MHz)                80.57                78.44                -2.13 (-2.64%)

Additional FPGA resource usage: less than 4%
Frequency reduction: less than 3%

Page 24: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

Conclusions

Developed prototype HLL-oriented RC performance analysis tool
  First such runtime performance analysis tool framework (per extensive literature review)
  Tracing & profiling available; automated instrumentation in progress
Application case study performed
  Observed minimal overhead from tool
  Speedup achieved due to performance analysis
Future work
  SRC support, automated instrumentation and analysis, integration with software PAT, further case studies

Page 25: John Curreri Seth Koehler Rafael Garcia Project F2: Application Performance Analysis

References

Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.

Sameer S. Shende and Allen D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications, 20(2):287-311, May 2006.

C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.

Adam Leko and Max Billingsley III. Parallel Performance Wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.

S. Koehler, J. Curreri, and A. George. "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute (RSSI), Urbana, IL, July 17-20, 2007.

J. Curreri, S. Koehler, B. Holland, and A. George. "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.