
MIT OpenCourseWare http://ocw.mit.edu

6.189 Multicore Programming Primer, January (IAP) 2007

Please use the following citation format:

Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.

Note: Please use the actual date you accessed this material in your citation.

For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


6.189 IAP 2007

Lecture 10

Performance Monitoring and Optimizations

1 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 6.189 IAP 2007 MIT


Review: Keys to Parallel Performance

● Coverage or extent of parallelism in algorithm
  – Amdahl's Law

● Granularity of partitioning among processors
  – Communication cost and load balancing

● Locality of computation and communication
  – Communication between processors or between processors and their memories
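As a reminder of what "coverage" means in practice, the sketch below evaluates Amdahl's Law in its usual form, speedup = 1 / ((1 - p) + p / n). The function name `amdahl_speedup` and its parameters are illustrative and do not come from the lecture.

```c
#include <assert.h>

/* Amdahl's Law: speedup on n processors when a fraction p of the
   original running time can be parallelized (0 <= p <= 1).
   The name and signature are hypothetical, chosen for this sketch. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.5, for example, the speedup can never reach 2 regardless of how many processors are used.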


Communication Cost Model

$$C = f \cdot \left( o + l + \frac{n/m}{B} + t - \text{overlap} \right)$$

where
  f        frequency of messages
  o        overhead per message (at both ends)
  l        network delay per message
  n        total data sent
  m        number of messages
  B        bandwidth along path (determined by network)
  t        cost induced by contention per message
  overlap  amount of latency hidden by concurrency with computation
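A minimal sketch of how the model can be evaluated, assuming it reads C = f * (o + l + (n/m)/B + t - overlap). The struct `comm_params` and the function `comm_cost` are hypothetical names; the fields follow the slide's symbols, and units are whatever the caller uses consistently.

```c
#include <assert.h>

/* Hypothetical parameter bundle; field names follow the slide's symbols. */
struct comm_params {
    double f;        /* frequency of messages */
    double o;        /* overhead per message (at both ends) */
    double l;        /* network delay per message */
    double n;        /* total data sent */
    double m;        /* number of messages */
    double B;        /* bandwidth along path */
    double t;        /* cost induced by contention per message */
    double overlap;  /* latency hidden by concurrency with computation */
};

/* Evaluates C = f * (o + l + (n/m)/B + t - overlap). */
double comm_cost(const struct comm_params *p)
{
    assert(p->m > 0.0 && p->B > 0.0);
    return p->f * (p->o + p->l + (p->n / p->m) / p->B + p->t - p->overlap);
}
```

Raising `overlap` or lowering `f` (fewer, larger messages) are the two levers the following slides explore.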


Overlapping Communication with Computation

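[Figure: two timelines of Get Data and Compute. Without overlap, the CPU is idle during Get Data and memory is idle during Compute, with a synchronization point between the two; with overlap, Get Data and Compute proceed concurrently.]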


Limits in Pipelining Communication

● Computation to communication ratio limits performance gains from pipelining

[Figure: pipelined timeline of alternating Get Data and Compute phases]

● Where else to look for performance?
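The limit can be made concrete with a simple two-stage timing model. This is a sketch under the assumption that every block takes the same `get` and `compute` time and that the fetch of one block can fully overlap the computation on the previous one; the function names are illustrative.

```c
#include <assert.h>

/* Time for k blocks when each fetch must finish before its compute starts. */
double serial_time(int k, double get, double compute)
{
    assert(k >= 1);
    return k * (get + compute);
}

/* Time for k blocks when the fetch of block i+1 overlaps the compute of
   block i: one exposed fetch, k-1 overlapped stages that each take the
   longer of the two phases, and one final compute. */
double pipelined_time(int k, double get, double compute)
{
    assert(k >= 1);
    double stage = get > compute ? get : compute;
    return get + (k - 1) * stage + compute;
}
```

Under this model the gain is largest when `get` and `compute` are equal, approaching 2x for large k, and shrinks as either phase dominates the other.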


Artifactual Communication

● Determined by program implementation and interactions with the architecture

● Examples:
  – Poor distribution of data across distributed memories
  – Unnecessarily fetching data that is not used
  – Redundant data fetches


Lessons From Uniprocessors

● In uniprocessors, CPU communicates with memory

● Loads and stores are to uniprocessors as "get" and "put" are to distributed memory multiprocessors

● How is communication overlap enhanced in uniprocessors?
  – Spatial locality
  – Temporal locality


Spatial Locality

● CPU asks for data at address 1000

● Memory sends data at address 1000 … 1064
  – Amount of data sent depends on architecture parameters such as the cache block size

● Works well if CPU actually ends up using data from 1001, 1002, …, 1064

● Otherwise wasted bandwidth and cache capacity
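A common illustration of this effect, not taken from the slides, is the traversal order of a matrix stored in row-major order. Both functions below compute the same sum; the function names and the flat-array layout are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a rows x cols matrix stored in row-major order, walking memory
   contiguously so that every element of a fetched cache block is used. */
double sum_row_major(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double s = 0.0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Same sum, walking down the columns: consecutive accesses are cols
   elements apart, so for large matrices most of each fetched block
   goes unused before it is evicted. */
double sum_column_major(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    double s = 0.0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

How large the difference is depends on the matrix size and the cache block size of the machine.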


Temporal Locality

● Main memory access is expensive
● Memory hierarchy adds small but fast memories (caches) near the CPU
  – Memories get bigger as distance from CPU increases

[Figure: memory hierarchy, from the CPU outward: cache (level 1), cache (level 2), main memory]

● CPU asks for data at address 1000
● Memory hierarchy anticipates more accesses to same address and stores a local copy

● Works well if CPU actually ends up using data from 1000 over and over and over …

● Otherwise wasted cache capacity


Reducing Artifactual Costs in Distributed Memory Architectures

● Data is transferred in chunks to amortize communication cost
  – Cell: DMA gets up to 16K
  – Usually get a contiguous chunk of memory

● Spatial locality
  – Computation should exhibit good spatial locality characteristics

● Temporal locality
  – Reorder computation to maximize use of data fetched


6.189 IAP 2007

Single Thread Performance: the last frontier in the search for performance?


Single Thread Performance

● Tasks mapped to execution units (threads)
● Threads run on individual processors (cores)

[Figure: execution timeline alternating sequential and parallel phases; finish line: sequential time + longest parallel time]

● Two keys to faster execution
  – Load balance the work among the processors
  – Make execution on each processor faster
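The "finish line" rule can be written down directly. The sketch below assumes one sequential phase and one parallel phase with a known time per core; `finish_time` is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Finish line = sequential time + longest parallel time.
   parallel[i] is the time core i spends in the parallel phase. */
double finish_time(double sequential, const double *parallel, size_t ncores)
{
    assert(parallel != NULL && ncores > 0);
    double longest = parallel[0];
    for (size_t i = 1; i < ncores; i++)
        if (parallel[i] > longest)
            longest = parallel[i];
    return sequential + longest;
}
```

With 12 units of parallel work on three cores, an uneven split of {3, 5, 4} finishes later than an even split of {4, 4, 4}, which is the load-balancing point; shrinking every entry is the "faster on each processor" point.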


Understanding Performance

● Need some way of measuring performance
  – Coarse grained measurements

    % gcc sample.c
    % time a.out
    2.312u 0.062s 0:02.50 94.8%
    % gcc sample.c -O3
    % time a.out
    1.921u 0.093s 0:02.03 99.0%

  – … but did we learn much about what's going on?

    #define N (1 << 23)
    #define T (10)
    #include <stdio.h>
    #include <string.h>
    double a[N], b[N];

    /* Clears the array one element at a time. */
    void cleara(double a[N]) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = 0;
        }
    }

    int main() {
        double s = 0, s2 = 0;
        int i, j;
        for (j = 0; j < T; j++) {
            for (i = 0; i < N; i++) {
                b[i] = 0;
            }
            cleara(a);
            memset(a, 0, sizeof(a));

            /* record start time */
            for (i = 0; i < N; i++) {
                s += a[i] * b[i];
                s2 += a[i] * a[i] + b[i] * b[i];
            }
            /* record stop time */
        }
        printf("s %f s2 %f\n", s, s2);
    }


Measurements Using Counters

● Increasingly possible to get accurate measurements using performance counters
  – Special registers in the hardware to measure events

● Insert code to start, read, and stop counter
  – Measure exactly what you want, anywhere you want
  – Can measure communication and computation duration
  – But requires manual changes
  – Monitoring nested scopes is an issue
  – Heisenberg effect: counters can perturb execution time

[Figure: timeline with clear/start and stop markers around the measured region]
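The start/stop pattern looks roughly like the sketch below. It uses the standard C `clock()` function as a stand-in for a hardware counter, which is an assumption of this example and not what the lecture's tools do; `time_region` and `clear_1024` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Returns the CPU seconds consumed by one call to fn(arg).
   clock() stands in for a hardware performance counter here. */
double time_region(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    clock_t start = clock();   /* clear/start */
    fn(arg);                   /* region being measured */
    clock_t stop = clock();    /* stop */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Example region: clears a buffer of 1024 doubles. */
void clear_1024(void *arg)
{
    double *a = arg;
    for (size_t i = 0; i < 1024; i++)
        a[i] = 0.0;
}
```

A region this small finishes in less than the resolution of `clock()`, which is one reason hardware counters are preferred for fine-grained measurements.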


Dynamic Profiling

● Event-based profiling
  – Interrupt execution when an event counter reaches a threshold

● Time-based profiling
  – Interrupt execution every t seconds

● Works without modifying your code
  – Does not require that you know where problem might be
  – Supports multiple languages and programming models
  – Quite efficient for appropriate sampling frequencies


Counter Examples

● Cycles (clock ticks)
● Pipeline stalls
● Cache hits
● Cache misses
● Number of instructions
● Number of loads
● Number of stores
● Number of floating point operations
● …


Useful Derived Measurements

● Processor utilization
  – Cycles / Wall Clock Time

● Instructions per cycle
  – Instructions / Cycles

● Instructions per memory operation
  – Instructions / (Loads + Stores)

● Average number of instructions per load miss
  – Instructions / L1 Load Misses

● Memory traffic
  – (Loads + Stores) * Lk Cache Line Size

● Bandwidth consumed
  – (Loads + Stores) * Lk Cache Line Size / Wall Clock Time

● Many others
  – Cache miss rate
  – Branch misprediction rate
  – …
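A short sketch of how raw counter values turn into these derived measurements. The `counters` struct and the function names are hypothetical; a real tool would fill the fields from hardware counters.

```c
#include <assert.h>

/* Hypothetical bundle of raw counter readings. */
struct counters {
    unsigned long long cycles;
    unsigned long long instructions;
    unsigned long long loads;
    unsigned long long stores;
    unsigned long long l1_load_misses;
};

/* Instructions / Cycles */
double instructions_per_cycle(const struct counters *c)
{
    assert(c->cycles > 0);
    return (double)c->instructions / (double)c->cycles;
}

/* Instructions / (Loads + Stores) */
double instructions_per_memory_op(const struct counters *c)
{
    assert(c->loads + c->stores > 0);
    return (double)c->instructions / (double)(c->loads + c->stores);
}

/* Instructions / L1 Load Misses */
double instructions_per_load_miss(const struct counters *c)
{
    assert(c->l1_load_misses > 0);
    return (double)c->instructions / (double)c->l1_load_misses;
}

/* (Loads + Stores) * cache line size, in bytes */
double memory_traffic_bytes(const struct counters *c, unsigned line_size)
{
    return (double)(c->loads + c->stores) * (double)line_size;
}
```

Dividing `memory_traffic_bytes` by the wall clock time gives the bandwidth consumed.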


Common Profiling Workflow

[Figure: workflow diagram. The compiler turns application source into binary object code; a run (which profiles execution) produces a performance profile; binary analysis and source correlation are used to interpret the profile.]


Popular Runtime Profiling Tools

● GNU gprof
  – Widely available with UNIX/Linux distributions

    gcc -O2 -pg foo.c -o foo
    ./foo
    gprof foo

● HPC Toolkit
  – http://www.hipersoft.rice.edu/hpctoolkit/

● PAPI
  – http://icl.cs.utk.edu/papi/

● VTune
  – http://www.intel.com/cd/software/products/asmo-na/eng/vtune/

● Many others


GNU gprof

MPEG-2 decoder (reference implementation)

% ./mpeg2decode -b mei16v2.m2v -f -r
(-r uses double precision inverse DCT)

  %    cumulative    self               self      total
 time    seconds    seconds    calls   ns/call   ns/call   name
90.48      0.19       0.19      7920  23989.90  23989.90   Reference_IDCT
 4.76      0.20       0.01      2148   4655.49   4655.49   Decode_MPEG2_Intra_Block

% ./mpeg2decode -b mei16v2.m2v -f
(uses fast integer based inverse DCT instead)

66.67      0.02       0.02      8238   2427.77   2427.77   form_component_prediction
33.33      0.03       0.01     63360    157.83    157.83   idctcol


HPC Toolkit

% hpcrun -e PAPI_TOT_CYC:499997 -e PAPI_L1_LDM \
         -e PAPI_FP_INS -e PAPI_TOT_INS mpeg2decode -- …

  -e PAPI_TOT_CYC:499997   Profile the "total cycles" using period 499997
  -e PAPI_L1_LDM           Profile the "L1 data cache load misses" using the default period
  -e PAPI_FP_INS           Profile the "floating point instructions" using default period
  -e PAPI_TOT_INS          Profile the "total instructions" using the default period

Running this command on a machine "sloth" produced a data file mpeg2dec.PAPI_TOT_CYC-etc.sloth.1234


Interpreting a Profile

% hpcprof -e mpeg2dec mpeg2dec.PAPI_TOT_CYC-etc.sloth.1234

Columns correspond to the following events [event:period (events/sample)]

  PAPI_TOT_CYC:499997 - Total cycles (698 samples)
  PAPI_L1_LDM:32767   - Level 1 load misses (27 samples)
  PAPI_FP_INS:32767   - Floating point instructions (789 samples)
  PAPI_TOT_INS:32767  - Instructions completed (4018 samples)

Load Module Summary:

  91.7%   74.1%  100.0%   84.1%   /home/guru/mpeg2decode
   8.3%   25.9%    0.0%   15.9%   /lib/libc-2.2.4.so

Function Summary:

...

Line summary:

...


hpcprof Annotated Source File

    #define N (1 << 23)
    #define T (10)
    #include <string.h>
    double a[N], b[N];
    void cleara(double a[N]) {
        int i;
        for (i = 0; i < N; i++) {
            a[i] = 0;
        }
    }
    int main() {
        double s = 0, s2 = 0; int i, j;
        for (j = 0; j < T; j++) {
            for (i = 0; i < N; i++) {
                b[i] = 0;
            }
            cleara(a);
            memset(a, 0, sizeof(a));
            for (i = 0; i < N; i++) {
                s += a[i] * b[i];
                s2 += a[i] * a[i] + b[i] * b[i];
            }
        }
        printf("s %f s2 %f\n", s, s2);
    }


hpcviewer Screenshot


Performance in Uniprocessors

time = compute + wait

● Instruction level parallelism
  – Multiple functional units, deeply pipelined, speculation, ...

● Data level parallelism
  – SIMD: short vector instructions (multimedia extensions)
      – Hardware is simpler, no heavily ported register files
      – Instructions are more compact
      – Reduces instruction fetch bandwidth

● Complex memory hierarchies
  – Multiple level caches, many outstanding misses, prefetching, …


SIMD

● Single Instruction, Multiple Data
● SIMD registers hold short vectors
● Instruction operates on all elements in SIMD register at once

Scalar code:
    for (int i = 0; i < n; i += 1) {
        c[i] = a[i] + b[i]
    }

Vector code:
    for (int i = 0; i < n; i += 4) {
        c[i:i+3] = a[i:i+3] + b[i:i+3]
    }

[Figure: a, b, and c held one element at a time in scalar registers versus four elements at a time in SIMD registers]


SIMD Example

    for (int i = 0; i < n; i += 1) {
        c[i] = a[i] + b[i]
    }

[Figure: dataflow graph in which LOAD a[i] and LOAD b[i] feed ADD, which feeds STORE c[i]]

    Cycle   Slot 1   Slot 2
    1       LOAD     LOAD
    2       ADD
    3       STORE
    …       …        …

Estimated cycles for loop:
● Scalar loop: n iterations ∗ 3 cycles/iteration


SIMD Example

    for (int i = 0; i < n; i += 4) {
        c[i:i+3] = a[i:i+3] + b[i:i+3]
    }

[Figure: dataflow graph in which VLOAD a[i:i+3] and VLOAD b[i:i+3] feed VADD, which feeds VSTORE c[i:i+3]]

    Cycle   Slot 1   Slot 2
    1       VLOAD    VLOAD
    2       VADD
    3       VSTORE
    …       …        …

Estimated cycles for loop:
● Scalar loop: n iterations ∗ 3 cycles/iteration
● SIMD loop: n/4 iterations ∗ 3 cycles/iteration
● Speedup: 4x
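The slide's `c[i:i+3]` notation is not C. As a sketch, the same loop can be written in portable C by processing four adjacent elements per iteration and finishing any leftover elements in a scalar loop. Whether the compiler actually emits SIMD instructions for the main loop depends on the compiler and target; `vector_add` is an illustrative name, and the remainder loop is an addition the slide does not need because it assumes n is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* c[i] = a[i] + b[i], processed in groups of four adjacent elements. */
void vector_add(float *c, const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;

    /* Main loop: one group of 4 per iteration, mirroring
       c[i:i+3] = a[i:i+3] + b[i:i+3]. */
    for (; i + 4 <= n; i += 4) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```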


SIMD in Major ISAs

    Instruction Set   Architecture   SIMD Width   Floating Point
    AltiVec           PowerPC        128          yes
    MMX/SSE           Intel          64/128       yes
    3DNow!            AMD            64           yes
    VIS               Sun            64           no
    MAX2              HP             64           no
    MVI               Alpha          64           no
    MDMX              MIPS V         64           yes

● And of course Cell
  – SPU has 128 128-bit registers
  – All instructions are SIMD instructions
  – Registers are treated as short vectors of 8/16/32-bit integers or single/double-precision floats


Using SIMD Instructions

● Library calls and inline assembly
  – Difficult to program
  – Not portable

● Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!

● You'll get first-hand experience with Cell


Superword Level Parallelism (SLP)

● Small amount of parallelism
  – Typically 2 to 8-way

● Exists within basic blocks

● Uncovered with simple analysis

Samuel Larsen. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. Master's thesis, Massachusetts Institute of Technology, May 2000.


1. Independent ALU Ops

    R = R + XR * 1.08327
    G = G + XG * 1.89234
    B = B + XB * 1.29835

    [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]


2. Adjacent Memory References

    R = R + X[i+0]
    G = G + X[i+1]
    B = B + X[i+2]

    [R G B] = [R G B] + X[i:i+2]


3. Vectorizable Loops

Original loop:

    for (i=0; i<100; i+=1)
        A[i+0] = A[i+0] + B[i+0]

Unrolled by 4:

    for (i=0; i<100; i+=4)
        A[i+0] = A[i+0] + B[i+0]
        A[i+1] = A[i+1] + B[i+1]
        A[i+2] = A[i+2] + B[i+2]
        A[i+3] = A[i+3] + B[i+3]

Vectorized:

    for (i=0; i<100; i+=4)
        A[i:i+3] = A[i:i+3] + B[i:i+3]


4. Partially Vectorizable Loops

Original loop:

    for (i=0; i<16; i+=1)
        L = A[i+0] - B[i+0]
        D = D + abs(L)

Unrolled by 2:

    for (i=0; i<16; i+=2)
        L = A[i+0] - B[i+0]
        D = D + abs(L)
        L = A[i+1] - B[i+1]
        D = D + abs(L)

Partially vectorized:

    for (i=0; i<16; i+=2)
        [L0 L1] = A[i:i+1] - B[i:i+1]
        D = D + abs(L0)
        D = D + abs(L1)
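In plain C, the partially vectorized form can be sketched as below: the two subtractions are independent and touch adjacent memory, so they are candidates for one 2-wide operation, while the accumulation into D remains a scalar dependence chain. The function name `sum_abs_diff16` and the `int` element type are assumptions of this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of absolute differences over 16 elements, unrolled by 2. */
int sum_abs_diff16(const int *A, const int *B)
{
    assert(A != NULL && B != NULL);
    int D = 0;
    for (int i = 0; i < 16; i += 2) {
        int L[2];
        /* Vectorizable part: two independent, adjacent subtractions. */
        L[0] = A[i + 0] - B[i + 0];
        L[1] = A[i + 1] - B[i + 1];
        /* Scalar part: each addition depends on the previous value of D. */
        D = D + abs(L[0]);
        D = D + abs(L[1]);
    }
    return D;
}
```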


From SLP to SIMD Execution

● Benefits
  – Multiple ALU ops → One SIMD op
  – Multiple load and store ops → One wide memory op

● Cost
  – Packing and unpacking vector register
  – Reshuffling within a register


Packing and Unpacking Costs

● Packing source operands
● Unpacking destination operands

    A = f()
    B = g()
    C = A + 2
    D = B + 3
    E = C / 5
    F = D * 7

    [A B] packed from scalars A and B
    [C D] = [A B] + [2 3]
    C and D unpacked from [C D]


Packing Costs can be Amortized

● Use packed result operands

    A = B + C
    D = E + F
    G = A - H
    I = D - J

● Share packed source operands

    A = B + C
    D = E + F
    G = B + H
    I = E + J


Adjacent Memory is Key

● Large potential performance gains
  – Eliminate load/store instructions
  – Reduce memory bandwidth

● Few packing possibilities
  – Only one ordering exploits pre-packing


SLP Extraction Algorithm

Input basic block:

    A = X[i+0]
    C = E * 3
    B = X[i+1]
    H = C - A
    D = F * 5
    J = D - B

● Identify adjacent memory references

    [A B] = X[i:i+1]

● Follow operand use

    [H J] = [C D] - [A B]

● Follow operand use

    [C D] = [E F] * [3 5]

Resulting packed statements:

    [A B] = X[i:i+1]
    [C D] = [E F] * [3 5]
    [H J] = [C D] - [A B]
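The result of the extraction can be checked by writing the basic block both ways. In the sketch below, `vec2` is a hypothetical two-lane "superword" type emulated with a struct, so the code shows only the grouping of operations and does not use real SIMD instructions; all names are illustrative.

```c
#include <assert.h>

/* Hypothetical 2-wide superword, emulated with a plain struct. */
typedef struct { double lane[2]; } vec2;

static vec2 vec2_load(const double *p)
{
    vec2 v = {{p[0], p[1]}};   /* one wide load of adjacent elements */
    return v;
}

static vec2 vec2_mul(vec2 x, vec2 y)
{
    vec2 v = {{x.lane[0] * y.lane[0], x.lane[1] * y.lane[1]}};
    return v;
}

static vec2 vec2_sub(vec2 x, vec2 y)
{
    vec2 v = {{x.lane[0] - y.lane[0], x.lane[1] - y.lane[1]}};
    return v;
}

/* The original basic block, statement by statement. */
void block_scalar(const double *X, int i, double E, double F,
                  double *H, double *J)
{
    double A = X[i + 0];
    double C = E * 3;
    double B = X[i + 1];
    *H = C - A;
    double D = F * 5;
    *J = D - B;
}

/* The packed form found by following adjacent memory references
   and operand uses. */
void block_packed(const double *X, int i, double E, double F,
                  double *H, double *J)
{
    vec2 AB = vec2_load(&X[i]);        /* [A B] = X[i:i+1]      */
    vec2 EF = {{E, F}};                /* pack sources E and F  */
    vec2 k  = {{3, 5}};
    vec2 CD = vec2_mul(EF, k);         /* [C D] = [E F] * [3 5] */
    vec2 HJ = vec2_sub(CD, AB);        /* [H J] = [C D] - [A B] */
    *H = HJ.lane[0];                   /* unpack results        */
    *J = HJ.lane[1];
}
```

Note that E and F still have to be packed and H and J unpacked, which is the cost side of the trade-off discussed earlier.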


SLP Availability

[Figure: bar chart of % dynamic SUIF instructions eliminated (0 to 100) for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MMM, and YUV, with one series for 128 bits and one for 1024 bits]


Speedup on AltiVec

[Figure: bar chart of speedup for swim, tomcatv, FIR, IIR, VMM, MMM, and YUV; the axis runs from 1 to 2, and one bar exceeds it and is labeled 6.7]


Performance in Uniprocessors

time = compute + wait

● Instruction level parallelism
  – Multiple functional units, deeply pipelined, speculation, ...

● Data level parallelism
  – SIMD: short vector instructions (multimedia extensions)
      – Hardware is simpler, no heavily ported register files
      – Instructions are more compact
      – Reduces instruction fetch bandwidth

● Complex memory hierarchies
  – Multiple level caches, many outstanding misses, prefetching, …
  – Exploiting locality is essential


Instruction Locality

Baseline (miss rate = 1):

    for i = 1 to N
        A(); B(); C();
    end

Full Scaling (miss rate = 1 / N):

    for i = 1 to N  A();
    for i = 1 to N  B();
    for i = 1 to N  C();

[Figure: working set size of each version compared with the cache size; the baseline working set is A + B + C, and executions of A, B, and C are marked as cache misses or cache hits]


Example Memory (Cache) Optimization

Baseline:

    for i = 1 to N
        A(); B(); C();
    end

Full Scaling:

    for i = 1 to N  A();
    for i = 1 to N  B();
    for i = 1 to N  C();

Scaling with A and B kept together:

    for i = 1 to N
        A(); B();
    end
    for i = 1 to N
        C();
    end

Cache Aware:

    for j = 1 to N/64
        for i = 1 to 64
            A(); B();
        end
        for i = 1 to 64
            C();
        end
    end

[Figure: instruction and data working set sizes of A, B, and C for each version, compared with the cache size]
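A sketch of the cache-aware schedule in C, compared against the baseline. The three stages here are trivial stand-ins (fill, double, add one) invented for the example, and the batch size of 64 is taken from the slide; in practice it would be chosen from the cache size and the working sets of the stages.

```c
#include <assert.h>
#include <stddef.h>

enum { BATCH = 64 };

/* Stand-in stages; a real stream program would do more work in each. */
static void stage_a(double *x, size_t i)                  { x[i] = (double)i; }
static void stage_b(const double *x, double *y, size_t i) { y[i] = 2.0 * x[i]; }
static void stage_c(const double *y, double *z, size_t i) { z[i] = y[i] + 1.0; }

/* Baseline: A, B, and C run back to back for every element. */
void run_baseline(double *x, double *y, double *z, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        stage_a(x, i);
        stage_b(x, y, i);
        stage_c(y, z, i);
    }
}

/* Cache aware: run A and B for a batch of 64 elements, then C for the
   same batch, so the data produced by B is still nearby when C reads it. */
void run_cache_aware(double *x, double *y, double *z, size_t n)
{
    assert(n % BATCH == 0);
    for (size_t j = 0; j < n / BATCH; j++) {
        size_t base = j * BATCH;
        for (size_t i = base; i < base + BATCH; i++) {
            stage_a(x, i);
            stage_b(x, y, i);
        }
        for (size_t i = base; i < base + BATCH; i++)
            stage_c(y, z, i);
    }
}
```

Both schedules compute the same result; only the order of execution, and therefore the instruction and data working sets, differ.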


Results from Cache Optimizations

[Figure: bar chart with an axis from 0 to 1 comparing "ignoring cache constraints" with "cache aware" on StrongARM 1110, Pentium 3, and Itanium 2]

Janis Sermulins. Cache Aware Optimizations of Stream Programs. Master's thesis, Massachusetts Institute of Technology, May 2005.


6.189 IAP 2007

Summary


Programming for Performance

● Tune the parallelism first

● Then tune performance on individual processors
  – Modern processors are complex
  – Need instruction level parallelism for performance
  – Understanding performance requires a lot of probing

● Optimize for the memory hierarchy
  – Memory is much slower than processors
  – Multi-layer memory hierarchies try to hide the speed gap
  – Data locality is essential for performance


Programming for Performance

● May have to change everything!
  – Algorithms, data structures, program structure

● Focus on the biggest performance impediments
  – Too many issues to study everything
  – Remember the law of diminishing returns
