
Memory Systems Performance Workshop 2004 © David Ryan Koes 2004

MSP 2004

Programmer Specified Pointer Independence

David Koes, Mihai Budiu, Girish Venkataramani, Seth Copen Goldstein

Outline

• Motivation
• #pragma independent
• Automated Annotation
• Evaluation
• Conclusion

Problem

Potentially aliasing pointers inhibit compiler optimization.

Fully determining pointer aliasing may be infeasible or expensive.

How to get the benefit without paying the cost?

Memory Dependencies

Memory dependencies inhibit optimization
• Introduce edges into the dependence graph
• Limit parallelization
• Inhibit code motion
    – instruction scheduling
    – loop invariant code motion
    – partial redundancy elimination
    – register promotion

Breaking memory dependencies is difficult
• compile-time analysis infeasible or expensive
• run-time analysis limited to local window

Examples

    while (len--) {
        *p++ = *q++;
    }

There is a real data dependence between the load and store within a single iteration.

Unroll loop to exploit parallelism

Itanium assembly from gcc:

    .L26:
        mov r24 = r33
        mov r17 = r32
        adds r22 = 8, r33
        adds r19 = 8, r32
        adds r20 = 12, r33
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r24], 4
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r17] = r14, 4
        ld4 r23 = [r24]
        ;;
        st4 [r17] = r23
        ld4 r18 = [r22]
        ;;
        st4 [r19] = r18
        ld4 r16 = [r20]
        ;;
        st4 [r21] = r16
        br.cloop .L26
        ;;

Without memory dependence:

    .L26:
        mov r18 = r33
        mov r23 = r32
        adds r25 = 8, r33
        adds r24 = 12, r33
        adds r22 = 8, r32
        adds r21 = 12, r32
        ;;
        ld4 r14 = [r18], 4
        ld4 r19 = [r25]
        adds r33 = 16, r33
        adds r32 = 16, r32
        ;;
        st4 [r23] = r14, 4
        ld4 r16 = [r18]
        ld4 r20 = [r24]
        ;;
        .mmb
        st4 [r23] = r16
        st4 [r22] = r19
        st4 [r21] = r20
        br.cloop .L26
        ;;
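As a rough source-level picture of the two schedules, the sketch below unrolls the copy loop by four in two ways. The function names `copy4_may_alias` and `copy4_independent` are invented for this illustration, and the code only mirrors the ordering of loads and stores; it is not gcc's actual transformation. The second version assumes p and q never overlap.

```c
#include <stddef.h>

/* Hypothetical illustration: unrolled by 4, but each load stays after the
 * previous store, as required when p and q may alias. */
void copy4_may_alias(int *p, const int *q, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        p[i]     = q[i];
        p[i + 1] = q[i + 1];  /* this load may not move above the store */
        p[i + 2] = q[i + 2];
        p[i + 3] = q[i + 3];
    }
    for (; i < len; i++)      /* leftover elements */
        p[i] = q[i];
}

/* Hypothetical illustration: all four loads are issued before any store.
 * This reordering is only correct if p and q do not overlap. */
void copy4_independent(int *p, const int *q, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        int t0 = q[i], t1 = q[i + 1], t2 = q[i + 2], t3 = q[i + 3];
        p[i]     = t0;
        p[i + 1] = t1;
        p[i + 2] = t2;
        p[i + 3] = t3;
    }
    for (; i < len; i++)      /* leftover elements */
        p[i] = q[i];
}
```

On disjoint buffers the two functions give the same result. If p points one element past q in the same array, they diverge, which is why the compiler needs to know the pointers are independent before it can reorder.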

Examples

    for (i = 0; i < len; i++) {
        ...
        ... = *q;
        ...
        *p = ...
    }

loop invariant code motion:

    t0 = *q;      /* if loop will be executed */
    for (i = 0; i < len; i++) {
        ...
        ... = t0;
        ...
        *p = ...
    }

register promotion:

    t0 = *q;
    for (i = 0; i < len; i++) {
        ...
        ... = t0;
        ...
        t1 = ...
    }
    *p = t1;      /* if loop was executed */

Hardware can’t do this
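A concrete instance of the two transformations may help. This is a minimal sketch with an invented loop body; the names `last_product` and `last_product_promoted` are hypothetical and do not come from the talk. The second function applies both loop invariant code motion and register promotion by hand, which is only valid under the assumption that p and q never alias.

```c
/* Loop as written: loads *q and stores *p on every iteration. */
void last_product(int *p, const int *q, int len)
{
    for (int i = 0; i < len; i++)
        *p = *q * i;
}

/* Hand-optimized form, assuming p and q are independent. */
void last_product_promoted(int *p, const int *q, int len)
{
    if (len > 0) {                 /* only if the loop will be executed */
        int t0 = *q;               /* load hoisted out of the loop */
        int t1 = 0;
        for (int i = 0; i < len; i++)
            t1 = t0 * i;           /* value of *p kept in a register */
        *p = t1;                   /* single store after the loop */
    }
}
```

With distinct objects both functions leave the same value in `*p`. When p and q point to the same int, the original loop overwrites the value it later reloads, so the optimized form computes something different.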

Pointer Analysis

Memory Disambiguation is important
• hardware can’t do everything
• so have compiler figure it out...

easy!

    int p[10];
    foo()
    {
        int q[10];
        ...
    }

harder.. need precise dataflow analysis

    foo()
    {
        int *p, *q;
        int a, b;
        if (...) {
            p = &a;
            q = &b;
        } else {
            p = &b;
            q = &a;
        }
        ...
    }

requires inter-procedural information

    foo(int *p, int *q)
    {
        ...
    }

Inter-procedural Pointer Analysis

• Just apply same techniques as used for intraprocedural
    • may not be possible
        – gcc -c foo.c
    • may not be feasible
        – n² analysis on source code of Microsoft Office?

• Use less precise analysis
    • still might not be possible (separate compilation, libraries)
    • still takes time (every time you compile, or at least link)
    • less precise ⇒ less optimization

Alternative: Have Programmer Do It

Programmer annotates source code
• informs compiler of pointer relationships

Previous Work
• ANSI C99 restrict keyword
    – difficult for compiler and programmer to reason about
    – non-local semantics
• MIPSpro #pragma ivdep
    – break loop carried dependence in inner loop


#pragma independent

Syntax

    #pragma independent ptr1 ptr2

Example

    int x[100];
    int y;

    void foo(int *a, int *b)
    {
    #pragma independent a b
        int arr[50];
        …
    }

[Diagram: memory objects x, y, arr, malloc_site_1, malloc_site_2]

pointers guaranteed to always point to different objects

Examples

    void f(int len, int * p, int * q)
    {
    #pragma independent p q
        while (len--)
            *p++ = *q++;
    }

    void example(int *a, int *b, int *c)
    {
    #pragma independent a b
    #pragma independent a c
        (*b)++;
        *a = *b;
        *a = *a + *c;
    }

pragmas allow compiler to eliminate a store to *a
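To make the eliminated store visible, here is a sketch of `example` next to a hand-written version with only one store to `*a`. The pragmas are left out because standard compilers do not implement them, and the names `example_plain` and `example_one_store` are made up for this illustration. The one-store version assumes that a is independent of both b and c.

```c
/* The slide's function without the pragmas: two stores to *a. */
void example_plain(int *a, int *b, int *c)
{
    (*b)++;
    *a = *b;          /* first store to *a */
    *a = *a + *c;     /* *c must be reloaded here in case it aliases *a */
}

/* What a compiler could emit given both pragmas (assumption: a never
 * aliases b or c), so the first store to *a is dead. */
void example_one_store(int *a, int *b, int *c)
{
    (*b)++;
    *a = *b + *c;     /* single store to *a */
}
```

If a and c do point to the same int, the two versions disagree, so the store can only be dropped once independence is known.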

#pragma independent

Advantages
• more flexible and powerful than restrict
• relationships between pointers explicit
• easy to reason about
    – affects only listed pointers
• easy to implement in compiler
    – fewer than 100 lines of code

Possible Disadvantage
• could take programmer a lot of time to annotate existing source


Automated Annotation Toolflow

[Diagram: toolflow. Stages: compiler, execution, script, programmer, pragma aware compiler. Data: *.c and *.h sources, inputs, executable with runtime checks, candidate pointer pairs and static scores, invalid pointer pairs and execution frequencies, pragma annotations ranked by score, source code with verified pragmas, faster executable.]

Compiler finds interesting pointer pairs
• pairs which inhibit optimization
• pairs whose aliasing is unknown

Inserts profiling code and checks

Automated Annotation Toolflow

Instrumented executable run on input
• records pointers which conflict
• counts number of pointer uses
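The talk does not show what the inserted checks look like, so the following is only a sketch of one possible instrumentation of the copy loop from the earlier Examples slide. Everything here is an assumption: the `pair_profile` and `ptr_range` types, the helper names, and the choice to approximate "same object" by overlapping address ranges touched through each pointer during one call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open address range [lo, hi) touched via one pointer. */
struct ptr_range { uintptr_t lo, hi; };

/* Hypothetical profile record for one candidate pointer pair. */
struct pair_profile {
    unsigned long uses;   /* number of dynamic uses of the pair */
    int conflict;         /* set once the pair is seen to conflict */
};

static void range_init(struct ptr_range *r)
{
    r->lo = UINTPTR_MAX;  /* empty range */
    r->hi = 0;
}

static void range_touch(struct ptr_range *r, const void *addr, size_t size)
{
    uintptr_t a = (uintptr_t)addr;
    if (a < r->lo) r->lo = a;
    if (a + size > r->hi) r->hi = a + size;
}

static int ranges_overlap(const struct ptr_range *x, const struct ptr_range *y)
{
    return x->lo < y->hi && y->lo < x->hi;
}

/* Instrumented copy loop: same behavior, plus profiling of the pair (p, q). */
void f_instrumented(int len, int *p, int *q, struct pair_profile *prof)
{
    struct ptr_range rp, rq;
    range_init(&rp);
    range_init(&rq);
    while (len--) {
        range_touch(&rp, p, sizeof *p);
        range_touch(&rq, q, sizeof *q);
        prof->uses++;
        *p++ = *q++;
    }
    if (ranges_overlap(&rp, &rq))
        prof->conflict = 1;   /* pair is invalid for this input */
}
```

A range-based check is conservative: it can flag pairs that touch interleaved but distinct objects. A pair that passes only shows independence for the inputs that were run, which is why the later programmer verification step still matters.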

Automated Annotation Toolflow

Script combines static and dynamic info
• eliminates conflicting pairs
• assigns score to each pair
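The slides do not say how the script computes a score, so the sketch below uses a placeholder formula (static score multiplied by execution count) purely to show the filter-then-rank shape of this step. The `candidate` struct, its fields, and `rank_candidates` are all hypothetical.

```c
#include <stdlib.h>

/* Hypothetical record for one candidate pair; field names are illustrative. */
struct candidate {
    const char *ptr1, *ptr2;    /* pointer names for the pragma */
    double static_score;        /* reported by the compiler (assumed scale) */
    unsigned long exec_count;   /* execution frequency from the profile run */
    int conflict;               /* nonzero if the run saw the pair conflict */
    double score;               /* filled in by rank_candidates */
};

/* qsort comparator: higher score first. */
static int by_score_desc(const void *a, const void *b)
{
    const struct candidate *x = a, *y = b;
    return (x->score < y->score) - (x->score > y->score);
}

/* Drops conflicting pairs, scores the rest with a placeholder formula,
 * and sorts them by descending score. Returns the number of pairs kept,
 * which occupy the front of the array. */
size_t rank_candidates(struct candidate *c, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (c[i].conflict)
            continue;                       /* eliminated: invalid pair */
        c[kept] = c[i];
        c[kept].score = c[kept].static_score * (double)c[kept].exec_count;
        kept++;
    }
    qsort(c, kept, sizeof *c, by_score_desc);
    return kept;
}
```

The surviving entries could then be printed as `#pragma independent ptr1 ptr2` lines with the score in a comment, in the style of the example output slide.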

Automated Annotation Toolflow

Programmer verifies pointer pairs
• can verify high scoring pairs only

Example Output

    void summer(int *p, int *q, int n, int *result)
    {
    #pragma independent p q /* score: 1100 */
    #pragma independent p result /* score: 15 */
    #pragma independent q result /* score: 12 */
        int i, sum = 0;
        for (i = 0; i < n; i++) {
            *p += *q;
            sum += *q;
        }
        *result = sum;
    }
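The high score on the (p, q) pair plausibly reflects how much the loop gains once that pair is known to be independent. As a hedged sketch, `summer_promoted` below is a hand-written guess at the kind of code a compiler could then produce; the name is invented, and the pragmas are omitted because standard compilers do not implement them.

```c
/* The function from the slide, without the pragmas. */
void summer(int *p, int *q, int n, int *result)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        *p += *q;
        sum += *q;
    }
    *result = sum;
}

/* Hand-optimized form, assuming p and q are independent:
 * *q is loaded once and *p lives in a register during the loop. */
void summer_promoted(int *p, int *q, int n, int *result)
{
    int i, sum = 0;
    if (n > 0) {
        int tq = *q;      /* loop invariant load */
        int tp = *p;      /* promoted to a register */
        for (i = 0; i < n; i++) {
            tp += tq;
            sum += tq;
        }
        *p = tp;          /* single store after the loop */
    }
    *result = sum;
}
```

With distinct p and q the two functions agree. If p and q alias, every `*p += *q` changes the value that the next `*q` reads, so the promoted form gives a different answer.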

Sample Score Distribution

[Chart: number of pairs (0 to 400) versus percentile of maximum score (0% to 90%); series: Dynamic Score, Static Score]


Targets & Benchmarks

Targets
• Itanium
    • EPIC/VLIW architecture
    • instruction scheduling important for good performance
• ASH (Application Specific Hardware)
    • can take full advantage of parallelism

Benchmarks
• Mediabench
    • small, multimedia applications
    • can’t time accurately on Itanium
• Spec95, Spec2000
    • general purpose integer
    • longer running
        – sometimes days for ASH simulation

Compilers

Compilers
• gcc
    • not very sophisticated optimizations
    • -funroll-loops -O2
• CASH
    • more sophisticated optimizations
    • memory dependencies are first class objects
        – token edge
        – pragma independent removes edge

Questions

Do we find a reasonable number of potential annotations?
• Yes!

Do the annotations result in faster code?
• Yes!

Does our scoring mechanism find the pointer pairs with the biggest impact on performance?
• Yes!

How much time does the programmer have to spend verifying pragmas?
• Not a lot!

Annotations Found

[Bar chart: number of pointer pairs per benchmark (axis 0 to 300), split into unchecked, conflict, no conflict, and useful. Benchmarks: 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, 175.vpr, 181.mcf, adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, 176.gcc, 197.parser, 256.bzip2, 300.twolf, 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 183.equake, 188.ammp, 301.apsi]

Do the annotations result in faster code?

[Chart: Itanium Speedup]

Of 19 Spec benchmarks, these were the only ones to demonstrate measurable speedup

Do the annotations result in faster code?

[Chart: CASH Speedup]

Does our scoring mechanism work?

[Chart: mpeg2_e; x-axis: number of highest scoring pragmas, up to all (68)]

How much time does the programmer have to spend?

Verified Speedup

Conclusions

• We’ve performed a limit study of pointer analysis
    • gcc doesn’t fully exploit the results of pointer analysis
    • CASH and ASH can fully exploit parallelism

• Programmer specified annotations are effective
    • faster and more flexible than inter-procedural analysis

• Annotations can be automatically generated
    • automatic score successfully focuses programmer’s attention
    • manual verification does not take long

ANSI C99 restrict keyword

An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer. The intended use of the restrict qualifier (like the register storage class) is to promote optimization, and deleting all instances of the qualifier from all preprocessing translation units composing a conforming program does not change its meaning (i.e., observable behavior).

ISO/IEC 9899, Second edition, 1999-12-01, 6.7.3-7

restrict Example

    void f(int len, int * restrict p, int * restrict q)
    {
        while (len--)
            *p++ = *q++;
    }

restrict tells the compiler that p and q refer to different objects, enabling optimizations

Problems with restrict

6.7.3.1

gcc’s restrict Implementation

• No two restricted pointers can alias
• A restricted pointer and an unrestricted pointer may alias

This definition is intuitive for both the programmer and compiler

But not the C99 definition!