
Automatic Differentiation

by Program Transformation

Laurent Hascoët

INRIA Sophia-Antipolis, France
http://www-sop.inria.fr/tropics

École d'été CEA-EDF-INRIA, June 2006

Laurent Hascoët (INRIA), Automatic Differentiation, CEA-EDF-INRIA 2006


Outline

1 Introduction

2 Formalization

3 Reverse AD

4 Memory issues in Reverse AD: Checkpointing

5 Reverse AD for minimization

6 Some AD Tools

7 Static Analyses in AD tools

8 The TAPENADE AD tool

9 Validation of AD results

10 Conclusion



So you need derivatives?...

Given a program P computing a function F

F : IR^m → IR^n, X ↦ Y

we want to build a program that computes the derivatives of F.

Specifically, we want the derivatives of the dependents, i.e. some variables in Y, with respect to the independents, i.e. some variables in X.



Divided Differences

Given Ẋ, run P twice and compute an approximation of Ẏ:

Ẏ ≈ (P(X + εẊ) − P(X)) / ε

Pros: immediate; no thinking required!

Cons: it is only an approximation, and which ε should we choose? ⇒ Not so cheap after all!

Most applications require inexpensive and accurate derivatives.

⇒ Let's go for exact, analytic derivatives!
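The "which ε?" question is real: too large an ε gives truncation error, too small gives round-off. A quick sketch (in Python for brevity; the function and the sample values are illustrative, not from the slides):

```python
# Forward divided difference for f(x) = x**3 at x = 1; exact derivative is 3.
# Truncation error grows with eps, round-off error grows as eps shrinks.
def f(x):
    return x ** 3

def forward_diff(f, x, eps):
    return (f(x + eps) - f(x)) / eps

errors = {eps: abs(forward_diff(f, 1.0, eps) - 3.0)
          for eps in (1e-1, 1e-8, 1e-15)}
# Neither extreme works: eps = 1e-1 is dominated by truncation error,
# eps = 1e-15 by round-off; near sqrt(machine epsilon) is the sweet spot.
```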



Automatic Differentiation

Augment program P to make it compute the analytic derivatives

P: a = b*T(10) + c

The differentiated program must somehow compute:

P': da = db*T(10) + b*dT(10) + dc

How can we achieve this?

AD by Overloading

AD by Program transformation



AD by overloading

Tools: adol-c, ...

Few manipulations required:

DOUBLE → ADOUBLE;

link with provided overloaded +, -, *, ...

Easy extension to higher-order, Taylor series, intervals, ... but not so easy for gradients.

Anecdote:

real → complex

x = a*b → (x, dx) = (a*b - da*db, a*db + da*b)
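For illustration, a minimal dual-number class sketches what the overloading approach does (Python rather than the C++ of adol-c; the class and names are assumptions, not adol-c's API):

```python
class Dual:
    """Forward-mode AD value: carries (value, directional derivative)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = self._lift(other)
        return Dual(self.val / other.val,
                    (self.dot * other.val - self.val * other.dot)
                    / (other.val * other.val))

# The slides' FOO, unchanged except that it now runs on Dual numbers.
def foo(v1, v2, p1):
    v3 = 2.0 * v1 + 5.0
    return v3 + p1 * v2 / v3

# Seed v1's derivative with 1.0 to get d(v4)/d(v1) at (v1, v2, p1) = (1, 3, 0.5).
v4 = foo(Dual(1.0, 1.0), Dual(3.0), 0.5)
```

The point of overloading is visible here: foo itself is untouched; only its number type changed.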



AD by Program transformation

Tools: adifor, taf, tapenade,...

Complex transformation required:

Build a new program that computes the analytic derivatives explicitly. This requires a compiler-like, sophisticated tool:

1 PARSING
2 ANALYSIS
3 DIFFERENTIATION
4 REGENERATION



Overloading vs Transformation

Overloading is versatile,

Transformed programs are efficient:

Global program analyses are possible... and most welcome!

The compiler can optimize the generated program.



Example: Tangent differentiation

by Program transformation

Original routine:

SUBROUTINE FOO(v1, v2, v4, p1)
REAL v1,v2,v3,v4,p1
v3 = 2.0*v1 + 5.0
v4 = v3 + p1*v2/v3
END

Differentiation adds v1d, v2d, v4d to the parameter list, declares REAL v1d,v2d,v3d,v4d, and inserts the differentiated instructions:

v3d = 2.0*v1d
v4d = v3d + p1*(v2d*v3-v2*v3d)/(v3*v3)

Just inserts "differentiated instructions" into FOO




2 Formalization


Computer Programs as Functions

We see program P as:

f = f_p ∘ f_{p-1} ∘ ... ∘ f_1

We define for short:

W_0 = X and W_k = f_k(W_{k-1})

The chain rule yields:

f'(X) = f'_p(W_{p-1}) · f'_{p-1}(W_{p-2}) · ... · f'_1(W_0)



Tangent mode and Reverse mode

Full f'(X) is expensive and often useless. We'd better compute useful "projections".

tangent AD:

Ẏ = f'(X) · Ẋ = f'_p(W_{p-1}) · f'_{p-1}(W_{p-2}) · ... · f'_1(W_0) · Ẋ

reverse AD:

X̄ = f'ᵗ(X) · Ȳ = f'ᵗ_1(W_0) · ... · f'ᵗ_{p-1}(W_{p-2}) · f'ᵗ_p(W_{p-1}) · Ȳ

Evaluate both from right to left: ⇒ always matrix × vector

Theoretical cost is about 4 times the cost of P
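The right-to-left rule can be seen on a toy chain of two elementary Jacobians (a Python sketch with made-up 2×2 matrices; real AD tools never build these matrices explicitly):

```python
def matvec(A, x):
    # Matrix * vector: the only operation either mode ever needs.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

J1 = [[1.0, 0.0], [2.0, 1.0]]   # plays f'_1(W_0)
J2 = [[0.5, 1.0], [0.0, 3.0]]   # plays f'_2(W_1)
Xdot = [1.0, 2.0]
Ybar = [3.0, 4.0]

# Tangent: Ydot = f'_2 . f'_1 . Xdot, evaluated right to left.
Ydot = matvec(J2, matvec(J1, Xdot))

# Reverse: Xbar = f'_1^t . f'_2^t . Ybar, also right to left.
Xbar = matvec(transpose(J1), matvec(transpose(J2), Ybar))
```

No matrix × matrix product ever occurs, and the inner products (Xbar · Xdot) and (Ybar · Ydot) agree, as they must.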



Costs of Tangent and Reverse AD

F : IR^m → IR^n, with m inputs and n outputs.

[Figure: the Jacobian f'(X); the tangent mode builds it column by column, the gradient (reverse) mode row by row.]

f'(X) costs ≈ (m + 1) × P using Divided Differences

f'(X) costs m × 4 × P using the tangent mode: good if m <= n

f'(X) costs n × 4 × P using the reverse mode: good if m >> n (e.g. n = 1 in optimization)



Back to the Tangent Mode example

v3 = 2.0*v1 + 5.0
v4 = v3 + p1*v2/v3

Elementary Jacobian matrices (variables ordered v1, v2, v3, v4):

f'(X) = ... ·
[ 1    0        0         0 ]   [ 1 0 0 0 ]
[ 0    1        0         0 ] · [ 0 1 0 0 ] · ...
[ 0    0        1         0 ]   [ 2 0 0 0 ]
[ 0  p1/v3  1−p1·v2/v3²   0 ]   [ 0 0 0 1 ]

which yields the tangent derivatives:

v̇3 = 2 · v̇1

v̇4 = v̇3 · (1 − p1·v2/v3²) + v̇2 · p1/v3



Tangent Mode example continued

Tangent AD keeps the structure of P:

...
v3d = 2.0*v1d
v3 = 2.0*v1 + 5.0
v4d = v3d*(1-p1*v2/(v3*v3)) + v2d*p1/v3
v4 = v3 + p1*v2/v3
...

Differentiated instructions inserted into P's original control flow.
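As a sanity check, the tangent code above can be transcribed to Python and compared with a divided difference (the transcription and the test point are mine, not from the slides):

```python
def foo(v1, v2, p1):
    v3 = 2.0 * v1 + 5.0
    return v3 + p1 * v2 / v3

def foo_d(v1, v1d, v2, v2d, p1):
    # Differentiated instruction placed just before each original one.
    v3d = 2.0 * v1d
    v3 = 2.0 * v1 + 5.0
    v4d = v3d * (1 - p1 * v2 / (v3 * v3)) + v2d * p1 / v3
    v4 = v3 + p1 * v2 / v3
    return v4, v4d

# Directional derivative along (v1d, v2d) = (1, 1) vs. a divided difference.
eps = 1e-7
v4, v4d = foo_d(1.0, 1.0, 3.0, 1.0, 0.5)
dd = (foo(1.0 + eps, 3.0 + eps, 0.5) - foo(1.0, 3.0, 0.5)) / eps
```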


3 Reverse AD


Focus on the Reverse mode

X̄ = f'ᵗ(X) · Ȳ = f'ᵗ_1(W_0) · ... · f'ᵗ_p(W_{p-1}) · Ȳ

I_1 ;
I_2 ;
...
I_{p-1} ;
W̄ = Ȳ ;
W̄ = f'ᵗ_p(W_{p-1}) * W̄ ;
Restore W_{p-2} before I_{p-2} ;
W̄ = f'ᵗ_{p-1}(W_{p-2}) * W̄ ;
...
Restore W_0 before I_1 ;
W̄ = f'ᵗ_1(W_0) * W̄ ;
X̄ = W̄ ;

Instructions differentiated in the reverse order!





Reverse mode: graphical interpretation

[Figure: along the time axis, the forward sweep runs I_1, I_2, ..., I_{p-1}, then the backward sweep runs Ī_p, Ī_{p-1}, ..., Ī_1 in reverse.]

Bottleneck: memory usage (“Tape”).



Back to the example

v3 = 2.0*v1 + 5.0
v4 = v3 + p1*v2/v3

Transposed Jacobian matrices (variables ordered v1, v2, v3, v4):

f'ᵗ(X) = ... ·
[ 1 0 2 0 ]   [ 1 0 0      0       ]
[ 0 1 0 0 ] · [ 0 1 0    p1/v3     ] · ...
[ 0 0 0 0 ]   [ 0 0 1  1−p1·v2/v3² ]
[ 0 0 0 1 ]   [ 0 0 0      0       ]

which yields the adjoint instructions, in reverse order:

v̄2 = v̄2 + v̄4 · p1/v3
v̄3 = v̄3 + v̄4 · (1 − p1·v2/v3²)
v̄4 = 0
...
v̄1 = v̄1 + 2 · v̄3
v̄3 = 0


Reverse Mode example continued

Reverse AD inverts the structure of P:

...
v3 = 2.0*v1 + 5.0
v4 = v3 + p1*v2/v3
...
.........................../*restore previous state*/
v2b = v2b + p1*v4b/v3
v3b = v3b + (1-p1*v2/(v3*v3))*v4b
v4b = 0.0
......................../*restore previous state*/
v1b = v1b + 2.0*v3b
v3b = 0.0
......................../*restore previous state*/
...

Differentiated instructions inserted into the inverse of P's original control flow.
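Transcribed to Python (a sketch of mine; here v3 is simply kept alive instead of being pushed to and popped from the tape), the reverse sweep reproduces both partial derivatives of v4:

```python
def foo_b(v1, v2, p1, v4b):
    # Forward sweep: the original computation.
    v3 = 2.0 * v1 + 5.0
    v4 = v3 + p1 * v2 / v3
    # Backward sweep: adjoint statements in reverse order.
    v1b = v2b = v3b = 0.0
    v2b = v2b + p1 * v4b / v3
    v3b = v3b + (1 - p1 * v2 / (v3 * v3)) * v4b
    v4b = 0.0
    v1b = v1b + 2.0 * v3b
    v3b = 0.0
    return v4, v1b, v2b

# Seeding v4b = 1 returns the gradient of v4 w.r.t. (v1, v2) in one run.
v4, v1b, v2b = foo_b(1.0, 3.0, 0.5, 1.0)
```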



Control Flow Inversion : conditionals

The control flow of the forward sweep is mirrored in the backward sweep.

...

if (T(i).lt.0.0) then

T(i) = S(i)*T(i)

endif

...

if (...) then

Sb(i) = Sb(i) + T(i)*Tb(i)

Tb(i) = S(i)*Tb(i)

endif

...


Control Flow Inversion : loops

Reversed loops run in the inverse order

...

Do i = 1,N

T(i) = 2.5*T(i-1) + 3.5

Enddo

...

Do i = N,1,-1

Tb(i-1) = Tb(i-1) + 2.5*Tb(i)

Tb(i) = 0.0

Enddo
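Transcribing the loop and its adjoint to Python (a sketch; N and the seed values are mine) confirms that the reversed loop propagates d T(N) / d T(0) = 2.5**N:

```python
N = 5

def forward(t0):
    T = [t0] + [0.0] * N
    for i in range(1, N + 1):        # Do i = 1,N
        T[i] = 2.5 * T[i - 1] + 3.5
    return T[N]

# Adjoint sweep: seed Tb[N] = 1 and run the loop in the inverse order.
Tb = [0.0] * (N + 1)
Tb[N] = 1.0
for i in range(N, 0, -1):            # Do i = N,1,-1
    Tb[i - 1] = Tb[i - 1] + 2.5 * Tb[i]
    Tb[i] = 0.0
# Tb[0] now holds d T(N) / d T(0).
```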


4 Memory issues in Reverse AD: Checkpointing


Time/Memory tradeoffs for reverse AD

From the definition of the gradient X̄

X̄ = f'ᵗ(X) · Ȳ = f'ᵗ_1(W_0) · ... · f'ᵗ_p(W_{p-1}) · Ȳ

we get the general shape of a reverse AD program:

[Figure: forward sweep I_1 ... I_{p-1}, then backward sweep Ī_p ... Ī_1 in reverse.]

⇒ How can we restore previous values?



Restoration by recomputation

(RA: Recompute-All)

Restart execution from a stored initial state:

[Figure: for each backward step Ī_k, execution restarts from the stored initial state and re-runs I_1 ... I_{k-1}.]

Memory use low, CPU use high ⇒ trade-off needed!



Checkpointing (RA strategy)

On selected pieces of the program, possibly nested,remember the output state to avoid recomputation.

[Figure: a checkpointed piece p is run once, then re-run from its stored output state instead of being recomputed from the start.]

Memory and CPU grow like log(size(P))


Restoration by storage

(SA: Store-All)

Progressively undo the assignments made by the forward sweep

[Figure: forward sweep I_1 ... I_{p-1} pushing overwritten values onto the tape, then backward sweep Ī_p ... Ī_1 popping them.]

Memory use high, CPU use low ⇒ trade-off needed!



Checkpointing (SA strategy)

On selected pieces of the program, possibly nested, don't store the intermediate values; re-execute the piece when its values are required.

[Figure: a checkpointed piece C is run once without storing, then re-run with storage when the backward sweep reaches it.]

Memory and CPU grow like log(size(P))
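A sketch of the idea on an iterated map (Python; the segment length, step function, and names are all assumptions for illustration): within each checkpointed segment, intermediate values are not stored; the segment is re-run from its snapshot when the backward sweep reaches it.

```python
import math

STEPS, SEG = 12, 4   # 12 steps, checkpoint every 4

def step(x):
    # One elementary statement: x <- sin(x).
    return math.sin(x)

def grad_store_all(x0):
    # Reference: store every intermediate state, then sweep backward.
    xs = [x0]
    for _ in range(STEPS):
        xs.append(step(xs[-1]))
    xb = 1.0
    for xk in reversed(xs[:-1]):
        xb *= math.cos(xk)           # adjoint of x <- sin(x)
    return xb

def grad_checkpointed(x0):
    # Forward sweep: keep only one snapshot per segment.
    snaps, x = [], x0
    for k in range(STEPS):
        if k % SEG == 0:
            snaps.append(x)
        x = step(x)
    # Backward sweep, last segment first: recompute the segment's states
    # from its snapshot, then run its adjoint in reverse order.
    xb = 1.0
    for s in reversed(range(len(snaps))):
        xs = [snaps[s]]
        for k in range(s * SEG, min((s + 1) * SEG, STEPS)):
            xs.append(step(xs[-1]))
        for xk in reversed(xs[:-1]):
            xb *= math.cos(xk)
    return xb
```

Both functions return the same derivative; the checkpointed one stores 3 snapshots instead of 12 intermediate states, at the price of re-running each segment once.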



Checkpointing on calls (SA)

A classical choice: checkpoint procedure calls!

[Figure: a call tree of procedures A, B, C, D. Each checkpointed call is first run in its original form (a snapshot is taken), and later re-run as a forward sweep followed by its backward sweep (the snapshot is used).]

Memory and CPU grow like log(size(P)) when the call tree is well balanced.

Ill-balanced call trees may require leaving some calls un-checkpointed.

Careful analysis keeps the snapshots small.


5 Reverse AD for minimization


Applications to Minimization

From a simulation program P :

P : γ (design parameters) ↦ j(γ) (cost function)

P : γ (parameters to estimate) ↦ j(γ) (misfit function)

it takes a gradient j'(γ) to obtain a minimization program.

Reverse mode AD builds the program P̄ that computes j'(γ).

Minimization algorithms (Gradient descent, SQP, . . . )may also use 2nd derivatives. AD can provide them too.



A color picture (at last!...)

AD-computed gradient of a scalar cost (sonic boom) with respect to the skin geometry:



... and after a few optimization steps

Improvement of the sonic boom under the plane after 8 optimization cycles:

(Plane geometry provided by Dassault Aviation)



Data Assimilation (OPA 9.0/GYRE)

Influence of the temperature T at −300 metres on the heat flux across the North section, 20 days later.

[Figure: sensitivity map between 15° North and 30° North; arrows point out a Kelvin wave and a Rossby wave.]


6 Some AD Tools


Some AD tools

nagware f95 Compiler: Overloading; tangent, reverse

adol-c: Overloading + Tape; tangent, reverse, higher-order

adifor: Regeneration; tangent, reverse?, Store-All + Checkpointing

tapenade: Regeneration; tangent, reverse, Store-All + Checkpointing

taf: Regeneration; tangent, reverse, Recompute-All + Checkpointing



Some Limitations of AD tools

Fundamental problems:

Piecewise differentiability

Convergence of derivatives

Reverse AD of very large codes

Technical Difficulties:

Pointers and memory allocation

Objects

Inversion or Duplication of random control (communications, random, ...)


7 Static Analyses in AD tools


Activity analysis

Finds out the variables that, at some location,

do not depend on any independent,

or have no dependent depending on them.

Their derivative is either null or useless ⇒ simplifications.

orig. prog:

c = a*b
a = 5.0
d = a*c
e = a/c
e = floor(e)

tangent mode:

cd = a*bd + ad*b
c = a*b
ad = 0.0
a = 5.0
dd = a*cd + ad*c
d = a*c
ed = ad/c - a*cd/c**2
e = a/c
ed = 0.0
e = floor(e)

w/ activity analysis:

cd = a*bd + ad*b
c = a*b
a = 5.0
dd = a*cd
d = a*c
e = a/c
ed = 0.0
e = floor(e)



“To Be Recorded” analysis

In reverse AD, not all values must be restored during the backward sweep.

Variables occurring only in linear expressions do not appear in the differentiated instructions ⇒ not To Be Recorded.



y = y + EXP(a)
y = y + a**2
a = 3*z

reverse mode, naive backward sweep:

CALL POP(a)
zb = zb + 3*ab
ab = 0.0
CALL POP(y)
ab = ab + 2*a*yb
CALL POP(y)
ab = ab + EXP(a)*yb

reverse mode, backward sweep with TBR:

CALL POP(a)
zb = zb + 3*ab
ab = 0.0
ab = ab + 2*a*yb
ab = ab + EXP(a)*yb



Snapshots

Taking small snapshots saves a lot of memory:

[Figure: a checkpointed piece C followed by a downstream piece D along the time axis.]

Snapshot(C) ⊆ Use(C) ∩ (Write(C) ∪ Write(D))


8 The TAPENADE AD tool


A word on TAPENADE

Automatic Differentiation Tool

Name: tapenade, version 2.1
Date of birth: January 2002
Ancestors: Odyssee 1.7
Address: www.inria.fr/tropics/tapenade.html

Specialties: AD Reverse, Tangent, Vector Tangent, Restructuration
Reverse mode Strategy: Store-All, Checkpointing on calls
Applicable on: fortran95, fortran77, and older
Implementation Languages: 90% java, 10% c

Availability: Java classes for Linux and Windows, or Web server

Internal features: Type-Checking, Read-Written Analysis, Fwd and Bwd Activity, Adjoint Liveness analysis, TBR, ...



TAPENADE on the web

http://www-sop.inria.fr/tropics

applied to industrial and academic codes: Aeronautics, Hydrology, Chemistry, Biology, Agronomy, ...



TAPENADE Architecture

Use a general abstract Imperative Language (IL)

Represent programs as Call Graphs of Flow Graphs

Front-ends, producing IL trees:
Fortran77 parser (C)
Fortran95 parser (C)
C parser (C)
XXX parser
Black-box signatures

Core: Imperative Language Analyzer (Java), Differentiation Engine (Java), User Interface (Java / XHTML), API for other tools.

Back-ends, consuming IL trees:
Fortran77 printer (Java)
Fortran95 printer (Java)
C printer
XXX printer


9 Validation of AD results


Validation methods

From a program P that evaluates

F : IR^m → IR^n, X ↦ Y

tangent AD creates

Ṗ : X, Ẋ ↦ Y, Ẏ

and reverse AD creates

P̄ : X, Ȳ ↦ X̄

How can we validate these programs?

Tangent wrt Divided Differences

Reverse wrt Tangent


Validation of Tangent wrt Divided Differences

For a given Ẋ, set g(h ∈ IR) = F(X + h·Ẋ):

g'(0) = lim_{ε→0} (F(X + ε·Ẋ) − F(X)) / ε

Also, from the chain rule:

g'(0) = F'(X) · Ẋ = Ẏ

So we can approximate Ẏ by running P twice, at points X and X + ε·Ẋ.



Validation of Reverse wrt Tangent

For a given Ẋ, the tangent code returned Ẏ.

Initialize Ȳ = Ẏ and run the reverse code, yielding X̄. We have:

(X̄ · Ẋ) = (F'ᵗ(X) · Ȳ · Ẋ)
        = Ȳᵗ × F'(X) × Ẋ
        = Ȳᵗ × Ẏ
        = (Ȳ · Ẏ)

Often called the “dot-product test”
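A sketch of the dot-product test on the FOO example (Python transcriptions of the tangent and reverse codes; the seed values are arbitrary):

```python
def foo_d(v1, v1d, v2, v2d, p1):
    # Tangent code: Ydot from the Xdot components (v1d, v2d).
    v3d = 2.0 * v1d
    v3 = 2.0 * v1 + 5.0
    v4d = v3d * (1 - p1 * v2 / (v3 * v3)) + v2d * p1 / v3
    return v3 + p1 * v2 / v3, v4d

def foo_b(v1, v2, p1, v4b):
    # Reverse code: Xbar components (v1b, v2b) from Ybar = v4b.
    v3 = 2.0 * v1 + 5.0
    v2b = p1 * v4b / v3
    v3b = (1 - p1 * v2 / (v3 * v3)) * v4b
    v1b = 2.0 * v3b
    return v1b, v2b

v1d, v2d, v4b = 0.3, -1.2, 0.8      # arbitrary seeds
_, v4d = foo_d(1.0, v1d, 3.0, v2d, 0.5)
v1b, v2b = foo_b(1.0, 3.0, 0.5, v4b)

lhs = v1b * v1d + v2b * v2d         # (Xbar . Xdot)
rhs = v4b * v4d                     # (Ybar . Ydot)
```

Up to round-off, lhs and rhs must match for any choice of seeds; a mismatch points to a bug in one of the two differentiated codes.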


10 Conclusion


AD: Context

[Figure: decision tree. DERIVATIVES → Divided Differences (inaccuracy) or Analytic Differentiation → by hand (Maths: programming, control) or AD → Overloading (flexibility) or Source Transformation (efficiency) → Multi-directional, Tangent, Reverse.]



AD: To Bring Home

If you want the derivatives of an implemented math function, you should seriously consider AD.

Divided Differences aren't good for you (nor for others...)

Especially think of AD when you need higher order (Taylor coefficients) for simulation, or gradients (reverse mode) for sensitivity analysis or optimization.

Reverse AD is a discrete equivalent of the adjoint methods from control theory: it gives a gradient at remarkably low cost.



AD tools: To Bring Home

AD tools provide you with highly optimized derivative programs in a matter of minutes.

AD tools are making progress steadily, but the best AD will always require end-user intervention.
