Course notes on Financial Mathematics

Jack D. Cowan
Department of Mathematics,

The University of Chicago, 5734 S. Univ. Ave.,

Chicago, Illinois 60637

June 2, 2003


Abstract

These notes provide an informal introduction to the mathematics of option pricing. They are at present largely based on the books Options Markets by Cox & Rubinstein, The Mathematics of Financial Derivatives: A Student Introduction, by Wilmott, Howison & Dewynne, Handbook of Stochastic Methods, by Gardiner, and Path Integrals in Chemistry, Physics and Biology by Wiegel. However, over the course of this academic year they will gradually evolve into an independent set of notes dealing with many new topics.


Contents

1 Derivatives
  1.1 Options
    1.1.1 European call options
    1.1.2 European put options
    1.1.3 Put-call parity
    1.1.4 Types of options
    1.1.5 How to read the financial pages
    1.1.6 What good are options?
  1.2 Other financial instruments
  1.3 Interest rates
    1.3.1 Discounting
  1.4 Another way to compute the value of an option
  1.5 Statistics of the Market

2 Random walks and Markov processes
  2.1 Binomial random walks
    2.1.1 The central limit theorem
  2.2 More on binomial random walks
    2.2.1 The central limit of a binomial random walk
  2.3 Moments of the return on S
  2.4 Gaussian Markov processes

3 Stochastic calculus
  3.1 The Wiener–Bachelier process
  3.2 The Ito calculus
    3.2.1 Rules for stochastic differentiation
    3.2.2 Ito's lemma
    3.2.3 From stochastic differential equations to the Fokker–Planck equation
    3.2.4 Examples

4 The Fokker–Planck equation
  4.1 Boundary conditions
  4.2 Forward and backward equations
    4.2.1 Boundary conditions for the backward equation
  4.3 Stationary solutions
  4.4 Non–stationary solutions. Eigenvalues and Eigenfunctions.
    4.4.1 Eigenfunction expansions
  4.5 Examples
    4.5.1 Finite domains

5 Distributions and Green's functions
  5.1 The derivative of a distribution
  5.2 Green's functions
    5.2.1 Some symmetries of solutions of the Wiener–Bachelier equation
    5.2.2 Further properties of solutions of the Wiener–Bachelier equation
  5.3 The relationship between Green's functions and eigenfunctions

6 Path integrals
  6.1 How to compute path integrals
    6.1.1 The spectral representation
    6.1.2 The cell representation
  6.2 The connection between the path integral and the Wiener–Bachelier equation
  6.3 A generalization

7 The Feynman–Kac formula
  7.1 The backward equation
    7.1.1 Extension to non–zero µ
  7.2 More general Markov processes
  7.3 An alternative derivation

8 Options
  8.1 Deriving the Black–Scholes equation
    8.1.1 More frequent trading
  8.2 Martingales
    8.2.1 Martingales and options
  8.3 Girsanov's Theorem
  8.4 Solving the Black–Scholes equation
  8.5 Another derivation of the Black–Scholes equation
  8.6 The Greeks
  8.7 Hedging
    8.7.1 Delta hedging
  8.8 Implied volatility
  8.9 Dividends
    8.9.1 Effects of dividends on boundary conditions

9 American options
  9.1 Boundary conditions for American options
  9.2 The obstacle problem
  9.3 Dividends
  9.4 A local analysis of the free boundary

10 Dividends revisited
  10.1 Fixed dividend payments
  10.2 Jump conditions
  10.3 An alternative derivation of the jump condition
  10.4 The meaning of jump conditions

11 A generalization
  11.1 Interest rate and volatility known functions of time
  11.2 Trading volatility

12 Exotic options
  12.1 A unifying framework
  12.2 Discrete sampling
  12.3 Barrier options
  12.4 Asian options
    12.4.1 Continuously sampled averages
    12.4.2 Geometric averaging
    12.4.3 Discretely sampled averages and jump conditions
    12.4.4 Similarity reductions for arithmetic Asian options
    12.4.5 The continuously sampled average strike option
    12.4.6 Put–call parity for the European average strike option
    12.4.7 The American average strike option
    12.4.8 Average strike foreign exchange options
    12.4.9 Average rate options
    12.4.10 Geometric averaging and discrete sampling

13 Bond pricing
  13.1 Another derivation of the bond equation
    13.1.1 The yield curve
    13.1.2 Stochastic interest rates
    13.1.3 The bond equation
    13.1.4 The market price of risk
  13.2 Solving the bond pricing equation
    13.2.1 Fitting the parameters
  13.3 Interest rate products
    13.3.1 Bond options
    13.3.2 Swaps and caps
    13.3.3 Swaptions, captions, and floortions
  13.4 Convertible bonds
    13.4.1 Call and put aspects of convertible bonds
    13.4.2 Convertible bonds with random interest rates
    13.4.3 The issue of new shares

14 Transaction costs
    14.0.4 A modified Black–Scholes equation
    14.0.5 Portfolios of options

15 Time Series
  15.1 Linear systems
    15.1.1 Time domain analysis
    15.1.2 Frequency domain analysis
    15.1.3 Gain and phase diagrams
  15.2 Stochastic processes
    15.2.1 Stationary processes
    15.2.2 Spectral analysis
  15.3 Linear systems identification
    15.3.1 Frequency domain methods
    15.3.2 Time domain methods
  15.4 State–space models
    15.4.1 Parameter estimation
    15.4.2 The Kalman filter

16 Neural nets
  16.1 Logic, computation, and McCulloch–Pitts nets
    16.1.1 Truth tables
    16.1.2 Venn and Peirce diagrams
    16.1.3 The Hilbert–Ackerman theorem
    16.1.4 Gödel's theorem and Turing machines
    16.1.5 McCulloch–Pitts nets
  16.2 Perceptrons and Adalines
    16.2.1 The Perceptron training algorithm
    16.2.2 The Adaline training algorithm
    16.2.3 Comparison of Perceptrons and Adalines
    16.2.4 Some problems with the Adaline algorithm
    16.2.5 A variation of the Adaline algorithm
    16.2.6 The 'XOR' problem
    16.2.7 Why are Perceptrons and Adalines useful?
  16.3 Backpropagation
    16.3.1 Feedforward nets
    16.3.2 Recurrent nets
    16.3.3 The Williams–Zipser algorithm
    16.3.4 Another derivation of the Williams–Zipser algorithm
  16.4 Unsupervised learning
    16.4.1 Principal Component Analysis
    16.4.2 A neural net implementation of PCA
    16.4.3 Independent Component Analysis
    16.4.4 Blind source separation
  16.5 Neural nets in finance
    16.5.1 Some general remarks
    16.5.2 Pre–processing the data
    16.5.3 Training and testing
    16.5.4 General guidelines
    16.5.5 An example of financial forecasting
    16.5.6 Other applications of neural nets and related algorithms to finance
    16.5.7 Learning the Black–Scholes formula


Chapter 1

Derivatives

1.1 Options

One common example of a derivative or derivative security is the European call option. This is an option to buy shares of an asset (the underlying) at a prescribed future expiration date or expiry for an agreed amount (the exercise or strike price). The purchaser or holder of such an option has a right but not an obligation to buy the shares at expiry. Conversely, the seller or writer of the option is obligated to sell the shares to the buyer if he chooses to exercise the option. Such an option has some value since it gives its holder a right with no obligation. So the holder must compensate the writer, who has the obligation. This raises two key questions:

• What is the value of the call?

• In what ways can its writer minimise the risk associated with the call?

A great deal of the mathematics of option pricing is concerned with these questions.

1.1.1 European call options

Consider the following simplified example (see Cox & Rubinstein 1985). Suppose that today (October 5, 1998) one share of the asset S is quoted at $50. A call C on S is available with a strike price E of $50 on December 18, 1998. It is also possible to borrow money at an interest rate of 25% over the period


between now and expiry. What is the value of C? It turns out that this can be uniquely determined provided one critical condition holds, namely:

• Opportunities to make instantaneous risk-free profits do not exist.

This is known as the arbitrage condition. It can be applied to the above situation as follows: write 3 calls on S at $C each and buy 2 shares of S at $50 each. Thus the share purchase risk is hedged by the writing of call options on the underlying: if the share price falls the option will be worthless, since its holder will not buy shares at double their quoted value, so its writer profits. Now leverage the purchase by borrowing $40 to be repaid at expiry. Let S at expiry be denoted by S′ and suppose that it is quoted at $25 in one scenario, and at $100 in another. Consider the following arbitrage table:

                 now     expiry, S′ = 25    expiry, S′ = 100
 write 3 calls   3C      0                  −150
 buy 2 shares    −100    50                 200
 borrow money    40      −50                −50

Table 1.1: Arbitrage conditions for a call

In the first scenario the call is worthless at expiry, so the writer profits. In the second scenario the writer must sell 3 shares at half their quoted value, so he loses money. However it is evident that the leveraged hedge has minimised the writer's risk: in either scenario the net loss (or profit) at expiry is $0. It follows from the arbitrage condition that the hedge must also show zero net loss (or profit) now, thus 3C − 100 + 40 = 0, whence C = $20.
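The following short Python sketch (not part of the original argument; the function names and figures are those of the example above, chosen for illustration only) checks that the hedged portfolio is riskless at expiry and then solves the zero-value-now condition for C.

    # Sketch of the no-arbitrage argument for the call example above.
    # Portfolio: write 3 calls at $C, buy 2 shares at $50, borrow $40 (repaid with 25% interest).

    def call_payoff(s_expiry, strike):
        return max(s_expiry - strike, 0.0)

    def portfolio_at_expiry(s_expiry, strike=50.0, borrowed=40.0, rate=0.25):
        calls = -3 * call_payoff(s_expiry, strike)   # writer's side of 3 calls
        shares = 2 * s_expiry                        # 2 shares held
        loan = -borrowed * (1 + rate)                # repay loan with interest
        return calls + shares + loan

    # The hedge is riskless: same value in both expiry scenarios.
    assert portfolio_at_expiry(25.0) == portfolio_at_expiry(100.0) == 0.0

    # Zero value now as well: 3C - 100 + 40 = 0  =>  C = 20.
    C = (100.0 - 40.0) / 3
    print(C)   # 20.0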

Three conclusions can be drawn from this example:

  • An appropriate leveraged position in stock can replicate the future returns of a call.

  • To determine C one needs to know the values of S, S′ = S ± dS, E, the interest rate r, and the time to expiry T.

  • One does NOT need the probabilities p and q for S′ → S ± dS, but one does need to know the range of variation or volatility of S.


1.1.2 European put options

Another common derivative security is the European put option. This is similar to the European call option except that the holder has the right to sell shares of an asset at expiry at a fixed strike price. The writer of such a put is then obliged to buy the shares at the strike price. Consider a situation similar to the one described earlier, in which an asset S is quoted today at $50. A put P on S is available with a strike price E of $50 on December 18, 1998. What is the value of P? Buy 6 puts on S at $P each and 2 shares of S at $50 each. Leverage the purchase by borrowing $160 to be repaid at expiry. Suppose that S′ is quoted at $25 in one scenario, and at $100 in another. Now consider the following arbitrage table:

                 now     expiry, S′ = 25    expiry, S′ = 100
 buy 6 puts      −6P     150                0
 buy 2 shares    −100    50                 200
 borrow money    160     −200               −200

Table 1.2: Arbitrage conditions for a put

In the first scenario the puts are worth $150 since the holder can sell shares to the writer at a profit of $25 per share. Conversely, in the second scenario the puts are worthless since the holder will not sell shares at half their quoted value. However it is clear that once again the leveraged hedge has minimised the writer's risk: in either scenario the net loss (or profit) at expiry is $0. Evidently the arbitrage condition requires that −6P − 100 + 160 = 0, so that P = $10. Once again P depends only on S, S′, E, r, T, and the volatility of S.

1.1.3 Put-call parity

It follows from the above that C − P − S = $20 − $10 − $50 = −$40. But $40 is just E exp(−rT), the strike price discounted by the interest to be paid at expiry. Thus we can write the equation

C − P = S − E exp(−rT)    (1.1)


for the put-call relationship. This is called put–call parity. It can also be derived from the following leveraged hedge: buy a put at $P and a share of stock at $S, write a call at $C and borrow $E exp(−rT). Consider the following (symbolic) arbitrage table:

                 now             expiry, S′ ≤ E    expiry, S′ > E
 write 1 call    C               0                 E − S′
 buy 1 put       −P              E − S′            0
 buy 1 share     −S              S′                S′
 borrow money    E exp(−rT)      −E                −E

Table 1.3: Arbitrage conditions for a put and a call

Evidently the arbitrage condition leads directly to eqn. (1.1). Note that we have incorporated in this table the call and put values at expiry, namely:

C = max(S ′ − E, 0), P = max(E − S ′, 0) (1.2)

At expiry it follows that C − P = max(S′ − E, 0) − max(E − S′, 0) = S′ − E, as required by eqn. (1.1).
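As a quick numerical sanity check (a sketch, not from the notes), the expiry identity and the parity relation can be verified with the figures of the example; the discount factor 1/1.25 is the 25% period rate used above.

    def call(s, e): return max(s - e, 0.0)
    def put(s, e):  return max(e - s, 0.0)

    E = 50.0
    # Identity at expiry: C - P = S' - E for any terminal price S'.
    for s_prime in (25.0, 50.0, 100.0):
        assert call(s_prime, E) - put(s_prime, E) == s_prime - E

    # Parity "now" with the example's numbers: C = 20, P = 10, S = 50,
    # and E*exp(-rT) = 50/1.25 = 40 over the period.
    C, P, S, disc_E = 20.0, 10.0, 50.0, E / 1.25
    print(C - P, S - disc_E)   # both are 10.0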

1.1.4 Types of options

So far we have considered only simple European calls and puts, what are known as vanilla options. However there exist many other options. A much more commonly traded one is the American option. Such an option is similar to a European option except that it can be exercised at any time up to expiry. This property makes such options more difficult to price than European ones. As we will see, to price American options one needs to solve a free boundary problem.

1.1.5 How to read the financial pages

Figure 1.1 shows some listed options quotations from the September 28, 1998 issue of the Wall Street Journal.


Figure 1.1: Option price quotes from the Wall Street Journal, September 25, 1998.


The table shows the composite volume and closing prices for actively traded equities and for some indexes. Each row of the table lists the last quoted price of a share of stock on Friday, September 25, 1998 and the corresponding strike price, expiry, and the latest quoted prices for calls and puts.

Consider for example the prices quoted for Apple Computer options, reproduced in the following table:

 Option    Strike    Expiry    Call Last    Put Last
 AppleC    37 1/2    Oct       3 1/8        1 13/16
 AppleC    40        Oct       2 1/16       3 1/4

Table 1.4: Calls and Puts on Apple Computer, Sept. 28, 1998

Note that since a call permits the holder to buy shares at the strike price, the call with strike price of $40 per share is cheaper than the one with strike price of $37 1/2; and conversely for the puts, since the holder of a put with $40 strike price can make a bigger profit by selling the share at the strike price than can the holder of the put with $37 1/2 strike price.

Another example is the NASDAQ–100 (NDX) index option traded on the Chicago Board Options Exchange (CBOE), a subset of which is shown in the following table:

Although the index is just a number, the contract based on it is given a nominal price in dollars equal to ten times the index. Figure 1.2 shows a plot of call price quotes against strike prices for the October expiry. Superimposed on this plot is the payoff function at this expiry, max(I′ − E, 0), with I′ the index value at expiry taken to be 1373.525, the average value of C − P + E over the first five quotes. This is less than 1390.09, the closing value of the index on September 25. We could have chosen this value for I′, but put-call parity at expiry leads to the inequality C ≥ I′ − E, which would be violated by the first entry in the table.


 Expiry      Strike    Call       Put
 October     1200      180        8 1/2
 October     1280      110 3/4    16 7/8
 October     1320      84         39 1/4
 October     1360      61         56
 October     1400      44         51 1/2
 November    1400      76         −
 December    1400      66         −
 November    1460      46         −
 October     1480      12         −
 November    1500      31         −
 October     1560      2 13/16    −

Table 1.5: Calls and Puts on the NASDAQ–100 index
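A small check (not part of the notes) reproduces the value of I′ quoted above from the first five October rows of Table 1.5:

    # First five October quotes from Table 1.5: (strike E, call C, put P).
    quotes = [
        (1200, 180.0,       8 + 1/2),
        (1280, 110 + 3/4,  16 + 7/8),
        (1320,  84.0,      39 + 1/4),
        (1360,  61.0,      56.0),
        (1400,  44.0,      51 + 1/2),
    ]

    # Put-call parity at expiry, C - P = I' - E, suggests estimating the
    # implied index value at expiry as the average of C - P + E.
    i_prime = sum(c - p + e for e, c, p in quotes) / len(quotes)
    print(i_prime)   # 1373.525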

Figure 1.2: Nasdaq Call Prices and Payoff Function vs. Strike Prices


1.1.6 What good are options?

It follows from all this that options can be used to hedge, i.e. to minimise the risk involved in buying and selling equities. They can also be used simply to speculate. The advantage over simply speculating with equities is that greater leverage can be exerted, since options usually cost some small fraction of the price of the underlying.

1.2 Other financial instruments

In addition to options there also exist other contingent claims, namely forward and futures contracts. A forward contract is an agreement between two parties in which one contracts to buy an asset from the other for a predetermined or forward price at a specific future delivery or maturity date. It costs nothing to enter into a forward contract, but the two parties are obligated to carry out the transaction at maturity. A futures contract is very similar to a forward contract except that futures are usually traded on an exchange which regulates their scale and delivery dates. There is also a margin requirement: the value of a futures contract is evaluated daily, and any change in value is paid by one party to the other. This protects the parties from default. Since there is no choice in such contracts they are easier to price than options. In fact when interest rates are predictable, futures and forward contracts coincide and a simple calculation determines their value.

1.3 Interest rates

In these early notes (following Wilmott et al.) it will be assumed that the short–term bank deposit interest rate is a known function of time r(t). This is reasonable since options on equities commonly last about nine months, during which time changes in r are not usually large enough to affect option prices significantly, i.e., by more than 2%.

1.3.1 Discounting

In pricing options one needs to know the present value or discounting of bank loans etc. (we have already used this in establishing put–call parity).


Assuming a continuously compounded interest rate r(t), a bank deposit M(t) grows exponentially at the rate

dM/dt = r(t) M(t)    (1.3)

the solution of which is

M(t) = c exp( ∫^t r(s) ds )

Now let the future value of M at time T be M(T) = E. Then

M(t) = E exp( −∫_t^T r(s) ds )    (1.4)

Evidently if r(t) = r, a constant, then M(t) reduces to E exp(−r(T − t)).
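A minimal numerical sketch of eqn. (1.4), assuming nothing beyond the formula itself (the quadrature scheme and function names are illustrative):

    # Sketch: present value E*exp(-integral of r(s) ds from t to T) for a
    # time-varying rate, using a simple trapezoidal quadrature.
    import math

    def present_value(E, r, t, T, steps=1000):
        """Discount a payment E due at time T back to time t under rate r(s)."""
        h = (T - t) / steps
        integral = 0.0
        for i in range(steps):
            s0, s1 = t + i * h, t + (i + 1) * h
            integral += 0.5 * (r(s0) + r(s1)) * h
        return E * math.exp(-integral)

    # A constant rate reproduces E*exp(-r(T-t)).
    print(present_value(50.0, lambda s: 0.25, 0.0, 1.0))            # ~38.94 = 50*exp(-0.25)
    print(present_value(50.0, lambda s: 0.02 + 0.01 * s, 0.0, 1.0)) # time-varying example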

1.4 Another way to compute the value of an option

We can in fact compute the value of an option without reference to arbitrage, although we need to look at Table 1.1. From this we can see that, from the point of view of the holder of a single call, its price can be obtained by averaging the payoff at expiry with respect to the two possible outcomes, each assumed to be equally probable, and then discounting the result with respect to the interest rate over the period of the contract. Thus C = (½ · $0 + ½ · $50)/(1 + ¼) = $20. Similarly, with respect to Table 1.2, the value of a put to the holder should be P = (½ · $25 + ½ · $0)/(1 + ¼) = $10. Let Λ(S) be a general payoff function for an option whose price is V at time τ, and let P(S′, T | S, τ) be the probability distribution of stock prices at expiration, given the value S at time τ. Then this calculation suggests the general result:

V = exp(−r(T − τ)) ∫_{S′} P(S′, T | S, τ) Λ(S′) dS′ = exp(−r(T − τ)) <Λ(S′)>    (1.5)

where <Λ(S′)> is the expectation or average of the payoff function Λ over the distribution of stock prices. This result is known as the Feynman-Kac formula. It follows that if we know or can estimate the movement of stock prices, i.e., if we know the transition probability density P(S′, T | S, τ), then we can compute the price of any option. We derive the Feynman-Kac formula in a later section.
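A discrete sketch of eqn. (1.5) in the two-state setting of Tables 1.1 and 1.2 (illustrative code, not part of the derivation): discounting the equally weighted expected payoff reproduces C = $20 and P = $10.

    def discounted_expectation(payoffs, probs, discount):
        """V = discount * sum_i p_i * payoff_i  -- a discrete version of eqn. (1.5)."""
        return discount * sum(p * pay for p, pay in zip(probs, payoffs))

    disc = 1 / (1 + 0.25)            # 25% interest over the period
    probs = [0.5, 0.5]               # S' = 25 or S' = 100, equally probable
    E = 50.0

    call_payoffs = [max(s - E, 0.0) for s in (25.0, 100.0)]   # [0, 50]
    put_payoffs  = [max(E - s, 0.0) for s in (25.0, 100.0)]   # [25, 0]

    print(discounted_expectation(call_payoffs, probs, disc))  # 20.0
    print(discounted_expectation(put_payoffs, probs, disc))   # 10.0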

1.5 Statistics of the Market

The foundations of modern option pricing go back to 1900, when Bachelier in a paper entitled Theory of Speculation introduced the idea that investing in stocks is a form of speculation or gambling in a fair game, and that stock prices follow a random walk (Bachelier 1900). Shortly after this Einstein (1908) analyzed such random walks as a model for Brownian motion. This idea when applied to the return on a stock

dS/S = d(log S)    (1.6)

underlies the well–known work of Black & Scholes (1973) and Merton (1973). In what follows a standardized variable

z = (S − µ)/σ    (1.7)

where µ and σ are, respectively, the mean and standard deviation of S, will often be used. Evidently z has zero mean and unit variance.

The Black-Scholes-Merton model implies that the increments d(log S) are independent, identically distributed Gaussian random variables. In fact, if one considers each change in log S to be the result of the sum of many small independent random contributions generated by various market factors, i.e., if the market is liquid, then the Central Limit Theorem leads naturally to the Gaussian hypothesis. But consider figure 1.3, in which φ(dx(T)), the distribution of increments dx(T) = x(t + T) − x(t) in US $/Deutschemark exchange rate futures (sampled every 5 minutes), is plotted. This increment distribution differs markedly from a Gaussian in that it has "heavy" or "fat" tails and, relatively speaking, a sharper peak. A measure of this is the kurtosis, κ = <dx⁴>/<dx²>² − 3. It is easy to show that if φ is Gaussian then <dx⁴> = 3<dx²>², so that κ = 0. The distribution plotted in fig. 1.3 has κ = 74. Other observed values for increments sampled at 5 minute intervals are κ = 60 (US$/Swiss Franc exchange rate futures) and κ = 16 (S&P 500 index futures). Similar results obtain for Dow Jones


Figure 1.3: Probability density of USD/Deutschemark exchange rate futures. The lower curve is a Gaussian with the same mean and variance [redrawn from Cont et al. 1997].


industrials and twenty year US T–bond yields. Using a Gaussian density for such data systematically underestimates the probability of large price fluctuations. It is also the case that in liquid markets the autocorrelation function c(T) = <dx(t) dx(t + T)> rapidly decays to zero in a few minutes: if the lag T is greater than about 15 minutes, then c is effectively zero.

"! #

"! #

"!

"! #

"! #

Figure 1.4: Autocorrelation function of price increments for the US$/Yen ex-change rate [redrawn from Cont et.al. 1997.]

Fig. 1.4 shows this effect, which is often cited as evidence in favor of the efficient market hypothesis, which holds that

• past history is fully reflected in the current stock price

• markets respond very quickly to new information

• arbitrage opportunities quickly disappear

Thus if price changes were to exhibit significant correlation this could be used for arbitrage. But such a strategy would tend to reduce the correlation, except at very short time scales representing the time the market takes to react to new information. This time lag is of the order of a few minutes for organized futures markets, and even smaller for foreign exchange markets.
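The following sketch (not from the notes) shows how the excess kurtosis κ and the autocorrelation c(T) quoted above would be estimated from a series of increments; it is applied here to synthetic Gaussian data, for which κ ≈ 0 and c(T) ≈ 0 at all positive lags, in contrast to the market data of figs. 1.3 and 1.4.

    # Sketch: sample estimates of kappa = <dx^4>/<dx^2>^2 - 3 and of
    # c(T) = <dx(t) dx(t+T)>, applied to synthetic Gaussian increments
    # (real tick data would replace `dx`).
    import numpy as np

    rng = np.random.default_rng(0)
    dx = rng.normal(0.0, 1.0, size=100_000)      # stand-in for 5-minute increments

    def kurtosis(x):
        x = x - x.mean()
        return np.mean(x**4) / np.mean(x**2)**2 - 3.0

    def autocorr(x, lag):
        x = x - x.mean()
        return np.mean(x[:-lag] * x[lag:]) / np.mean(x**2)

    print(kurtosis(dx))                  # ~0 here; ~74 was quoted for DM futures
    print([round(autocorr(dx, k), 3) for k in (1, 2, 3)])   # ~0 at all lags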


Given that price increments are uncorrelated, are they also independent, so that <dx(t) dx(t + T)> = <dx(t)><dx(t + T)>? Fig. 1.5 shows the autocorrelation function of the square of the increments, defined as

g(T) = ( <dx²(t) dx²(t + T)> − 1 ) / ( <dx⁴(t)> − 1 ),

for S&P 500 futures.

!#"$ &%

')(*+, *.- $

Figure 1.5: Autocorrelation function of the square of price increments in S&P500index futures [redrawn from Cont et.al. 1997.]

It will be seen that this function decays very slowly to zero, with a long tail. This is related to the fact that whereas the variance σ²(T) = <dx(T)²> scales linearly with T, the kurtosis scales not as T^{-1} (as it would do for identical independent random variables), but as T^{-0.5}, as seen in fig. 1.6:

These facts suggest that the representation of market price changes as a random walk has limits, and that one needs to incorporate the effects of nonlinearity and nonstationarity to give a better account of volatility. In what follows we first look at the random walk model, i.e. at the Black-Scholes-Merton model, as a zeroth approximation to market price changes. Following this we look at models that incorporate some of the properties described above.


!" !

# $%& %'( )$*+-, %$* )./&)./ 0') %

Figure 1.6: Scaling behavior of the kurtosis of price increments in S&P500 indexfutures [redrawn from Cont et.al. 1997.]


Chapter 2

Random walks and Markov processes

2.1 Binomial random walks

We start by considering stock price changes over an interval or period τ. Suppose S(t) changes to S′(t + τ) = uS(t) with probability p, or to dS(t) with probability q = 1 − p, where u > 1 and d < 1. Since d(uS) = u(dS), it follows that after n intervals or periods the return S′/S can take the values u^n, du^{n−1}, d²u^{n−2}, ..., d^{n−1}u, d^n. So if we index the intervals by j = 0, 1, 2, ..., n, the j-th value is u^j d^{n−j} and this occurs with probability

p_j = (n choose j) p^j q^{n−j}    (2.1)

where p + q = 1 and p_0 + p_1 + ··· + p_n = 1. Following Feller (1950) we write p_j as b(j, n; p). It follows that

P[S′/S ≥ u^a d^{n−a}] = Σ_{j=a}^{n} b(j, n; p) = Φ(a, n; p)    (2.2)

the (complementary) binomial distribution. Eqn. 2.1 is called Bernoulli's formula, and the process as a whole is called the Bernoulli trials model. Figure 2.1 shows a typical binomial density b(j, 15, 0.5).

Figure 2.1: The Bernoulli density b(j, 15, 0.5)

It will be seen that although this density is discrete, its envelope consists of points which lie on a bell-shaped curve. It is easy to compute the mean and variance of this density. Thus

µ = <j> = Σ_{j=0}^{n} j b(j, n; p) = np    (2.3)

and

σ² = <(j − µ)²> = <j²> − µ² = npq.    (2.4)
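A short sketch (illustrative only) computing b(j, n; p) of eqn. (2.1) and checking the mean np and variance npq of eqns. (2.3)-(2.4):

    # The binomial density and its first two moments.
    from math import comb

    def b(j, n, p):
        return comb(n, j) * p**j * (1 - p)**(n - j)

    n, p = 15, 0.5
    density = [b(j, n, p) for j in range(n + 1)]

    mean = sum(j * density[j] for j in range(n + 1))
    var = sum((j - mean)**2 * density[j] for j in range(n + 1))
    print(abs(mean - n * p) < 1e-12, abs(var - n * p * (1 - p)) < 1e-12)   # True True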

There is also an interesting limit theorem due to Bernoulli, namely that

lim_{n→∞} P[ |j/n − p| > ε ] = 0.    (2.5)

This is one form of the law of large numbers. It tells us that the probability that j/n differs from p by more than any fixed ε > 0 tends to zero as the number of intervals n increases. It also validates the intuitive notion that j/n, the frequency of positive increments in S, provides an estimate of the probability p of the occurrence of a positive increment in any one interval.

2.1.1 The central limit theorem

There is also another limit theorem of even more relevance to our analysis. This is the De Moivre-Laplace or central limit theorem, which tells us that, given certain conditions to be defined below, the binomial density can be approximated by a suitable Gaussian density. Let

φ(x) = (1/√(2π)) exp(−x²/2)

then

Φ(x) = ∫_{−∞}^{x} φ(y) dy    (2.6)

is the normal or Gaussian distribution. Now let

h = 1/√(npq),    x_j = (j − np)/√(npq)

and suppose α and β are such that h x_α³ and h x_β³ tend to zero as n → ∞. Then

lim_{n→∞} P[α ≤ j ≤ β] = Φ(x_{β+1/2}) − Φ(x_{α−1/2}).    (2.7)

There is a weaker form which states that for every fixed a < b,

lim_{n→∞} P[a ≤ x_j ≤ b] = Φ(b) − Φ(a)    (2.8)
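A numerical illustration of eqn. (2.7) (a sketch, with arbitrarily chosen n, p, α, β):

    # Compare the exact binomial probability P[alpha <= j <= beta]
    # with the De Moivre-Laplace Gaussian estimate of eqn. (2.7).
    from math import comb, erf, sqrt

    def b_pmf(j, n, p):
        return comb(n, j) * p**j * (1 - p)**(n - j)

    def Phi(x):                       # standard normal distribution function
        return 0.5 * (1 + erf(x / sqrt(2)))

    n, p = 400, 0.5
    mu, sd = n * p, sqrt(n * p * (1 - p))
    alpha, beta = 190, 210

    exact = sum(b_pmf(j, n, p) for j in range(alpha, beta + 1))
    approx = Phi((beta + 0.5 - mu) / sd) - Phi((alpha - 0.5 - mu) / sd)
    print(exact, approx)              # both ~0.706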

2.2 More on binomial random walks

Consider now the random walk shown in figure 2.2:

Figure 2.2: A simple random walk

Evidently the return S′/S = u³d², so that log(S′/S) = 3 log u + 2 log d. More generally

log(S′/S) = j log u + (n − j) log d = j log(u/d) + n log d.    (2.9)


So

<log(S′/S)> = <j> log(u/d) + n log d

and

var[log(S′/S)] = var[j] (log(u/d))².

It follows from eqns. 2.3 and 2.4 that

<log(S′/S)> = [p log(u/d) + log d] n = µn    (2.10)

and

var[log(S′/S)] = pq (log(u/d))² n = σ²n,    (2.11)

so both µ and σ² depend only on p, u and d.

2.2.1 The central limit of a binomial random walk

Now assume the total duration of the walk is t < ∞. Thus t = nτ. Let n → ∞, then τ = t/n → 0. What about p, u and d? We can choose them so that µτ⁻¹ and σ²τ⁻¹ remain finite as τ → 0. Let

u = exp(σ√τ),    d = exp(−σ√τ),    p = ½(1 + (µ/σ)√τ)    (2.12)

It follows that

µτ⁻¹ = [ ½(1 + (µ/σ)√τ) · 2σ√τ − σ√τ ] τ⁻¹ = µ

and

σ²τ⁻¹ = [ ½(1 + (µ/σ)√τ) · ½(1 − (µ/σ)√τ) · 4σ²τ ] τ⁻¹ = σ² − µ²τ = σ² − O(τ),

and that

lim_{n→∞} µn = lim_{τ→0} (µτ)(t/τ) = µt;    lim_{n→∞} σ²n = lim_{τ→0} (σ²τ)(t/τ) = σ²t.    (2.13)

Note also that eqns. 2.9, 2.10 and 2.11 imply that

h = 1/√(npq) = log(u/d) (σ√n)⁻¹

and

x_j = (j − np) h = (log(S′/S) − µn) / (σ√n).

Thus, using eqns. 2.12 and 2.13,

h x_j³ = log(u/d) (log(S′/S) − µn)³ / (σ²n)² = 2σ√τ (log(S′/S) − µt)³ / ((σ² − µ²τ) t)² → 0, as τ → 0.

Thus the conditions for the De Moivre–Laplace limit hold, i.e.

Pr[ (log(S′/S) − µn) / (σ√n) ≤ b ] → Φ(b)    (2.14)

i.e., in the limit, log(S′/S) is normally distributed with mean µt and variance σ²t.
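A simulation sketch of this limit (not part of the notes; parameter values are arbitrary): with u, d, p chosen as in eqn. (2.12), the log-return over n periods has mean ≈ µt and variance ≈ σ²t.

    import numpy as np

    mu, sigma, t, n = 0.1, 0.3, 1.0, 10_000
    tau = t / n
    u, d = np.exp(sigma * np.sqrt(tau)), np.exp(-sigma * np.sqrt(tau))
    p = 0.5 * (1 + (mu / sigma) * np.sqrt(tau))

    rng = np.random.default_rng(0)
    ups = rng.binomial(n, p, size=50_000)          # number of up-moves in each walk
    log_return = ups * np.log(u) + (n - ups) * np.log(d)

    print(log_return.mean(), mu * t)               # ~0.1
    print(log_return.var(), sigma**2 * t)          # ~0.09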

2.3 Moments of the return on S

We now introduce the characteristic function

<e^{isx}> = ∫_{−∞}^{∞} e^{isx} P(x) dx = φ(s)    (2.15)

It follows that

<x^r> = (−i ∂/∂s)^r φ(s) |_{s=0},    (2.16)

so that for a normal distribution

φ(s) = exp[isµt − ½s²σ²t]    (2.17)

whence

<x> = µt    (2.18)
<x²> = σ²t + (µt)²
<x³> = 3(σ²t)(µt) + (µt)³
<x⁴> = 3(σ²t)² + 6(µt)²(σ²t) + (µt)⁴.


So if x = log(S′/S) has a normal characteristic function then

<(S′/S)^r> = <(e^x)^r> = <e^{rx}> = φ(−ir) = exp[rµt + ½r²σ²t].    (2.19)

Thus

<log(S′/S)> = µt  →  <S′/S> = exp[µt + ½σ²t]    (2.20)

and conversely

<S′/S> = exp[µt]  →  <log(S′/S)> = µt − ½σ²t    (2.21)

Thus if log(S′/S) is normally distributed with mean µt, then S′/S has mean exp[µt + ½σ²t]; conversely, if S′/S is lognormally distributed with mean exp[µt], then log(S′/S) has mean µt − ½σ²t. This is a key result and is closely related to the Ito calculus of stochastic derivatives.
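A Monte Carlo sketch of eqn. (2.20) (illustrative only; parameters arbitrary): sample x = log(S′/S) from a normal distribution with mean µt and variance σ²t and compare the sample mean of S′/S with exp(µt + ½σ²t).

    import numpy as np

    mu, sigma, t = 0.05, 0.4, 2.0
    rng = np.random.default_rng(1)
    x = rng.normal(mu * t, sigma * np.sqrt(t), size=2_000_000)   # x = log(S'/S)

    print(np.exp(x).mean())                      # Monte Carlo estimate of <S'/S>
    print(np.exp(mu * t + 0.5 * sigma**2 * t))   # exp(0.26) ~ 1.297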

2.4 Gaussian Markov processes

Now consider again the binomial random walk S → uS with probability p, and S → dS with probability q = 1 − p, with ud = 1. Let x = log S; then we can rewrite the random walk as x → x + dx with probability p and x → x − dx with probability q = 1 − p, where dx = log u. Let P(x, t | x_0, t_0) be the probability density for the random walk to reach the value x at time t, starting from the initial value x_0 at time t_0. Then we can define the random walk by the recursion:

P (x, t+ h | x0, t0) = p P (x− log u, t | x0, t0) + q P (x+ log u, t | x0, t0). (2.22)

Taylor expansion of both sides of 2.22 about the point (x, t) gives

[1 + h ∂/∂t + (1/2!) h² ∂²/∂t² + ···] P(x, t | x_0, t_0)
    = [1 + (q − p) log u ∂/∂x + (1/2!) (log u)² ∂²/∂x² + ···] P(x, t | x_0, t_0).

But from eqns. 2.12,

(q − p) log u = −µh,    (log u)² = σ²h.    (2.23)


Thus, equating terms of equal powers of h, we find at O(h) the partial differential equation

∂/∂t P(x, t | x_0, t_0) = ( −µ ∂/∂x + ½σ² ∂²/∂x² ) P(x, t | x_0, t_0).    (2.24)

Equation 2.24 is a Fokker–Planck equation for the Gaussian random process with drift µt and variance σ²t. This limit of the binomial random walk is called the Wiener process, or sometimes the Wiener–Bachelier process. Its solution for the initial condition

P (x, t0 | x0, t0) = δ(x− x0) (2.25)

can be obtained in several ways. One way is to use the characteristic function defined in eqn. 2.15. It then follows from eqns. 2.24 and 2.25 that

φ(s, t_0) = <e^{isx}> = ∫_{−∞}^{∞} e^{isx} δ(x − x_0) dx = e^{isx_0}    (2.26)

and φ(s, t) satisfies

∂/∂t φ(s, t) = (isµ − ½s²σ²) φ(s, t)

so that

φ(s, t) = exp[(isµ − ½s²σ²)(t − t_0)] φ(s, t_0) = exp[isx_0 + (isµ − ½s²σ²)(t − t_0)].    (2.27)

It follows that

P(x, t | x_0, t_0) = (1/2π) ∫_{−∞}^{∞} φ(s, t) e^{−isx} ds
    = (1/√(2πσ²(t − t_0))) exp[ −(x − x_0 − µ(t − t_0))² / (2σ²(t − t_0)) ]    (2.28)

This is a normal distribution with mean x_0 + µ(t − t_0) and variance σ²(t − t_0). Since x = log S, it follows that P(x)dx = P(log S) d log S = P(log S) dS/S.


Thus

P(S, t | S_e, t_0) = (1/(S√(2πσ²(t − t_0)))) exp[ −(log(S/S_e) − µ(t − t_0))² / (2σ²(t − t_0)) ]    (2.29)

the lognormal distribution. The transition probability satisfies the Chapman-Kolmogorov equation:

P(x, t | x_0, t_0) = Σ_{x′(t′)} P(x, t | x′, t′) P(x′, t′ | x_0, t_0),    t ≥ t′ ≥ t_0    (2.30)

This is the Markov property. It says that starting from x_0 at time t_0, the probability of reaching x at time t is path independent. Such a property is obviously true for the random walks considered earlier.

Equation 2.24 can be derived directly from the Chapman-Kolmogorov relation. Let

x(t + h) = x_h    (2.31)

φ(s, t, h) = <exp(is(x_h − x))> = <exp(is Δx)>.    (2.32)

Then φ(s, t, h) and P(x_h, t + h | x, t) form a Fourier pair:

φ(s, t, h) = ∫ dx_h exp(is Δx) P(x_h, t + h | x, t),    (2.33)

P(x_h, t + h | x, t) = (2π)⁻¹ ∫ ds exp(−is Δx) φ(s, t, h).    (2.34)

Note also that:

φ(s, t, h) = 1 + Σ_{r=1}^{∞} ((is)^r / r!) <(Δx)^r>.    (2.35)

It follows from eqn. 2.30 that

P(x_h, t + h | x_0, t_0) = ∫ dx P(x_h, t + h | x, t) P(x, t | x_0, t_0)
    = (2π)⁻¹ ∫∫ dx ds exp(−is Δx) φ(s, t, h) P(x, t | x_0, t_0)
    = (2π)⁻¹ ∫∫ dx ds [ 1 + Σ_{r=1}^{∞} ((is)^r / r!) <(Δx)^r> ] exp(−is Δx) P(x, t | x_0, t_0)


But

(2π)⁻¹ ∫ ds (is)^r exp(−is Δx) = (−∂/∂x_h)^r (2π)⁻¹ ∫ ds exp(−is Δx) = (−∂/∂x_h)^r δ(x_h − x)

So

P(x_h, t + h | x_0, t_0) = ∫ dx [ 1 + Σ_{r=1}^{∞} (1/r!) (−∂/∂x_h)^r <(Δx)^r> ] δ(x_h − x) P(x, t | x_0, t_0)
    = P(x_h, t | x_0, t_0) + Σ_{r=1}^{∞} (1/r!) (−∂/∂x_h)^r [ <(Δx)^r> P(x_h, t | x_0, t_0) ]

[Note that <(Δx)^r> ≡ <(x_h − x)^r> is the rth moment of the random increment x_h − x in the interval [t, t + h], given that x(t) = x. So <(Δx)^r> refers to the interval ending at t + h.]

Thus

h⁻¹ [P(x_h, t + h | x_0, t_0) − P(x_h, t | x_0, t_0)]
    = Σ_{r=1}^{∞} (1/r!) (−∂/∂x_h)^r [ h⁻¹ <(Δx)^r> P(x_h, t | x_0, t_0) ]

In the limit h → 0 we obtain

∂/∂t P(x, t | x_0, t_0) = Σ_{r=1}^{∞} (1/r!) (−∂/∂x)^r [ K_r(x) P(x, t | x_0, t_0) ]    (2.36)

where

K_r(x) = lim_{h→0} h⁻¹ <(Δx)^r>    (2.37)

Eqn. 2.36 is the Kramers–Moyal expansion of the Chapman–Kolmogorov equation. In case x(t) is Gaussian as well as Markovian

K_1(x) = lim_{h→0} h⁻¹ <Δx> = lim_{h→0} h⁻¹ <x_h − x> = lim_{h→0} h⁻¹ (µ(t + h) − µt) = µ,

K_2(x) = lim_{h→0} h⁻¹ <(Δx)²>
    = lim_{h→0} h⁻¹ <(x_h − x)²>
    = lim_{h→0} h⁻¹ [ <(x_h − <x_h>)²> + <(x − <x>)²> − 2<(x_h − <x_h>)(x − <x>)> + O(h²) ]
    = lim_{h→0} h⁻¹ [ σ²(t + h) + σ²t − 2σ²t + O(h²) ]
    = σ²,

K_r(x) = 0,  r > 2.    (2.38)

Thus eqn. 2.36 reduces to eqn. 2.24, the Fokker–Planck equation. It now follows from eqns. 2.24 and 2.25 that P(x_h, t + h | x, t) can be written, using eqn. 2.28, as:

P(x_h, t + h | x, t) = [2πσ²h]^{−1/2} exp[ −(Δx − µh)² / (2σ²h) ],    (2.39)

a Gaussian distribution in x_h with mean x + µh and variance σ²h. We thus have a random process with a drift velocity µ on which is superimposed a Gaussian fluctuation with variance σ²h, which we can express roughly as:

x_h = x + µh + zσ√h    (2.40)

where z is a Gaussian random variable with zero mean and unit variance. Some points to note:

  • The sample paths of this process are continuous, since lim_{h→0} x_h = x.

  • The sample paths are also nowhere differentiable, since lim_{h→0} h⁻¹ Δx = lim_{h→0} (µ + zσh^{−1/2}) diverges.

  • lim_{h→0} h⁻¹ <Δx> = µ,  lim_{h→0} h⁻¹ <(Δx)²> = σ².

This process is in fact the Wiener–Bachelier process, also called a diffusion or Brownian motion.
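The discretization of eqn. (2.40) is easy to simulate; the following sketch (not from the notes, arbitrary parameters) checks that x(t) − x(0) is approximately normal with mean µt and variance σ²t.

    import numpy as np

    mu, sigma, t, n_steps, n_paths = 0.2, 0.5, 1.0, 500, 10_000
    h = t / n_steps
    rng = np.random.default_rng(2)

    z = rng.standard_normal((n_paths, n_steps))
    increments = mu * h + sigma * np.sqrt(h) * z   # x_h - x for each step, eqn. (2.40)
    x_t = increments.sum(axis=1)                   # x(t) - x(0) for each path

    print(x_t.mean(), mu * t)                      # ~0.2
    print(x_t.var(), sigma**2 * t)                 # ~0.25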

Consider the process with zero drift, µ = 0, for which

Δx = x_h − x = zσ√h    (2.41)


and look at n increments where t − t_0 = nh. It follows from the Markov property that

P(x_n, x_{n−1}, ···, x_1, x_0) = Π_{i=0}^{n−1} P(x_{i+1} | x_i) P(x_0)
    = [2πσ²h]^{−n/2} Π_{i=0}^{n−1} exp[ −(Δx_i)² / (2σ²h) ] P(x_0)

where

Δx_i = x_{i+1} − x_i.    (2.42)

But this is the joint density P(Δx_{n−1}, Δx_{n−2}, ···, Δx_0, x_0), i.e.,

  • the increments Δx_i are statistically independent


Chapter 3

Stochastic calculus

3.1 The Wiener–Bachelier process

Consider now the Wiener–Bachelier process relabelled as W(t), represented by the form:

dW ≡ W (t+ dt) −W (t) = z√dt (3.1)

What meaning, if any, can we give to this equation, in terms of calculus? We can rewrite eqn. 2.40 in the form:

dx(t) = µdt+ σdW (t), (3.2)

or in the integral form:

x(t) − x_0 = µ(t − t_0) + σ ∫_{t_0}^{t} dW(s)    (3.3)

The expression

∫_{t_0}^{t} dW(s)

is a stochastic integral with respect to the sample function W(t). To interpret it we need the concept of a mean–square limit.

Let x_n be a sequence of random variables. Then x_n → x in the mean–square sense if

lim_{n→∞} <(x_n − x)²> = 0.    (3.4)


We write this as

ms lim_{n→∞} x_n = x

and define the Ito stochastic integral

∫_{t_0}^{t} dW(s) ≡ ms lim_{n→∞} Σ_{i=0}^{n−1} ΔW_i    (3.5)

where ΔW_i = W_{i+1} − W_i, and we have divided the interval (t_0, t) into n intervals of width h. It follows that W_n = W(t), W_0 = W(t_0), whence

Σ_{i=0}^{n−1} ΔW_i = W_n − W_0 = W(t) − W(t_0).

So in this case

∫_{t_0}^{t} dW(s) = W(t) − W(t_0)    (3.6)

just as in an ordinary integral, and x(t) − x_0 = µ(t − t_0) + σ(W(t) − W(t_0)), as expected.

However, consider the more complicated random process represented by:

dx(t) = K1(x)dt+√K2(x)dW (t) (3.7)

or more rigorously

x(t) − x_0 = ∫_{t_0}^{t} K_1(x) dt + ∫_{t_0}^{t} √(K_2(x)) dW(t).    (3.8)

Clearly we need to evaluate more complicated integrals such as

∫_{t_0}^{t} √(K_2(x)) dW(t).

Consider, for example, the Ito integral

∫_{t_0}^{t} G(s) dW(s) = ms lim_{n→∞} Σ_{i=0}^{n−1} G_i ΔW_i    (3.9)

with G(s) = W(s).


Thus

∫_{t_0}^{t} W(s) dW(s) = ms lim_{n→∞} Σ_{i=0}^{n−1} W_i ΔW_i
    = ms lim_{n→∞} [ ½ Σ_{i=0}^{n−1} (W_{i+1}² − W_i² − ΔW_i²) ]
    = ms lim_{n→∞} [ ½ (W_n² − W_0²) − ½ Σ_{i=0}^{n−1} ΔW_i² ]
    = ms lim_{n→∞} [ ½ (W²(t) − W²(t_0)) − ½ Σ_{i=0}^{n−1} ΔW_i² ].

But

<Σ_{i=0}^{n−1} ΔW_i²> = Σ_{i=0}^{n−1} <ΔW_i²> = Σ_{i=0}^{n−1} <z²> Δt_i = Σ_{i=0}^{n−1} Δt_i = nh = t − t_0.


More exactly:

<( Σ_i (ΔW_i² − Δt_i) )²> = < Σ_i Σ_j (ΔW_i² ΔW_j² − 2Δt_i ΔW_j² + Δt_i Δt_j) >
    = Σ_i <ΔW_i⁴> + 2Σ_{i>j} <ΔW_i²><ΔW_j²> − 2Σ_i Σ_j Δt_i <ΔW_j²> + Σ_i Σ_j Δt_i Δt_j
    = 3Σ_i Δt_i² + 2Σ_{i>j} Δt_i Δt_j − 2Σ_i Δt_i² − 4Σ_{i>j} Δt_i Δt_j + Σ_i Δt_i² + 2Σ_{i>j} Δt_i Δt_j
    = (3 − 2 + 1) Σ_i Δt_i² + (2 − 4 + 2) Σ_{i>j} Δt_i Δt_j
    = 2Σ_i Δt_i²,

using the independence of the increments, <ΔW_i²> = Δt_i and <ΔW_i⁴> = 3Δt_i². It follows from this that

lim_{n→∞} <( Σ_i (ΔW_i² − Δt_i) )²> = lim_{Δt_i→0} 2Σ_i Δt_i² = 0,

i.e.

ms lim_{n→∞} Σ_{i=0}^{n−1} ΔW_i² = nh = t − t_0.

Thus

ms lim_{n→∞} Σ_{i=0}^{n−1} W_i ΔW_i = ½[W²(t) − W²(t_0)] − ½(t − t_0),

and

∫_{t_0}^{t} W(s) dW(s) = ½[W²(t) − W²(t_0)] − ½(t − t_0).    (3.10)


This result differs from that derived from an ordinary Riemann integral by virtue of the extra term −½(t − t_0). This term appears, as we have demonstrated, because |dW(t)| is almost always O(√dt), so that terms of second order in ΔW do not vanish in the mean-square limit.

3.2 The Ito calculus

To proceed further we need the notion of a non–anticipating function. G(t) is such a function of t if, for s > t, s − t = O(dt), it is statistically independent of dW(t) = W(s) − W(t). For such a G, subject to the extra condition

∫_{t_0}^{t} G²(s) ds < ∞,  t ≥ t_0,

it follows that

∫_{t_0}^{t} G(s) dW(s)

is a continuous function of the upper limit t. Consider now the following stochastic integral:

∫_{t_0}^{t} G(s) dW²(s)

where G(t) is as defined above, and look at

I = lim_{n→∞} <( Σ_{i=0}^{n−1} G_i (ΔW_i² − Δt_i) )²>
  = lim_{n→∞} <( Σ_{i=0}^{n−1} G_i (ΔW_i² − Δt_i) ) ( Σ_{j=0}^{n−1} G_j (ΔW_j² − Δt_j) )>
  = lim_{n→∞} [ < Σ_i G_i² (ΔW_i² − Δt_i)² > + 2 < Σ_{i>j} G_i G_j (ΔW_i² − Δt_i)(ΔW_j² − Δt_j) > ].

But G_i is a non–anticipating function, so

<G_i² (ΔW_i² − Δt_i)²> = <G_i²> <(ΔW_i² − Δt_i)²> = 2 <G_i²> Δt_i²


since ΔW_i is Gaussian, and for i > j

<G_i G_j (ΔW_i² − Δt_i)(ΔW_j² − Δt_j)> = <G_i G_j (ΔW_j² − Δt_j)> <ΔW_i² − Δt_i> = 0,

since <ΔW_i²> = Δt_i. Thus

I = lim_{n→∞} 2 Σ_i <G_i²> Δt_i² = 0,  if G(t) < ∞,

so that

ms lim_{n→∞} [ Σ_i G_i ΔW_i² − Σ_i G_i Δt_i ] = 0.

But

ms lim_{n→∞} Σ_i G_i Δt_i = ∫_{t_0}^{t} G(s) ds

so that

∫_{t_0}^{t} G(s) dW²(s) = ∫_{t_0}^{t} G(s) ds.    (3.11)

We write this in the language of stochastic differentials as:

dW 2(t) = dt. (3.12)

We can also show using similar methods that

dW 2+n(t) = 0, n > 0. (3.13)

These equations are the key to the Ito calculus.
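A numerical illustration of these key relations (a sketch with an arbitrarily discretized path, not part of the derivation): the quadratic variation Σ(ΔW)² approaches t, and the discretized Ito integral Σ W_i ΔW_i approaches ½W(t)² − ½t, as in eqn. (3.10).

    import numpy as np

    t, n = 1.0, 200_000
    h = t / n
    rng = np.random.default_rng(3)

    dW = np.sqrt(h) * rng.standard_normal(n)
    W = np.concatenate(([0.0], np.cumsum(dW)))

    print((dW**2).sum())              # quadratic variation ~ t = 1
    print((W[:-1] * dW).sum())        # Ito sum  sum_i W_i * dW_i
    print(0.5 * W[-1]**2 - 0.5 * t)   # should be close to the line above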

3.2.1 Rules for stochastic differentiation

Consider now the function f[W(t), t] and Taylor expand it with respect to both W(t) and t, to obtain

df[W(t), t] = ∂f/∂t dt + ½ ∂²f/∂t² dt² + ··· + ∂f/∂W dW(t) + ½ ∂²f/∂W² dW²(t) + ···


All higher order terms vanish via the relations dW(t)dt = 0, dW^{2+n}(t) = 0 and dW²(t) = dt, so that if we truncate to second order we are left with the equation:

df[W(t), t] = [ ∂f/∂t + ½ ∂²f/∂W² ] dt + ∂f/∂W dW(t).    (3.14)

3.2.2 Ito’s lemma

Now suppose x(t) satisfies the stochastic differential equation:

dx(t) = K1(x)dt+√K2(x)dW (t) (3.15)

and let f[x(t)] be an arbitrary function of the random variable x(t). Then as above

df[x(t)] = f′[x(t)] dx(t) + (1/2!) f″[x(t)] dx²(t) + ···
    = f′[x(t)] [ K_1(x)dt + √(K_2(x)) dW(t) ]
      + (1/2!) f″[x(t)] [ K_1(x)dt + √(K_2(x)) dW(t) ]² + ···
    = f′[x(t)] [ K_1(x)dt + √(K_2(x)) dW(t) ] + (1/2!) f″[x(t)] K_2(x) dW²(t) + O(dt²).

It follows from eqn. 3.12 that we are left with the equation:

df[x(t)] = [ K_1(x) f′[x(t)] + ½ K_2(x) f″[x(t)] ] dt + √(K_2(x)) f′[x(t)] dW(t).    (3.16)

This is Ito's lemma or formula, in which the ordinary rules of calculus are extended to deal with the effects of Brownian motion.


3.2.3 From stochastic differential equations to the Fokker–Planck equation

Consider the time evolution of the arbitrary function f[x(t)]. We can write it in the form:

d/dt <f[x(t)]> = <d/dt f[x(t)]> = <df[x(t)]>/dt
    = <K_1(x) f′[x(t)] + ½ K_2(x) f″[x(t)]> + <√(K_2(x)) f′[x(t)] dW(t)>/dt
    = <K_1(x) f′[x(t)] + ½ K_2(x) f″[x(t)]>
    = <K_1(x) ∂f/∂x + ½ K_2(x) ∂²f/∂x²>.

But

d/dt <f[x(t)]> = d/dt ∫ dx f[x(t)] P(x, t | x_0, t_0)
    = ∫ dx ∂/∂t ( f[x(t)] P(x, t | x_0, t_0) )
    = ∫ dx f ∂/∂t P(x, t | x_0, t_0),

so

∫ dx f ∂/∂t P(x, t | x_0, t_0) = ∫ dx [ K_1(x) ∂f/∂x + ½ K_2(x) ∂²f/∂x² ] P(x, t | x_0, t_0).

We need to integrate by parts the right hand side using Green's first and second identities, in the form:

∫_R dx ( u ∂v/∂x + v ∂u/∂x ) = uv ]_∂R

∫_R dx ( u ∂²v/∂x² − v ∂²u/∂x² ) = u ∂v/∂x − v ∂u/∂x ]_∂R    (3.17)


Thus

∫_R dx K_1(x) P(x, t | x_0, t_0) ∂f/∂x = K_1(x) P(x, t | x_0, t_0) f ]_∂R − ∫_R dx f ∂/∂x [K_1(x) P(x, t | x_0, t_0)],

and

∫_R dx K_2(x) P(x, t | x_0, t_0) ∂²f/∂x² = K_2(x) P(x, t | x_0, t_0) ∂f/∂x ]_∂R
    − f ∂/∂x (K_2(x) P(x, t | x_0, t_0)) ]_∂R
    + ∫_R dx f ∂²/∂x² (K_2(x) P(x, t | x_0, t_0)).

We now assume that

P (x, t|x0, t0) = 0, x(t), x0 ∈ ∂R. (3.18)

In such a case all the surface terms vanish and we are left with:

∫ dx f ∂/∂t P(x, t | x_0, t_0)
    = ∫ dx [ K_1(x) ∂f/∂x + ½ K_2(x) ∂²f/∂x² ] P(x, t | x_0, t_0)
    = ∫ dx f [ −∂/∂x (K_1(x) P(x, t | x_0, t_0)) + ½ ∂²/∂x² (K_2(x) P(x, t | x_0, t_0)) ].

But f is an arbitrary function, hence:

∂/∂t P(x, t | x_0, t_0) = −∂/∂x (K_1(x) P(x, t | x_0, t_0)) + ½ ∂²/∂x² (K_2(x) P(x, t | x_0, t_0)).

We recognize this equation as a form of the Fokker–Planck equation (eqn. 2.24) for the Wiener–Bachelier process starting from the initial state x(0) = x_0 ∈ R, with drift K_1(x) and variance K_2(x). This is a key result.


3.2.4 Examples

Suppose, for example, that f = x, K_1(x) = µx, K_2(x) = σ²x². Then eqn. 3.15 takes the form:

dx/x = µdt + σdW    (3.19)

and the corresponding Fokker–Planck equation takes the form:

∂P/∂t = −∂/∂x (µxP) + ½ ∂²/∂x² (σ²x²P).    (3.20)

Now let y = log x; then Ito's lemma gives:

dy = ( µx · x⁻¹ + ½ σ²x² · (−x⁻²) ) dt + σx · x⁻¹ dW
   = (µ − ½σ²) dt + σ dW    (3.21)

and the corresponding Fokker–Planck equation is

∂P/∂t = −∂/∂y ((µ − ½σ²)P) + ½ ∂²/∂y² (σ²P).    (3.22)
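A Monte Carlo sketch of this example (not part of the notes; arbitrary parameters and a simple Euler discretization): simulating eqn. (3.19) directly, the sample mean of log(x_T/x_0) is close to (µ − ½σ²)T, as eqn. (3.21) predicts.

    import numpy as np

    mu, sigma, T, n_steps, n_paths = 0.1, 0.3, 1.0, 1000, 20_000
    h = T / n_steps
    rng = np.random.default_rng(4)

    x = np.full(n_paths, 1.0)
    for _ in range(n_steps):
        dW = np.sqrt(h) * rng.standard_normal(n_paths)
        x = x * (1.0 + mu * h + sigma * dW)   # Euler step for dx = mu*x dt + sigma*x dW

    print(np.log(x).mean())                   # approximately mu - sigma^2/2 = 0.055
    print(mu - 0.5 * sigma**2)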


Chapter 4

The Fokker–Planck equation

The general Fokker–Planck equation is:

∂/∂t f(x, t) = −∂/∂x [K_1(x, t) f(x, t)] + ½ ∂²/∂x² [K_2(x, t) f(x, t)]    (4.1)

This equation was derived for

f(x, t) = P (x, t|x0, t0) (4.2)

subject to the initial condition

P (x, t0|x0, t0) = δ(x− x0). (4.3)

but

P(x, t) = ∫ dx_0 P(x, t; x_0, t_0) = ∫ dx_0 P(x, t | x_0, t_0) P(x_0, t_0)

so equation 4.1 also holds for

f(x, t) = P (x, t) (4.4)

with initial condition

P(x, t) → P(x, t_0)  as  t → t_0.    (4.5)


4.1 Boundary conditions

In what follows we assume that the coefficients K_i(x, t) are stationary, i.e. K_i(x, t) = K_i(x). Let:

J(x, t) = K_1(x) f(x, t) − ½ ∂/∂x [K_2(x) f(x, t)]    (4.6)

Then eqn. 4.1 can be written in the form:

∂/∂t f(x, t) + ∂/∂x J(x, t) = 0.    (4.7)

Now consider an interval R on the line with boundary S = ∂R, and let

F(t) = ∫_R dx f(x, t).    (4.8)

Then eqn. 4.7 can be written in the form:

d/dt F(t) = −J(x, t)]_S    (4.9)

thus the total loss of probability f out of R across S is given by

J(S_1, t) − J(S_2, t)

where S_1 and S_2 are the boundaries of R. We can interpret J to be the probability current associated with f, and eqn. 4.7 as describing the conservation of probability.

We use these definitions to describe the following boundary conditions to be applied to eqn. 4.1.

  • reflecting: If there is no net flow across S then J(S, t) = 0, so as x → S from within R it must be "reflected" at S.

  • absorbing: If f(S, t) = 0 then as x → S from within R, since f is continuous it must decline smoothly to zero, and the state "x" is said to be absorbed at S.

  • free: If S = g(t), where g is an unknown function of t, then eqn. 4.1 does not have a unique solution for f(x, t) unless extra constraints are


supplied. Thus if S is a free absorbing boundary such that f(g(t), t) = 0, it follows from the chain rule that:

∂/∂t f(g(t), t) + ∂/∂x f(g(t), t) g′(t) = 0

Suppose we pick

∂/∂x f(g(t), t) = −g′(t)

then

∂/∂t f(g(t), t) = (g′(t))² > 0.

These equations are sufficient to determine g and hence f uniquely, given appropriate initial and other boundary conditions, e.g.: f(0, t) = f_0, f(g(t), t) = 0, f_x(g(t), t) = −g′, t > 0.

4.2 Forward and backward equations

Consider again the binomial random walk, but instead of the recursion given in eqn. 2.22 we use the following:

P (x, t|x0, t0 − h) = pP (x, t|x0 + log u, t0) + qP (x, t|x0 − log u, t0) (4.10)

where x = log S and Δx = log u, and Taylor expand both sides of this equation about (x_0, t_0) to obtain:

[1 − h ∂/∂t_0 + (h²/2!) ∂²/∂t_0² + ···] P(x, t | x_0, t_0)
    = [1 − (q − p) log u ∂/∂x_0 + (1/2!) (log u)² ∂²/∂x_0² + ···] P(x, t | x_0, t_0)

whence from eqn. 2.23 the right–hand side reduces to:

[1 + µh ∂/∂x_0 + ½σ²h ∂²/∂x_0² + ···] P(x, t | x_0, t_0),

so that as h→ 0, the limit of eqn. 4.10 is:

−∂/∂t_0 P(x, t | x_0, t_0) = ( µ ∂/∂x_0 + ½σ² ∂²/∂x_0² ) P(x, t | x_0, t_0).    (4.11)


This is the Kolmogorov equation for the Wiener–Bachelier process. It is the adjoint of eqn. 2.24, the Fokker–Planck equation derived earlier. Eqn. 2.24 is sometimes called the forward equation for the process, since we start from time t and look at the probability distribution at time t + h, whereas eqn. 4.11 is called the backward equation, since we start from time t_0 and look at the distribution at time t_0 − h.

A backward equation corresponding to eqn. 2.36 can also be derived. We start from eqn. 2.30, the Chapman–Kolmogorov equation, in the form:

P(x, t | y, t′ − h) = ∫ dz P(x, t | z, t′) P(z, t′ | y, t′ − h)    (4.12)

However we can also write

P(x, t | y, t′) = P(x, t | y, t′) ∫ dz P(z, t′ | y, t′ − h) = ∫ dz P(x, t | y, t′) P(z, t′ | y, t′ − h)    (4.13)

since ∫ dz P(z, t′ | y, t′ − h) = 1. It follows that:

h⁻¹ [P(x, t | y, t′ − h) − P(x, t | y, t′)] = h⁻¹ ∫ dz P(z, t′ | y, t′ − h) [P(x, t | z, t′) − P(x, t | y, t′)]

We now use Taylor’s theorem in the form:

P(x, t | z, t′) = Σ_{k=0}^{∞} ((z − y)^k / k!) (∂/∂y)^k P(x, t | y, t′)    (4.14)

whence

h⁻¹ [P(x, t | y, t′ − h) − P(x, t | y, t′)] = Σ_{k=1}^{∞} ( h⁻¹ <(Δy)^k> / k! ) (∂/∂y)^k P(x, t | y, t′).

In the limit h→ 0 we obtain:

−∂/∂t′ P(x, t | y, t′) = Σ_{r=1}^{∞} ( K_r(y) / r! ) (∂/∂y)^r P(x, t | y, t′)    (4.15)


This equation is adjoint to eqn. 2.36. Solutions to it exist for t′ ≤ t, subject to the condition:

P (x, t|y, t) = δ(x− y) (4.16)

for all t. The Kolmogorov equation can be rewritten as:

−∂/∂t P(x′, t′ | x, t) = [ K_1(x) ∂/∂x + ½ K_2(x) ∂²/∂x² ] P(x′, t′ | x, t) = A*_x P(x′, t′ | x, t)    (4.17)

where A*_x is the adjoint of the differential operator

A_x = −∂/∂x K_1(x) + ½ ∂²/∂x² K_2(x)    (4.18)

in the sense that

(u, A_x v) ≡ ∫_a^b u(x) A_x[v(x)] dx = ∫_a^b A*_x[u(x)] v(x) dx = (A*_x u, v)    (4.19)

under appropriate boundary conditions. It follows that Ito's lemma, eqn. 3.16, may be rewritten in the more concise form:

df = A*_x f dt + √(K_2(x)) ∂f/∂x dW.    (4.20)

It also follows from this that:

<df> = <A*_x f dt + √(K_2(x)) ∂f/∂x dW> = <A*_x f dt> = <A*_x f> dt

or

<A*_x f> = <df/dt>


In fact we can define A*_x f as

lim_{h→0} h⁻¹ <f(x, h) − f(x, 0)>,    (4.21)

the infinitesimal generator of the process. This can be seen as an extension of the definition of the drift coefficient µ of a Gaussian Markov process X to the drift coefficient of the process f(X).

4.2.1 Boundary conditions for the backward equation

Suppose P(x, t | x′, t′) satisfies the forward equation in R. It can be shown, using the Chapman–Kolmogorov equations and integration by parts, that:

P(x, t | y, s) [ −K_1(y) P(y, s | x′, t′) + ½ ∂/∂y (K_2(y) P(y, s | x′, t′)) ] ]_S
    = ½ P(y, s | x′, t′) K_2(y) ∂/∂y P(x, t | y, s) ]_S    (4.22)

Consider now the following boundary conditions

  • absorbing:

P(y, s | x′, t′) = 0,  y ∈ S    (4.23)

so eqn. 4.22 reduces to:

P(x, t | y, s) ∂/∂y [K_2(y) P(y, s | x′, t′)] ]_S = 0

whence

P(x, t | y, s) = 0,  y ∈ S    (4.24)

i.e. the probability of re–entering R from S is zero.

  • reflecting: Eqn. 4.6 and the reflecting boundary condition J(S, t) = 0 imply that:

J(y, s) = K_1(y) P(y, s | x′, t′) − ½ ∂/∂y [K_2(y) P(y, s | x′, t′)] = 0,  y ∈ S,

whence eqn. 4.22 reduces to:

P(y, s | x′, t′) K_2(y) ∂/∂y P(x, t | y, s) ]_S = 0


The right–hand side of this equation vanishes for arbitrary P(y, s | x′, t′) if and only if:

∂/∂y P(x, t | y, s) = 0,  y ∈ S,    (4.25)

i.e. the flux of probability entering R from S is zero.

4.3 Stationary solutions

In case K_i(x, t) → K_i(x), a stationary solution of eqn. 4.7 exists, satisfying

∂/∂x J(x) = 0    (4.26)

whence

J(x) = J.    (4.27)

It follows that on the finite domain a ≤ x ≤ b

J(a) = J(x) = J(b) = J.    (4.28)

If one of the boundaries is reflecting then both are, and J = 0. If the boundaries are not reflecting then they must be periodic. It follows from eqn. 4.1 and the above that:

∂/∂x [K_1(x) P_st(x)] − ½ ∂²/∂x² [K_2(x) P_st(x)] = 0

whence

K_1(x) P_st(x) − ½ ∂/∂x [K_2(x) P_st(x)] = J = 0.

This generates an ordinary differential equation:

[K_2(x) P_st(x)]′ − 2 K_1(x) P_st(x) = 0.    (4.29)

Let

ψ(x) = K_2(x) P_st(x)    (4.30)

then eqn. 4.29 becomes

ψ′(x) − 2 (K_1(x)/K_2(x)) ψ(x) = 0,


the solution of which is

ψ(x) = N exp[ 2 ∫_a^x dy K_1(y)/K_2(y) ]

whence:

P_st(x) = (N / K_2(x)) exp[ 2 ∫_a^x dy K_1(y)/K_2(y) ],    (4.31)

where N is a normalization constant such that:

∫_a^b P_st(x) dx = 1.    (4.32)

This solution for P_st(x) is sometimes called the potential solution.
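The potential solution of eqn. (4.31) is easy to evaluate numerically. The sketch below (not from the notes) uses the illustrative choice K_1(x) = −kx, K_2(x) = σ² (an Ornstein–Uhlenbeck-type process), for which P_st is a Gaussian with variance σ²/(2k).

    import numpy as np

    def trapezoid(y, x):
        return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

    k, sigma = 1.5, 0.7
    x = np.linspace(-5.0, 5.0, 2001)

    K1 = -k * x                      # drift coefficient K1(x) = -k x
    K2 = np.full_like(x, sigma**2)   # diffusion coefficient K2(x) = sigma^2

    # P_st(x) ~ (N / K2(x)) * exp(2 * int_a^x K1(y)/K2(y) dy), eqn. (4.31).
    ratio = K1 / K2
    cumulative = np.concatenate(([0.0], np.cumsum(0.5 * (ratio[1:] + ratio[:-1]) * np.diff(x))))
    p_st = np.exp(2 * cumulative) / K2
    p_st /= trapezoid(p_st, x)

    print(trapezoid(x**2 * p_st, x))   # ~ sigma^2/(2k) = 0.163, the stationary variance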

4.4 Non–stationary solutions. Eigenvalues and Eigenfunctions.

In general, except for Gaussian processes, the non–stationary problem is much harder to solve. In some cases it can be solved by separation of variables; i.e. we look for solutions of the form:

f(x, t) = X(x)T (t) (4.33)

and substitute this into eqn. 4.1 to obtain:

X(x) ∂/∂t T(t) = −T(t) ∂/∂x [K_1(x) X(x)] + T(t) ½ ∂²/∂x² [K_2(x) X(x)]

assuming K_1 and K_2 to be time–independent. It follows that

(1/T(t)) ∂/∂t T(t) = (1/X(x)) [ −∂/∂x [K_1(x) X(x)] + ½ ∂²/∂x² [K_2(x) X(x)] ].

Such an equality can hold if and only if both sides equal a constant, which we take to equal −λ. This generates the ordinary differential equations:

T (t) + λT (t) = 0 (4.34)

49

Page 52: Course notes on Financial Mathematics

and1

2[K2(x)X(x)]′′ − [K1(x)X(x)]′ + λX(x) = 0 (4.35)

Eqn. 4.34 requires one initial condition, and eqn. 4.35 two boundary condi-tions. Eqn. 4.35 is a version of the Sturm–Liouville equation, and definesan eigenvalue problem. For various boundary conditions, e.g., absorbing orreflecting, eqn. 4.35 has solutions comprising a sequence of eigenfunctionsX0(x), X1(x), · · · belonging to the eigenvalues λ0, λ1, · · · .

Suppose therefore that

f(x, t) = Pst(x)b(s, t) (4.36)

where f(x, t) satisfies the forward equation. Thus

∂t[Pst(x)b(s, t)] = Pst(x)

∂tb(s, t)

= − ∂

∂x[K1(x)Pst(x)b(x, t)] +

1

2

∂2

∂x2[K2(x)Pst(x)b(x, t)]

= − ∂

∂x[K1(x)Pst(x)]b(x, t) − [K1(x)Pst(x)]

∂xb(x, t)

+∂

∂x[K2(x)Pst(x)]

∂xb(x, t) +

1

2

∂2

∂x2[K2(x)Pst(x)]b(x, t)

+1

2[K2(x)Pst(x)]

∂2

∂x2b(x, t)

=

[− ∂

∂x[K1(x)Pst(x)] +

1

2

∂2

∂x2[K2(x)Pst(x)]

]b(x, t)

+

[−[K1(x)Pst(x)] +

∂x[K2(x)Pst(x)]

]∂

∂xb(x, t)

+1

2[K2(x)Pst(x)]

∂2

∂x2b(x, t).

but

− ∂

∂x[K1(x)Pst(x)] +

1

2

∂2

∂x2[K2(x)Pst(x)] = 0,

and for reflecting boundary conditions:

−[K1(x)Pst(x)] +1

2

∂x[K2(x)Pst(x)] = 0

50

Page 53: Course notes on Financial Mathematics

Using these conditions we find that:

∂tb(x, t) =

[K1(x)

∂x+

1

2K2(x)

∂2

∂x2

]b(x, t) (4.37)

This is a form of the backward equation. Thus if f(x, t) satisfies the forwardequation, then b(x, t) satisfies the backward equation. Now let

f(x, t) = Fλ(x) exp(−λt), b(x, t) = Bλ(x) exp(−λt). (4.38)

On applying these to eqns. 4.1 and 4.37 we obtain the eigenvalue equations:

−[K1(x)Fλ(x)]′ +

1

2[K2(x)Fλ(x)]

′′ = −λFλ(x)

K1(x)B′

λ(x) +1

2K2(x)B

′′

λ(x) = −λBλ(x) (4.39)

It follows using partial integration that

(λ′ − λ)

b∫

a

dxFλ(x)Bλ′(x) =

[Bλ′(x)[−K1(x)Fλ(x) +

1

2

∂x[K2(x)Fλ(x)]] −

1

2K2(x)Fλ(x)

∂xBλ′(x)

]b

a

.

But from eqns. 4.36 and 4.38 it follows that:

Fλ(x) = Pst(x)Bλ(x) (4.40)

whence one can show, given reflecting boundary conditions, that:

1

2K2(x)Fλ(x)

∂xBλ′(x) = Bλ(x)[−K1(x)Fλ′(x) +

1

2[K2(x)Fλ′(x)]],

so that

(λ′ − λ)

b∫

a

dxFλ(x)Bλ′(x) = [Bλ′(x)Jλ(x) − Bλ(x)Jλ′(x)]ba .

On applying reflecting boundary conditions it follows that:

b∫

a

dxFλ(x)Bλ′(x) = δλλ′ (4.41)

51

Page 54: Course notes on Financial Mathematics

i.e., Fλ(x) and Bλ(x) are bi–orthogonal.Alternatively, it follows from eqn. 4.38 that:

b∫

a

dxPst(x)Bλ(x)Bλ′(x) = δλλ′ ,

b∫

a

dxP−1st (x)Fλ(x)Fλ′(x) = δλλ′ . (4.42)

Note that if λ = λ′ = 0 then

F0(x) = Pst(x), B0(x) = 1. (4.43)

4.4.1 Eigenfunction expansions

We can now write an eigenfunction expansion for arbitrary f(x, t) and re-flecting boundary conditions, in the form:

f(x, t) =∑

λ

AλFλ(x) exp(−λt), Aλ =

b∫

a

Bλ(x)f(x, 0)dx (4.44)

Thus if p(x, 0|x0, 0) = δ(x− x0) then

Aλ =

b∫

a

Bλ(x)δ(x− x0)dx = Bλ(x0)

soP (x, t|x0, 0) =

λ

Fλ(x)Bλ(x0) exp(−λt). (4.45)

Absorbing boundary conditions give rise to similar expansions. Let Fλand Bλ be defined as above (with Pst(x) still satisfying reflecting boundaryconditions). Then at the boundaries a and b

Fλ(a) = Fλ(b) = Bλ(a) = Bλ(b) = 0 (4.46)

and all results go through except that λ = 0 is excluded, so that P (x, t|x0, 0) →0 as t→ ∞.

52

Page 55: Course notes on Financial Mathematics

4.5 Examples

Consider again the Wiener–Bachelier process with µ = 0, and σ2 = 1.Eqn. 2.24 then reduces to:

∂tP (x, t|x0, 0) =

1

2

∂2

∂x2P (x, t|x0, 0) (4.47)

with initial conditionP (x, 0|x0, 0) = δ(x− x0) (4.48)

The characteristic function for this process, eqn. 2.17, reduces to:

φ(s, t) = exp[isx0 −1

2s2t] (4.49)

whence

P (x, t|x0, 0) =1√2πt

exp

[−(x− x0)

2

2t)

](4.50)

This solution is such that:

∞∫

−∞

dxP (x, t|x0, 0) = 1, limx→±∞

P (x, t|x0, 0) = 0.

4.5.1 Finite domains

Suppose the domain is 0 ≤ x ≤ 1.(a) Absorbing boundary conditions:

P (0, t) = P (1, t) = 0 (4.51)

Let

P (x, t) =

∞∑

n=1

cn(t) sin(nπx), P (x, 0) = δ(x− x0). (4.52)

It follows that:

cn(0) = 2

2∫

0

dxδ(x− x0) sin(nπx) = 2 sin(nπx0) (4.53)

53

Page 56: Course notes on Financial Mathematics

On substituting eqn. 4.53 into eqn. 4.47 we obtain:

d

dtcn(t) = −λncn(t), λn =

(nπ)2

2(4.54)

whencecn(t) = cn(0) exp(−λnt) (4.55)

and the solution is therefore:

P (x, t|x0, 0) = 2

∞∑

n=1

sin(nπx) sin(nπx0) exp(−1

2(nπ)2t) (4.56)

Note thatPst(x) = lim

t→∞P (x, t|x0, 0) = 0

andFλ(x) =

√2 sin(nπx), Bλ(x) =

√2 sin(nπx) = Fλ(x)

as expected since the eigenvalue equation derived from eqn. 4.47 is self–adjoint.

(b) Reflecting boundary conditions:

J = −1

2

∂xP (0, t) = −1

2

∂xP (1, t) = 0 (4.57)

Let

P (x, t) =1

2a0 +

∞∑

n=1

an(t) cos(nπx), P (x, 0) = δ(x− x0) (4.58)

Then

an(0) = 2

1∫

0

dxδ(x− x0) cos(nπx) = 2 cos(nπx0) (4.59)

leading to

an(t) = an(0) exp(−1

2(nπ)2) (4.60)

and

P (x, t|x0, 0) = 1 + 2∞∑

n=1

cos(nπx) cos(nπx0) exp(−1

2(nπ)2) (4.61)

54

Page 57: Course notes on Financial Mathematics

so thatPst(x) = lim

t→∞P (x, t|x0, 0) = 1 (4.62)

In this caseFλ(x) =

√2 cos(nπx) = Bλ(x).

55

Page 58: Course notes on Financial Mathematics

Chapter 5

Distributions and Green’s

functions

We now make a bit more precise the nature of the Dirac delta “function”.

• The “delta function” is a rule that assigns the number φ(0) to the test

function φ(x).

• A test function φ(x) is a real function, all of whose derivatives exist (i.e.φ(x) is a C∞ function), and which vanishes outside a finite interval. So

φ : R → R

is defined and differentiable for all x, and

φ(x) ≡ 0, for |x| large. (5.1)

• A distribution is a functional (or rule)

f : D → R

which is linear and continuous.

• linearity means

(f, aφ+ bψ) = a(f, φ) + b(f, ψ) (5.2)

for all constants a, b and test functions φ, ψ.

56

Page 59: Course notes on Financial Mathematics

• continuity means that if φn is a sequence of test functions that finishoutside the interval [a, b] and converge uniformly to a test function φ,and if all their derivatives do as well, then:

(f, φn) → (f, φ) (5.3)

• we specify a distribution f by writing

φ 7→ (f, φ)

• note that any ordinary integrable function f(x) corresponds to thedistribution

φ 7→∞∫

−∞

f(x)φ(x)dx = (f, φ)

Because of this it is common to use the notation

∞∫

−∞

δ(x)φ(x)dx = φ(0)

and to refer to δ(x) as if it were an ordinary function.

• the “delta function” is the distribution

δ : φ 7→ φ(0)

We write this as:

(δ, φ) =

∞∫

−∞

δ(x)φ(x)dx = φ(0). (5.4)

• If fN is a sequence of distributions, and f is another distribution, thenfN converges weakly to f if

(fN , φ) → (f, φ) as N → ∞

for all test functions φ.

57

Page 60: Course notes on Financial Mathematics

The solution to eqn. 4.47 for P (x, 0|x0, 0) = δ(x− x0) is

P (x, t|x0, 0) =1√2πt

exp

[−(x− x0)

2

2t

](5.5)

Evidentlylimt0

P (x, t|x0, 0) = P (x, 0|x0, 0)

Thus

∞∫

−∞

dxP (x, t|x0, 0)φ(x) →∞∫

−∞

dxP (x, 0|x0, 0)φ(x) =

∞∫

−∞

dxδ(x−x0)φ(x) = φ(x0).

It follows that in the sense of distributions:

P (x, t|x0, 0) → δ(x− x0), (weakly). (5.6)

5.1 The derivative of a distribution

Suppose f(x) ∈ C1, then:

∞∫

f ′(x)φ(x)dx = f(x)φ(x)]∞−∞ −∞∫

−∞

f(x)φ′(x)dx = −∫ ∞

−∞f(x)φ′(x)dx

since φ(x) = 0 for |x| large.We define the derivative f ′ of a distribution by the formula:

(f ′, φ) = −(f, φ′) (5.7)

for all φ.If follows from this that:

(δ′, φ) = −(δ, φ′) = −φ′(0)

(δ′′, φ) = −(δ′, φ′) = (δ, φ′′) = φ′′(0)

i.e.(δ(n), φ) = (−1)nφ(n)(0) (5.8)

58

Page 61: Course notes on Financial Mathematics

The Heaviside function (or step function) is defined as:

H(x) =

1 if x > 00 if x < 0

(5.9)

For any φ:

(H ′, φ) = −(H, φ′) −∞∫

−∞

H(x)φ′(x)dx

= −∞∫

0

φ′(x) = φ(x)]0∞

= φ(0)

ThusH ′(x) = δ(x) (5.10)

It follows immediately from eqn. 5.4 with φ(x) = 1 that (δ, 1) = 1, i.e.:

∞∫

−∞

δ(x)dx = 1 (5.11)

5.2 Green’s functions

Since

P (x, t) =

∫dx0P (x, t|x0, 0)P (x0, 0)

it follows from eqn. 4.50 that

P (x, t) =

∞∫

−∞

dx01√2πt

exp

[−(x− x0)

2

2t

]P (x0, 0)

i.e. the solution of eqn. 4.47 with P (x, 0) = φ(x) is:

P (x, t) =

∞∫

−∞

dx′1√2πt

exp

[−(x− x′)2

2t

]φ(x′) (5.12)

59

Page 62: Course notes on Financial Mathematics

for any test function φ(x), and the function

S(x, t) =1√2πt

exp

[−x

2

2t

](5.13)

converges weakly to δ(x) as t 0.S(x, t) is called the Green’s function of eqn. 4.47 on −∞ < x < ∞ and

is itself a solution with initial condition P (x, 0) = δ(x) on −∞ < x <∞, fort > 0. Note the following:

• S(x, t) = p(x, t|0, 0)

• Equation 5.12 can be rewritten as the convolution:

P (x, t) =

∞∫

−∞

dx′S(x− x′, t)φ(x′) ≡ (S ∗ φ)(x, t) (5.14)

• Eqn. 4.47 is completely characterized by S(x, t).

5.2.1 Some symmetries of solutions of the Wiener–

Bachelier equation

1. The translate P (x− y, t) of any solution P (x, t) is a solution.

2. Any derivative Px, Pxx, Pt, · · · is a solution.

3. If P1 and P2 are solutions, so is any linear combination aP1 + bP2.

4. A limiting form of such linearity is that S ∗ φ is a solution.

5. Let Q(x, t) = P (√ax, at), for a > 0. Then Qxx = aPxx, Qt = aPt, so if

P (x, t) is a solution so is Q(x, t).

Thus P (x, t) remains a solution of eqn. 4.47 under x→ x± y; x→ √ax, t→

at; and under differentiation, integration and convolution.

60

Page 63: Course notes on Financial Mathematics

5.2.2 Further properties of solutions of the Wiener–

Bachelier equation

• Suppose P (x, 0) = H(x), then:

P (x, t) = S(x, t) ∗H(x)

=

∞∫

−∞

S(x− x′, t)H(x′)dx′ =

∞∫

−∞

S(y, t)H(x− y)dy

=

x∫

−∞

S(y, t)dy =

x∫

−∞

1√2πt

exp

[−y

2

2t

]dy

=1√π

x/√

2t∫

−∞

exp[−z2

]dy

where z = y/√

2t. Let

2√π

x∫

0

exp[−z2

]dz ≡ Erf(x) (5.15)

thenErf(0) = 0, Erf(∞) = 1 (5.16)

so

P (x, t) =1

2+

1

2Erf

(x√2t

). (5.17)

• Since H ′(x) = δ(x) we expect by symmetry 2 that:

∂x

[1

2+

1

2Erf

(x√2t

)]= S(x, t)

as is the case.

• Let y = µt. Then by symmetry 1 S(x − µt, t) is a solution to someWiener–Bachelier equation with initial condition δ(x). Let

x′ = x− µt, t′ = t (5.18)

61

Page 64: Course notes on Financial Mathematics

Then S(x’,t’) satisfies the Wiener–Bachelier equation:

∂t′P (x′, t′) =

1

2

∂2

∂x′2P (x′, t′).

But from eqn. 5.18 it follows that:

∂t′P (x′, t′) =

∂t′P (x′(x, t), t′(x, t)) =

∂P

∂x

∂x

∂t′+∂P

∂t

∂t

∂t′= µ

∂P

∂x+∂P

∂t

and

∂x′P (x′, t′) =

∂x′P (x′(x, t), t′(x, t)) =

∂P

∂x

∂x

∂x′+∂P

∂t

∂t

∂x′=∂P

∂x

whence∂

∂tP (x, t) = − ∂

∂x[µP (x, t)] +

1

2

∂2

∂x2P (x, t). (5.19)

Thus the solution of eqn. 5.19 with P (x, 0) = δ(x) is:

S(x− µt, t) = S(x, t) =1√2πt

exp

[−(x− µt)2

2t

]

as expected.

• We can use the symmetries above to solve the Wiener–Bachelier equa-tion directly for S(x, t). We look first for a solution P (x, t) in the form

P (x, t) ≡ g(z) (5.20)

where z = x/√

2t. Clearly g(z) is invariant under the dilatations x →√ax, t→ at, as is eqn. 4.47. But

∂P

∂t=

dg

dz· ∂z∂t

= − z

2tg′(z)

∂P

∂x=

dg

dz· ∂z∂x

=1

2tg′(z)

So eqn. 4.47 becomes:

− z

2tg′(z) =

1

2· 1

2tg′′(z)

i.e.g′′(z) + 2zg′(z) = 0. (5.21)

62

Page 65: Course notes on Financial Mathematics

This ordinary differential equation can be solved using the integratingfactor exp(

∫2zdz) = exp z2, whence:

g′(z) = c1 exp(−z2)

so

g(z) = c1

∫exp(−z2)dz + c2

where c1 and c2 are constants. Thus

P (x, t) = c1

x/√

2t∫

0

exp(−z2)dz + c2 = c1

√π

2Erf

(x√2t

)+ c2. (5.22)

To determine the constants c1 and c2 we assume that P (x, 0) = H(x), i.e.:

If x > 0, limt0

P (x, t) = 1, if x < 0, limt0

P (x, t) = 0 (5.23)

Then

1 = c1

√π

2Erf(∞) + c2 = c1

√π

2+ c2

0 = c1

√π

2Erf(−∞) + c2 = −c1

√π

2+ c2

so c1 = 1/√π, c2 = 1/2 and

P (x, t) =1

2+

1

2Erf

(x√2t

)(5.24)

This is the response to initial data H(x). Evidently the response to δ(x)is just

∂xP (x, t) =

1√2πt

exp

(−x

2

2t

)

as expected.This method of reducing a partial differential equation with two boundary

conditions and one initial condition to an ordinary differential equation withtwo initial conditions will only work if two of the three partial differentialequation’s conditions coalesce. This can occur only if the equation has a

63

Page 66: Course notes on Financial Mathematics

certain symmetry, namely invariance under some dilatation, i.e., under asimilarity transformation in which

(x, t) →√ax√2at

=x√2t

independent of the dilatation parameter a. Solutions obtained in such amanner were discovered by Birkhoff for a broad class of both linear andnonlinear partial differential equations, and are called similarity solutions.

• The diffusion equation

∂P

∂t− k

∂2P

∂x2+ bp = 0 (5.25)

on −∞ < x <∞, with P (x, 0) = φ(x) can be reduced to the standardform of equation 4.47 via the change of variables

P (x, t) = exp(−bt)Q(x, t) (5.26)

• There is another way to solve eqn. 4.47 directly for the initial condi-tion P (x, t|0, 0) = S(x, t) using Fourier transforms. Given the partialdifferential equation

∂S

∂t=

1

2

∂2S

∂x2

with S(x, 0) = δ(x), let

S(k, t) =

∞∫

−∞

S(x, t) exp(−ikx)dx (5.27)

be the Fourier transform of S(x, t).

It follows from the properties of the Fourier transform that

∂tS(k, t) =

1

2(ik)2S(k, t) = −1

2k2S(k, t) (5.28)

This is an ordinary differential equation with solution

S(k, t) = exp(−1

2k2t) (5.29)

64

Page 67: Course notes on Financial Mathematics

But the inverse Fourier transform of S(k, t) is

S(x, t) =1

∞∫

−∞

S(k, t) exp(ikx)dk (5.30)

It is easy to show that the transform of e−x2/2/

√2π = e−k

2/2 and thatof e−x

2t/2/√

2π =√te−k

2t/2, so that

S(x, t) =1√2πt

exp

[−x

2

2t

]. (5.31)

• A more general Green’s function can also be defined. Let

R(x, t) ≡ S(x− x0, t− t0), t > t0

≡ 0, t < t0 (5.32)

Then R(x, t) satisfies the inhomogeneous diffusion equation

∂R

∂t− 1

2

∂2R

∂x2= δ(x− x0)δ(t− t0) (5.33)

0n −∞ < x <∞,−∞ < t <∞.

5.3 The relationship between Green’s func-

tions and eigenfunctions

The Green’s function S(x, x0, t) is defined as the solution of the partial dif-ferential equation

∂S

∂t=

1

2

∂2S

∂x2, x ∈ D (5.34)

with S = 0 on x ∈ ∂D and S = δ(x−x0) for t = 0 on some bounded domainD.

Let P (x, t) be the solution of the same (Dirichlet) problem, but with theinitial condition P (x, 0) = φ(x). Let λn and Xn(x) be the eigenvalues and

65

Page 68: Course notes on Financial Mathematics

their (normalized) eigenfunctions for the domain D. Then eqns. 4.44 give:

P (x, t) =

∞∑

n=1

AnXn(x) exp(−λnt)

=

∞∑

n=1

[

D

Xn(x′)φ(x′)dx′]Xn(x) exp(−λnt)

=

D

[

∞∑

n=1

Xn(x)Xn(x′) exp(−λnt)]φ(x′)dx′ (5.35)

assuming that the interchanges of∫D

and∞∑n=1

can be justified.

But since S(x, x0, t) is by definition the Green’s function for the domain D,

P (x, t) =

D

S(x, x′, t)φ(x′)dx′ (5.36)

whence

S(x, x0, t) =

∞∑

n=1

Xn(x)Xn(x0) exp(−λnt) (5.37)

provided this series converges. [cf. eqn. 4.45.]

66

Page 69: Course notes on Financial Mathematics

Chapter 6

Path integrals

Consider again the Fokker–Planck equation for a Gaussian random processwith zero drift and variance σ2:

∂tP (x, t|x0, t0) =

σ2

2

∂2

∂x2P (x, t|x0, t0) (6.1)

with initial condition:

limtt0

P (x, t|x0, t0) = δ(x− x0). (6.2)

As we have shown the Green’s function solution for this equation is

P (x, t|x0, t0) =1√

2πσ2(t− t0)exp

[− (x− x0)

2

2σ2(t− t0)

]. (6.3)

We now (again) divide the interval (t0, t) this time into n + 1 intervalsof width h, just as we did in defining a stochastic integral, so that x0 =x(t0), · · · , xn+1 = x(t). It follows from eqns. 2.39-2.42 that

P (xn+1, xn, xn−1, · · · , x1|x0) =n∏

i=0

P (xi+1|xi)

=

n∏

i=0

[2πσ2h

]−n+1

2 exp

[−(xi+1 − xi)

2

2σ2h

]

=[2πσ2h

]−n+1

2 exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2

](6.4)

67

Page 70: Course notes on Financial Mathematics

thus

P (xn+1, xn, xn−1, · · · , x1|x0)

n∏

i=1

dxi =

[2πσ2h

]−n+1

2 exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2

]n∏

i=1

dxi (6.5)

and we note that in the limit h→ 0, n→ ∞, (n+ 1)h→ t− t0.By integrating over x1 to xn we are left with the expression

P (xn+1|x0) =

∞∫

−∞

dx1

∞∫

−∞

dx2 · · ·∞∫

−∞

dxnP (xn+1, xn, xn−1, · · · , x1|x0)

=[2πσ2h

]−n+1

2

∞∫

−∞

dx1

∞∫

−∞

dx2 · · ·∞∫

−∞

dxn exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2

].

In the limit h → 0, n → 0 we can rewrite this as the probability offollowing a particular path in the (x, t)-plane–x(τ) from x0 = x(t0) to xn+1 =x(t), specified by xi = x(ti), i.e. as:

P [x(τ)|x0] =

(x.t)∫

(x0,t0)

exp

− 1

2σ2

t∫

t0

(dx

dτ)2dτ

d[x(τ)] (6.6)

but P [x(τ)|x0] = P (x, t|x0, t0) = S(x, t|x0, t0) so it follows from eqn. 5.13that

P [x(τ)|x0] =1√

2πσ2(t− t0)exp

[− (x− x0)

2

2σ2(t− t0)

](6.7)

• The expression d[x(τ)] is shorthand for the “infinitesimally small” setof functions x(τ) such that

x(t0) = x0

x1 < x(t1) < x1 + dx1,...

xn < x(tn) < xn + dxn,

x(t) = x (6.8)

68

Page 71: Course notes on Financial Mathematics

i.e.

d[x(τ)] ↔n∏

i=1

dxi (6.9)

•(x,t)∫

x0,t0)

d[x(τ)] ↔ [2πhσ2]−n+1

2

n∏

i=1

∞∫

−∞

dxi (6.10)

• All paths begin at (x0, t0) and end on (x, t). Eqn 6.5 shows that the (dis-cretized) paths which contribute appreciably to the integral in eqn. 6.7are such that:

|xi+1 − xi| = O(σ√h) (6.11)

For such paths

dx

dτ= O(σ

√h)h−1 = O(σh−1/2) → ±∞ as h→ 0

i.e. all the important paths are continuous but non–differentiable.

•t∫

t0

(dx

dτ)2 ↔ h−1

n∑

i=0

(xi+1 − xi)2 (6.12)

• The limit n → ∞ is taken only after all the integrations have beenperformed.

6.1 How to compute path integrals

Consider first the simplest case in which x0 = x = 0, t0 = 0 so that

P (x, t|x0, t0) = P (0, t|0, 0) = S(0, 0, t)

= limn→∞

[2πσ2h]−n+1

2

n∏

i=1

∞∫

−∞

dxi exp

[− 1

2σ2hi

n∑

i=0

(xi+1 − xi)2

]

= [2πσ2t]−1/2 (6.13)

Suppose that we do not know that this Wiener integral has the value [2πσ2t]−1/2.There are then two methods of calculating its value–the spectral representa-tion, and the cell representation.

69

Page 72: Course notes on Financial Mathematics

6.1.1 The spectral representation

Letn∑

i=0

(xi+1 − xi)2 =

n∑

i=0

(x2i+1 − 2xi+1xi + x2

i )

≈n∑

j=0

n∑

k=0

xjAjkxk (6.14)

where Ajk is the jkth element of the tridiagonal matrix:

2 −1−1 2 −1

−1 2 −1·

·−1 2 −1

−1 2

Since such a matrix is symmetric, Ajk = Akj, its eigenvalues λj are real. Bydirect computation one can show that

λj = 2 − 2 cos jπ

n+ 1(6.15)

and that the un–normalized jth eigenvector of A has the kth component

ξjk = sin jkπ

n+ 1(6.16)

i.e.Aξj = λjξj (6.17)

[whence −1 · ξjk−1 + 2 · ξjk − 1 · ξjk+1 = λjξjk]

It follows that if we normalize the jth eigenvector to

ξjk =sin jk π

n+1n∑k=1

sin jk πn+1

(6.18)

then the matrix ξ diagonalizes A, i.e.

ξTAξ = Λ (6.19)

70

Page 73: Course notes on Financial Mathematics

where Λ is a diagonal matrix with coefficients

λjk = λjδjk. (6.20)

This similarity transformation also diagonalizes the quadratic form:

n∑

j=1

n∑

k=1

xjAjkxk = xTAx (6.21)

Thus letx = ξy, xT = (ξy)T (6.22)

then

xTAx = (ξy)TA(ξy) = yT ξTAξy = yTΛy

orn∑

j=1

n∑

k=1

xjAjkxk =n∑

j=1

λjy2j (6.23)

It follows immediately that

S(0, 0, t) ≡ Se(t) = limn→∞

[2πσ2h]−n+1

2

n∏

i=1

∫ ∞

−∞dyi exp

[− 1

2σ2h

n∑

i=0

λiy2i

]

(6.24)where

n∏

i=1

∫ ∞

−∞dyi =

∫ ∞

−∞· · ·∫ ∞

−∞

∂(x1, x2, · · ·xn)∂(y1, y2, · · · yn)

dy1dy2 · · ·dyn

=

∫ ∞

−∞· · ·∫ ∞

−∞dy1dy2 · · ·dyn (6.25)

since the Jacobian

∂(x1, x2, · · ·xn)∂(y1, y2, · · · yn)

≡ det

(dxjdyk

)= 1

for the transformation defined in eqn. 6.22.But eqn. 6.24 can now be decomposed as:

limn→∞

[2πσ2h]−n+1

2

n∏

i=1

∫ ∞

−∞dyi exp

[− λiy

2i

2σ2h

]

71

Page 74: Course notes on Financial Mathematics

and since ∫ ∞

−∞dyi exp

[− λiy

2i

2σ2h

]=√

2πσ2h/λi (6.26)

the lhs of eqn. 6.24 integrates to:

limn→∞

[2πσ2h]−n+1

2

n∏

i=1

√2πσ2h/λi = lim

n→∞[2πσ2h]−

1

2

n∏

i=1

1/√λi (6.27)

However diagonalizing a matrix does not change the value of its determi-nant so

detA =n∏

i=1

λi (6.28)

so eqn. 6.24 further reduces to

limn→∞

[2πσ2h detAn]− 1

2

where we have indexed A by n.By inspection of the matrix A or by showing that detAn satisfies the

recursion equation

detAn = 2 detAn−1 − detAn−2 (6.29)

one can deduce thatdetAn ≡ detA = n+ 1 (6.30)

and so we obtain the final value of the integral as:

limn→∞

[2πσ2h(n + 1)]−1

2 = [2πσ2t]

as expected.

6.1.2 The cell representation

Here we perform the integrations in eqn. 6.13 one by one using

∫ ∞

−∞exp

[−a(x− x′)2 − b(x− x′′)2

]dx =

√π

a + bexp

[− ab

a + b(x′ − x′′)2

]

(6.31)

72

Page 75: Course notes on Financial Mathematics

so

[2πσ2h]−1

∫ ∞

−∞dx1 exp

[− 1

2σ2h[(x1 − x0)

2 − (x2 − x1)2]

]=

[2πσ2h]−1[πσ2h]1

2 exp

[−(x2 − x0)

2

4σ2h

]

= [4πσ2h]−1

2 exp

[−(x2 − x0)

2

4σ2h

](6.32)

This expression has the same factors as eqn. 6.4 with h replaced by 2h.Similar factors arise if integrations are performed over x3, x5, x7, · · · to give:

limn→∞

[4πσ2h]−n+1

4

∫ ∞

−∞dx2

∫ ∞

−∞dx4 · · ·

∫ ∞

−∞dxn−1 exp

− 1

4σ2h

n−1

2∑

i=0

(x2(i+1) − x2i)2

(6.33)Using eqn. 6.31 again we find that

[4πσ2h]−1

∫ ∞

−∞dx2 exp

[− 1

4σ2h[(x2 − x0)

2 − (x4 − x2)2]

]=

[4πσ2h]−1[2πσ2h]1

2 exp

[−(x4 − x0)

2

8σ2h

]

= [8πσ2h]−1

2 exp

[−(x4 − x0)

2

8σ2h

](6.34)

so if integrations are performed over x2, x6, x10, · · · we obtain:

limn→∞

[8πσ2h]−n+1

8

∫ ∞

−∞dx4

∫ ∞

−∞dx8 · · ·

∫ ∞

−∞dxn−3 exp

− 1

8σ2h

n−3

4∑

i=0

(x4(i+1) − x4i)2

(6.35)Eventually after m = 1+ log2(n+1) iterations the process terminates at thevalue

limn→∞

[2mπσ2h]−(n+1)/2m

= limn→∞

[2πσ2(n + 1)h]−1/2

= [2πσ2t]−1/2 = Se(t).

This method of evaluating a Wiener integral is known in physics as therenormalization group method.

73

Page 76: Course notes on Financial Mathematics

6.2 The connection between the path integral

and the Wiener–Bachelier equation

Suppose we start with eqn. 6.7 in the form:

S(x, t|x0, t0) =

(x,t)∫

(x0,t0)

exp

[− 1

2σ2

∫ t

t0

(dx

dτ)2dτ

]d[x(τ)]

= limn→∞,h→0

[2πσ2h]−n+1

2

n∏

i=1

∫ ∞

−∞dxi exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2

]

where S(x, t|x0, t0) = 0 for t < t0Now perform the integrations and take the corresponding limits for i =

0, 1, 2, · · · , n− 1 keeping xn and h fixed. The result is

S(x, t|x0, t0) = limh→0

[2πσ2h]−1

2

∫ ∞

−∞

[lim

n→∞,h→0[2πσ2h]−

n2

n∏

i=1

∫ ∞

−∞dxi exp

[− 1

2σ2h

n−1∑

i=0

(xi+1 − xi)2

]]

· exp

[− 1

2σ2h(xn+1 − xn)

2

]

= limh→0

[2πσ2h]−1

2

∫ ∞

−∞dxn ·

(x,t)∫

(x0,t0)

exp

[− 1

2σ2

∫ t

t0

(dx

dτ)2dτ

]d[x(τ)]

exp

[− 1

2σ2h(xn+1 − xn)

2

](6.36)

where xn+1 = x(t).As we noted earlier, the paths which contribute appreciably to this inte-

gral are such that|x(t) − xn| = O(

√h)

i.e.xn = x(t) +O(

√h). (6.37)

74

Page 77: Course notes on Financial Mathematics

We can therefore expand S(xn, tn|x0, t0) about x(t) to obtain:

S(xn, tn|x0, t0) = S(x, tn|x0, t0) + (xn − x)∂

∂xS(x, tn|x0, t0)

+1

2(xn − x)2 ∂

2

∂x2S(x, t|x0, t0) +O(h3/2) (6.38)

On substituting eqn. 6.38 into eqn. 6.36 we obtain:

S(x, t|x0, t0) = limh→0

[2πσ2h]1

2S(x, t|x0, t0)

∫ ∞

−∞dxn exp

[− h

2σ2

(x− xn)

h

)2]

+ limh→0

[2πσ2h]1

2∂

∂xS(x, t|x0, t0)

∫ ∞

−∞dxn(xn − x) exp

[− h

2σ2

(x− xn)

h

)2]

+ limh→0

[2πσ2h]1

21

2

∂2

∂x2S(x, t|x0, t0)

∫ ∞

−∞dxn(xn − x)2 exp

[− h

2σ2

(x− xn)

h

)2]

+ O(h3/2)

(6.39)

However the integrals in eqn. 6.39 are just Gaussian. In fact, respectively,they are the moments < 1 >= 1, < yn >= 0, < y2

n >= σ2h, where yn =x− xn. Thus eqn. 6.39 reduces to:

h−1[S(x, t|x0, t0) − S(x, tn|x0, t0)] =1

2σ2 ∂

2

∂x2S(x, tn|x0, t0) +O(

√h)

In the final limit as h→ 0, tn → t we obtain

∂tS(x, t|x0, t0) =

1

2σ2 ∂

2

∂x2S(x, t|x0, t0)

which is of course eqn. 2.24, the Wiener–Bachelier equation with µ = 0. Itis also easy to show that

limt0

S(x, t|x0, t0) = limh→0

[2πσ2h]1

2 exp

[−(x− x0)

2

2σ2h

]

= δ(x− x0)

as expected.

75

Page 78: Course notes on Financial Mathematics

Thus the path integral for S(x, t|x0, t0) is equivalent to the solution of theWiener–Bachelier equation. Conversely the solution of eqn. 2.24 with µ = 0is given by the Wiener or path integral 6.7 written as:

⟨exp

− 1

2σ2

t∫

t0

(dx

)2

where < · > denotes the sum over all paths such that |x(t)−xn| = O(√h), i.e.

over all Brownian paths, as defined in §3. Note once again that (dx/dτ)2 =O(h−1), i.e. the integral is not well–defined.

6.3 A generalization

Consider now the modified diffusion equation:

∂P

∂t− 1

2σ2∂

2P

∂x2+ V (x, t)P = 0 (6.40)

on −∞ < x < ∞ with SV (x, t|x0, t0) defined for t t0 via the initialcondition

limtt0

SV (x, t|x0, t0) = δ(x− x0). (6.41)

In fact SV (x, t|x0, t0) is the solution of the partial differential equation[∂

∂t− 1

2σ2 ∂

2

∂x2+ V (x, t)

]SV (x, t|x0, t0) = δ(x− x0)δ(t− t0) (6.42)

on −∞ < x <∞,−∞ < t <∞, with SV = 0 for t < t0, just as S0(x, t|x0, t))is the solution of the corresponding equation with V = 0, eqn. 5.2.2.

Eqn. 6.40 has a simple physical interpretation. Suppose V (x, t) is theprobability per unit time that a Brownian ‘particle’ will be absorbed, i.e.that the random walk x[τ ] will terminate. Then in any small interval hthe probability of absorption is V h and the probability of no absorption is1 − V h. In the limit the probability that the Brownian particle will surviveabsorption from (x0, t0) to (x, t) along one path x(τ) is:

P [x(τ)] = exp

t∫

t0

V (x(τ), τ)dτ

(6.43)

76

Page 79: Course notes on Financial Mathematics

and so the density of Brownian particles at (x, t) which survive absorptionalong all paths from (x0, t0) is

SV (x, t|x0, t0) =

⟨exp

− 1

2σ2

t∫

t0

(dx

)2

dτ −t∫

t0

V (x(τ), τ)dτ

(6.44)

To see this note that the Wiener integral in eqn. 6.44 can be written outin full as:

limn→∞,h→0

[2πσ2h]−n+1

2

n∏

i=1

∫ ∞

−∞dxi

exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2 − h

n∑

i=1

V (xi, ti)

](6.45)

for t > t0 and 0 for t < t0.Now expand the second exponential factor as

exp

[−h

n∑

i=1

V (xi, ti)

]=

1 − h

n∑

i=1

V (xi, ti) +1

2!h2

n∑

i=1

n∑

j=1

V (xi, ti)V (xj, tj) − O(h3) (6.46)

so that the expression in 6.45 becomes

limn→∞,h→0

[2πσ2h]−n+1

2

n∏

i=1

∫ ∞

−∞dxi

exp

[− 1

2σ2h

n∑

i=0

(xi+1 − xi)2

]

[1 − h

n∑

i=1

V (xi, ti) +1

2!h2

n∑

i=1

n∑

j=1

V (xi, ti)V (xj, tj) − · · ·]

(6.47)

The first term in this expression is just S0(x, t|x0, t0). The second term canbe rewritten as:

−hn∑

i=1

∫ ∞

−∞dxiS0(x, t|xi, ti)V (xi, ti)S0(xi, ti|x0, t0)

77

Page 80: Course notes on Financial Mathematics

and the third term as:

+1

2!h2

n∑

i=1

n∑

j=1

∫ ∞

−∞dxi

∫ ∞

−∞dxj

S0(x, t|xi, ti)V (xi, ti)S0(xi, t− i|xj, tj)V (xj, tj)S0(xj, t− j|x0, t0)

Note however that

limh→0

hn∑

i=1

=

t∫

t0

dti

and the factors 12!, 1

3!, · · · can be left out if the time variables are ordered such

that:t0 < ti < t, t0 < tj < ti < t, · · · .

But this is true by the definition of S0 = 0 for t < t0.It follows from all this that eqn. 6.44 can be reduced to the form:

SV (x, t|x0, t0) = S0(x, t|x0, t0)

−∫ ∞

−∞dx′∫ t

t0

dt′S0(x, t|x′, t′)V (x′, t′)S0(x′, t′|x0, t0)

+

∫ ∞

−∞dx′∫ t

t0

dt′∫ ∞

−∞dx′′

∫ t′

t0

dt′′

S0(x, t|x′, t′)V (x′, t′)S0(x′, t′x′′, t′′)V (x′′, t′′)S0(x

′′, t′′|x0, t0)

− · · · .(6.48)

Such an expansion can be shown to be the solution of the integral equation:

SV (x, t|x0, t0) = S0(x, t|x0, t0)

−∫ ∞

−∞dx′∫ t

t0

dt′S0(x, t|x′, t′)V (x′, t′)SV (x′, t′|x0, t0)

(6.49)

by using the initial trial solution SV = S0 and iterating.Now apply the operator

L ≡ ∂

∂t− 1

2σ2 ∂

2

∂x2(6.50)

78

Page 81: Course notes on Financial Mathematics

to eqn. 6.49 and use eqn. 5.2.2 to obtain:

LSV = δ(x− x0)δ(t− t0)−∫ ∞

−∞dx′∫ t

t0

dt′δ(x− x′)δ(t− t′)V (x′, t′)SV (x′, t′|x0, t0)

= δ(x− x0)δ(t− t0) − V (x, t)SV (x, t|x0, t0)

i.e.

(∂

∂t− 1

2σ2 ∂

2

∂x2+ V )SV = δ(x− x0)δ(t−0)

which is just eqn. 6.42. Thus there is a complete equivalence between theWiener integral for SV (x, t|x0, t0) and the above partial differential equation.

• Note that the term

exp

− 1

2σ2

t∫

t0

(dx

)2

in eqn. 6.2 is internal and can be absorbed into the definition of theexpectation < · >. Thus we can rewrite eqn. 6.44 as:

SV (x, t|x0, t0) =

⟨exp

t∫

t0

V (x(τ), τ)dτ

(6.51)

This is one version of what is known as the Feynman–Kac formula

relating the source function or propagator of the differential operator∂∂t− 1

2σ2 ∂2

∂x2 + V to the Wiener integral < exp[−∫V dτ ] >.

• Suppose now thatV (x(τ), τ) = V (x) (6.52)

and that V (x) → ∞ as x → ±∞. In such a case the eigenvalueproblem:

1

2σ2d

2X

dx2− V (x)X(x) = −λX(x) (6.53)

has a discrete spectrum λ1, λ2, λ3, · · · with corresponding normalizedeigenfunctions X1(x), X2(x), X3(x), · · · and (as in eqn. 5.37):

SV (x, t|x0, t0) =∞∑

n=1

Xn(x)Xn(x0) exp [−λn(t− t0)] . (6.54)

79

Page 82: Course notes on Financial Mathematics

Chapter 7

The Feynman–Kac formula

Consider now the case when

limx→x0,tt0

P (x, t) = P (x0, t0) = φ(x0) (7.1)

for any test function φ(x). As we saw in §3:

P (x, t) =

∫ ∞

−∞dx0P (x, t|x0, t0)P (x0, t0)

=

∫ ∞

−∞dx0SV (x, t|x0, t0)φ(x0) (7.2)

It follows from eqn. 6.51 that:

P (x, t) =

∫ ∞

−∞dx0

⟨exp

t∫

t0

V (x(τ), τ)dτ

⟩φ(x(t0)) (7.3)

This expression for P (x, t) is an ordinary Riemann integral over x0 weightedby φ(x(t0)) of exp[−

∫V dτ ] summed over all Brownian paths from x(t0) = x0

to x(t) = x for each x0. We can rewrite it by absorbing∫∞−∞ dx0 into the

Wiener measure

[2πσ2h]−n+1

2

∫ ∞

−∞dx1

∫ ∞

−∞dx2 · · ·

∫ ∞

−∞dxn

as the expectation of

exp

[−∫ t

t0

V (x(τ))dτ

]φ(x(t))

80

Page 83: Course notes on Financial Mathematics

over all Brownian paths starting somewhere in −∞ < x0 < ∞. It followsthat:

P (x, t) =

⟨exp

[−∫ t

t0

V (x(τ))dτ

]φ(x(t))

⟩(7.4)

This is the general form of the Feynman–Kac formula associated with theforward differential operator

∂t− 1

2σ2 ∂

2

∂x2+ V ≡ LV .

To compute this we use eqn. 7.2 and we therefore need an explicit expres-sion for SV (x, t|x0, t0). In some cases this can be obtained, e.g. if

V (x(τ)) = V (7.5)

then P (x, t) reduces to

P (x, t) =

∫ ∞

−∞dx0 exp[−V (t− t0)]S0(x, t|x0, t0)φ(x0)

= [2πσ2(t− t0)]− 1

2 exp[−V (t− t0)]∫ ∞

−∞dx0 exp

[− (x− x0)

2

2σ2(t− t0)

]φ(x0)

=exp[−V (t− t0)]√

∫ ∞

−∞dz exp[−z

2

2]φ(x− σ

√(t− t0)z)

(7.6)

for t > t0, where z is a standardized Gaussian random variable with zeromean and unit variance.

7.1 The backward equation

Exactly the same arguments apply to the backward differential operator

− ∂

∂t− 1

2σ2 ∂

2

∂x2+ V ≡ L∗

V .

The result is:

P (x, t) =exp[−V (t− t′)]√

∫ ∞

−∞dz exp[−z

2

2]φ(x+ σz

√(t− t′)) (7.7)

81

Page 84: Course notes on Financial Mathematics

for t > t′. So eqn. 7.4 becomes:

P (x, t) = 〈exp[−V (t− t0)]φ(x)〉=

⟨exp[−V (t− t0)]φ(x− σz

√(t− t0))

⟩(7.8)

for t > t0, the forward equation, and

P (x, t) = 〈exp[−V (t0 − t)]φ(x)〉=

⟨exp[−V (t0 − t)]φ(x+ σz

√(t0 − t))

⟩(7.9)

for t < t0, the backward equation.Thus we have linked P (x, t) to the underlying stochastic process W (t)

representing the Brownian motion.

7.1.1 Extension to non–zero µ

From the symmetries of S(x, t) we know that S(x − µt, t) also solves thediffusion equation. Thus the Green’s function for the operator ∂

∂t+ µ ∂

∂x−

12σ2 ∂2

∂x2 + V is SV (x− µt, t). However the Green’s function for the backward

operator − ∂∂t− µ ∂

∂x− 1

2σ2 ∂2

∂x2 +V is SV (−x+µt,−t) = SV (−x− µ(−t),−t),and so eqns. 7.8 and 7.9 are modified in such cases to:

P (x, t) =⟨exp[−V τ ]φ(x− µτ − σz

√τ)⟩, τ = t− t0 > 0

=⟨exp[−V τ ]φ(x + µτ + σz

√τ )⟩, τ = t0 − t > 0

(7.10)

7.2 More general Markov processes

Ito’s lemma, eqn. 3.16 tells us that if we have a stochastic process

dx = K1dt+√K2dW

and if f [x] is a smooth monotonic function of x, then the correspondingstochastic process is given by the equation:

df = (K1fx +1

2K2fxx)dt+

√K2fxdW.

82

Page 85: Course notes on Financial Mathematics

The corresponding Kolmogorov equations are:

−∂tP = K1Px +1

2K2Pxx

and

−∂tP = (K1fx +1

2K2fxx)Pf +

1

2K2f

2xPff

However the initial condition (or final condition in the backward case)

P (x, t0) = φ(x0) (7.11)

becomesP (f [x], t0) = φ(exp[f [x0]]) (7.12)

• Example–the lognormal process

dx = µxdt+ σxdW

−∂tP = µxPx +1

2σ2x2Pxx

y = f [x] = log x, x = exp[y]

dy = (µ− 1

2σ2)dt+ σdW

−∂tP = (µ− 1

2σ2)Py +

1

2σ2Pyy

So in this case eqn. 7.10 for the backward case gives

P (y, t) =

⟨exp[−V τ ]φ(exp[y + (µ− 1

2σ2)τ + σz

√τ ])

and so:

P (x, t) = exp[−V τ ]⟨φ(x exp[(µ− 1

2σ2)τ + σz

√τ ])

⟩. (7.13)

• the expression

x exp[(µ− 1

2σ2)τ + σz

√τ ]

83

Page 86: Course notes on Financial Mathematics

describes the exponential stochastic process

x(τ) = x(0) exp[ ˆy(τ)] (7.14)

with x(0) = x and

y = (µ− 1

2σ2)τ + σz

√τ

= (µ− 1

2σ2)τ + σ(W (τ) −W0) (7.15)

from eqn. 3.6. Thus we may finally write, for the backward case:

P (x, t) = exp[−V τ ] 〈φ(x(τ))〉 (7.16)

Eqn. 7.16 also holds for the forward case with

x(τ) = x(0)exp[− ˆy(τ)]. (7.17)

7.3 An alternative derivation

Instead of using the symmetries of S(x, t) we can also proceed as follows.Eqns. 7.6 and 7.7 obtain in case S = S0, the propagator for µ = 0. Can wereduce to this case via other than the coordinate transformation of §5.2.1,x→ x− µt, t→ ±t?

Consider the backward equation for the lognormal process again, in theform:

−∂tP = (µ− 1

2σ2)Py +

1

2σ2Pyy (7.18)

We transform it to a non–dimensionalized forward equation using the sub-stitutions:

τ =1

2σ2(t0 − t), κ =

µ12σ2

(7.19)

and let

P (y, τ) = exp

[−1

2(κ− 1)y − 1

4(κ− 1)2τ

]Q(y, τ) (7.20)

Eqn. 7.18 then reduces to the diffusion equation:

Qτ = Qyy, τ > 0. (7.21)

84

Page 87: Course notes on Financial Mathematics

This equation can also be derived from the backward equation 4.11

−∂tP = µPx +1

2σ2Pxx (7.22)

with final condition P (x, t0) = φ(x), leading to P (y, t0) = φ(exp[y]) so that

Q(y, 0) = exp

[1

2(κ− 1)y

]φ(exp[y]) (7.23)

and

Q(y, τ) =1

2√πτ

∫ ∞

−∞exp

[−(y − s)2

]Q(s, 0)ds

=1√2π

∫ ∞

−∞exp

[−1

2z2

]Q(y +

√2τz, 0)dz

=1√2π

∫ ∞

−∞exp

[−1

2z2 +

1

2(κ− 1)(y +

√2τz)

]φ(exp[y +

√2τz]

)dz.

(7.24)

where z = (s− y)/√

2τ . It follows that

P (y, τ) = exp[−1

4(κ− 1)2]

1√2π

∫ ∞

−∞exp

[−1

2z2

]dz

· exp

[1

2(κ− 1)

√2τz

]φ(exp[y +

√2τz]

)(7.25)

and that

P (x, t) = exp

[−1

4(κ− 1)2 · 1

2σ2(t0 − t)

]

· 1√2π

∫ ∞

−∞exp

[−1

2z2

]dz exp

[1

2(κ− 1)σ

√(t0 − t)z

]

·φ(x exp[σ

√(t0 − t)z]

). (7.26)

By a simple extension it follows that a form equivalent to eqn. 7.13 is:

P (x, t) = exp

[−V (t0 − t) − 1

4(κ− 1)2 · 1

2σ2(t0 − t)

]

· 1√2π

∫ ∞

−∞exp

[−1

2z2

]dz exp

[1

2(κ− 1)σ

√(t0 − t)z

]

·φ(x exp[σ

√(t0 − t)z]

). (7.27)

85

Page 88: Course notes on Financial Mathematics

Evidently the expression in eqn. 7.16 is a more compact and natural form.[See Duffie for a derivation of eqn. 7.16 and Wilmott et.al. for eqn. 7.27.]

86

Page 89: Course notes on Financial Mathematics

Chapter 8

Options

(Following Cox & Rubenstein)A call option is the right to buy at a fixed price, a share of a specific stock

on or before a specified time te.

• Se = the market price of the stock on the expiration date or expiry te.

• E = the fixed or strike price

• C(te) = the value of the call on one share of the stock at expiry

For a simple vanilla call option we define the call payoff function to beC(Se, te) whose value is defined as:

C(Se, te) = max(0, Se − E) (8.1)

Figure 8.1 shows such a function:

C*

S*0 E

Figure 8.1: Payoff function for a simple vanilla call option

87

Page 90: Course notes on Financial Mathematics

Let S(t) be the market price of a share of the stock at time t < te and C(t)the value of the call option on one share of the stock at time t. Evidently

max(0, S(te) − E) ≤ C(t) ≤ S(t) (8.2)

We can use the no arbitrage condition to prove this.In similar fashion we define a put option as the right to sell a share of a

specific stock at a fixed price, on or before a specified time te, and define theput payoff function to be P (Se, te) whose value is:

P (Se, te) = max(0, E − Se) (8.3)

The put payoff function is shown in figure 8.2. Let P (t) be the value of

S*0

P*

E

Figure 8.2: Payoff function for a simple vanilla put option

the put option on one share of the stock at time t. Using the no arbitragecondition we can again prove that:

max(0, E − S(te)) ≤ P (t) ≤ E (8.4)

Can we improve on these inequalities? We saw in §1 that C and P dependonly on S(t), Se and E, the interest rate r, the trading period τ = te− t, andon the volatility of S over the trading period. In what follows we derive apartial differential equation for C(S, t) using the two basic hypothesis:

• the return on S is a Markov process

• arbitrage is not possible

88

Page 91: Course notes on Financial Mathematics

8.1 Deriving the Black–Scholes equation

Recall our basic random walk depicted in figure 8.3. From this figure wecan see that the rate of return on S is u − 1 with probability p and d − 1with probability q. Assuming that a bank deposit of $D increases to $(erD)over one trading period, the interest rate over one period is er − 1. The noarbitrage condition then requires that:

d < er < u (8.5)

S

uS

dSq

p

Figure 8.3: The random walk of S.

In addition since C = max(0, S − E) at t = te, we can write the diagramshown in figure 8.4. i.e. over one period of duration τ , C(S, te − τ) →C(Se, te) = max(0, uS−E) with probability p and C(S, te−τ) → C(Se, te) =max(0, dS − E) with probability q.

Consider now the following scenario. Form a portfolio comprising ∆shares of stock and $B worth of riskless bonds. If the total value of thisportfolio at time te−τ before expiry is $(∆S+B) then at te with probabilityp its value will be $(∆uS + erB), and with probability q its value will be$(∆dS + erB) as shown in figure 8.5. But we saw in §1 that an appropri-ate position in stock can replicate the future returns of a call option. Wecan therefore identify $(∆dS + erB) with C(dS, te), and $(∆uS + erB) withC(uS, te), so that:

∆dS + erB = C(dS, te)

∆uS + erB = C(uS, te) (8.6)

89

Page 92: Course notes on Financial Mathematics

q

p

t0C(S, -τ)

C(uS,t0 )

C(dS,t0 )

Figure 8.4: The random walk of C.

q

p

∆S+B

∆dS+ Ber

∆uS+ Ber

Figure 8.5: The random walk of a portfolio.

90

Page 93: Course notes on Financial Mathematics

and if there is to be no arbitrage, $(∆S +B) with C(S, te − τ), i.e.

∆S +B = C(S, te − τ) (8.7)

It follows from eqn. 8.6 that

∆ =C(uS, te) − C(dS, te)

(u− d)S, B =

e−r[uC(dS, te) − dC(uS, te)]

u− d(8.8)

On substituting these expressions for ∆ and B into eqn. 8.7 we obtain theequation:

C(S, te − τ) =C(uS, te) − C(dS, te)

(u− d)+e−r[uC(dS, te) − dC(uS, te)]

u− d

=1 − e−rd

u− dC(uS, te) +

e−ru− 1

u− dC(dS, te) (8.9)

Let

p =er − d

u− d, q = 1 − p =

u− er

u− d(8.10)

thenC(S, te − τ) = e−r [pC(uS, te) + qC(dS, te)] (8.11)

Some points to note:

• the probability p does not appear in eqn. 8.11

• C(S, te − τ) is independent of investor’s attitudes towards risk

• The only random variable in eqn. 8.11 is S

• 0 < p < 1 so p is like a probability. In fact p is the value p would havein equilibrium if investors were risk neutral. That is,

p(uS(te − τ)) + q(dS(te − τ)) = erS(te − τ) =< S(te) >

whencepu+ qd = pu+ (1 − p)d = er

and

p =er − d

u− d= p, q =

u− er

u− d= q

Thus the value of a call equals the expectation of its discounted futurevalue in a risk neutral world.

91

Page 94: Course notes on Financial Mathematics

• This does not imply that the equilibrium expected rate of return onthe call is the riskless interest rate

• One can also show that e−rp is the value of a claim that will pay $1at te if and only if S(te − τ) → uS(te). Similarly e−rq is the value ofa claim that will pay $1 at te if and only if S(te − τ) → dS(te). Thepayoff to a call is equivalent to that of a package containing C(uS, te)units of the first claim and C(dS, te) units of the second, so its valueshould be e−r[pC(uS, te) + qC(dS, te)]

8.1.1 More frequent trading

We now subdivide the trading period τ into n intervals of duration h. So

τ = nh (8.12)

Let er− 1 be the interest rate over a short trading period of duration h priorto expiry. Evidently

er = erh (8.13)

Then ern = ernh = erτ and eqn. 8.11 for a trading period of duration h canthen be written as:

C(S, te − h) = e−rh [pC(dS, te) + qC(uS, te)] (8.14)

where

p =er − d

u− d, q = 1 − p =

u− er

u− d(8.15)

We now use eqn. 2.12 and choose

u = exp(σ√h), d = exp(−σ

√h) (8.16)

whence

p =erh − e−σ

√h

eσ√h − e−σ

√h, q = 1 − p =

eσ√h − erh

eσ√h − e−σ

√h

(8.17)

It follows that eqn. 8.14 can be expanded as:

erhC(S, te − h) =erh − e−σ

√h

eσ√h − e−σ

√hC(eσ

√hS, te)

+eσ

√h − erh

eσ√h − e−σ

√hC(e−σ

√hS, te)

92

Page 95: Course notes on Financial Mathematics

Now lett = te − h (8.18)

to obtain:

erhC(S, t) =erh − e−σ

√h

eσ√h − e−σ

√hC(eσ

√hS, t + h)

+eσ

√h − erh

eσ√h − e−σ

√hC(e−σ

√hS, t + h)

and Taylor expand C(e±σ√hS, t+ h) about C(S, t) using

f(αx) = f(x) + (α− 1)xf ′(x) +1

2(α− 1)2x2f ′′(x) (8.19)

to give

C(e±σ√hS, t+ h) = C(S, te) + (e±σ

√h − 1)S

∂C

∂S+

1

2(e±σ

√h − 1)2S2∂

2C

∂S2

+h∂C

∂t+ · · · (8.20)

and then use the expansions:

e±σ√h = 1 ± σ

√h +

1

2σ2h± O(h3/2) (8.21)

and

erh = 1 + rh+1

2r2h2 − O(h3) (8.22)

The result is that to O(h2),

−h∂C∂t

= hrS∂C

∂S+h

2σ2S2∂

2C

∂S2− hrC +O(h2)

In the limit h→ 0, we obtain the partial differential equation

−∂C∂t

= rS∂C

∂S+

1

2σ2S2∂

2C

∂S2− rC (8.23)

This is the Black–Scholes equation. It is a backward partial differential equa-tion with final condition

C(S, te) = max(0, S(te) − E) (8.24)

Some points of interest

93

Page 96: Course notes on Financial Mathematics

• In the limit of small h,

p→ 1

2(1 +

µ

σ

√h); q → 1

2(1 − µ

σ

√h)

where

µ = r − 1

2σ2 (8.25)

• It also follows that(q − p) log u = −µh (8.26)

as in eqn. 2.23.

• Evidently the Black–Scholes equation is just the normal limit of thebackward recursion given in eqn. 4.10 with the drift coefficient µ re-placed by the r

• Recall that er−1 is the interest rate α over a trading period of durationτ and the total return over the period τ is (1 + α)τ . Similarly er − 1is the interest rate α/n over a trading period of duration τ/n = h, andthe total return over n such trading periods is (1 + α/n)n. But

limn→∞

(1 +α

n)n = eα

and since (1 + α/n)n = (1 + α)τ = erτ it follows that

α

τ= r (8.27)

i.e. r is the continuously compounded interest rate we introduced in§1.

8.2 Martingales

Given a Wiener process W (t) with probability density:

P (W, t|0, 0) = [2π t]−1/2 exp

[−W (t)2

2t

](8.28)

and initial conditionW (0) = W0 = 0 (8.29)

94

Page 97: Course notes on Financial Mathematics

it follows that:〈W (t)〉 = 0,

⟨W (t)2

⟩= t (8.30)

The autocorrelation function is defined as:

〈W (t)W (s)|0〉 =

∫ ∫dW1dW2W1W2P (W1, t;W2, s|0, 0) (8.31)

If t > s then

〈W (t)W (s)|0〉 ≡ 〈W (t)W (s)〉 =⟨(W (t) −W (s))W (s) +W 2(s)

= 〈(W (t) −W (s))W (s)〉 +⟨W 2(s)

= 〈(W (t) −W (s))〉 〈W (s)〉 +⟨W 2(s)

=⟨W 2(s)

⟩= s

and if t < s〈W (t)W (s)|0〉 ≡ 〈W (t)W (s)〉 = t

so that, in general

〈W (t)W (s)|0〉 ≡ 〈W (t)W (s)〉 = min(t, s) (8.32)

Now consider the following application of the Ito calculus. Eqn. 3.14when applied to the random function

exp

[−1

2t+W (t)

]

gives

d(exp

[−1

2t+W (t)

]) = −1

2exp

[−1

2t+W (t)

]dt+

1

2exp

[−1

2t+W (t)

]dt

+ exp

[−1

2t +W (t)

]dW

= exp

[−1

2t+W (t)

]dW

Thus exp[−1

2t +W (t)

]and not exp [W (t)] is the eigenfunction of stochastic

differentiation.

95

Page 98: Course notes on Financial Mathematics

We extend this result to the (random) function

Z(t) = Z0 exp

[(µ− 1

2σ2)t+ σ(W (t) −W0)

](8.33)

whencedZ(t) = µZ(t)dt+ σZ(t)dW (t) (8.34)

as expected.

In addition⟨

(µ− 1

2σ2)t+ σ(W (t) −W0)

⟩= (µ− 1

2σ2)t+ σ 〈W (t) −W0〉

= (µ− 1

2σ2)t

It now follows from eqn. 2.20 that

⟨exp

[(µ− 1

2σ2)t + σ(W (t) −W0)

]⟩= 〈Z(t)〉 = exp[µt] (8.35)

Consider now the (random) function

exp

[(µ− 1

2σ2)(t− s) + σ(W (t) −W (s))

]≡ Z+, t > s

Evidently we can writeZ(t) = Z+Z(s) (8.36)

so that〈Z(t)|W (s)〉 =

⟨Z+Z(s)|W (s)

⟩=⟨Z+|W (s)

⟩Z(s)

since if W (s) is given, so is Z(s).But

⟨(µ− 1

2σ2)(t− s) + σ(W (t) −W (s))|W (s)

= −(µ− 1

2σ2)(t− s) + σ 〈W (t)|W (s)〉 − σ 〈W (s)|W (s)〉

= −(µ− 1

2σ2)(t− s) + σW (s) − σW (s) = (µ− 1

2σ2)(t− s)

96

Page 99: Course notes on Financial Mathematics

It follows once again from eqn. 2.20 that

⟨Z+|W (s)

⟩=

⟨exp

[(µ− 1

2σ2)(t− s) + σ(W (t) −W(s))

]⟩

= exp

[(µ− 1

2σ2)(t− s) +

1

2σ2(t− s)

]= exp[µ(t− s)]

and therefore that

〈Z(t)|W (s)〉 = eµ(t−s)Z(s), t > s (8.37)

Thus the expectation of Z given W (s) = Ws, for times t > s is justeµ(t−s)Z(s). Evidently

〈Z(t)|W (s)〉 ≥ Z(s), t > s (8.38)

This defines Z(t) as a Submartingale.In case µ = 0 it follows that

〈Z(t)|W (s)〉 = Z(s), 〈Z(t)〉 = 1, t > s (8.39)

This defines Z(t) as a Martingale

In case µ < 0〈Z(t)|W (s)〉 ≤ Z(s), t > s (8.40)

In which case Z(t) is a Supermartingale.

8.2.1 Martingales and options

What is the connection between Martingales and options? Eqn. 8.14 can bewritten in the form:

〈C(S, t)〉P = er(t−s)C(S, s), t > s (8.41)

where P refers to the risk–adjusted probabilities p and q introduced earlier.It follows immediately that C(S, t) is a Submartingale.

LetZ(t) = e−r(t−s)C(t), t > s (8.42)

On substituting this into eqn. 8.41 we obtain

〈Z(S, t)〉P = Z(S, s), t > s (8.43)

i.e. C(S, t) after discounting and using risk adjusted probabilities, is a Mar-tingale.

97

Page 100: Course notes on Financial Mathematics

8.3 Girsanov’s Theorem

Now consider again the standardized Gaussian random variable z with zeromean and unit variance. Its distribution is just

P (z)dz =1√2π

exp

[−1

2z2

]dz (8.44)

Let

Tµ(z) = exp

[µz − 1

2µ2

](8.45)

It follows that

Tµ(z)P (z) =1√2π

exp

[−1

2(z − µ)2

]

We can rewrite this in terms of the new probability density

P (z) = Tµ(z)P (z) (8.46)

as

P (z)dz =1√2π

exp

[−1

2(z − µ)2

]dz (8.47)

So the action of Tµ shifts the probability density of z to that of z − µ. Inso doing the probability measure changes from P (z)dz = dP(z) to P (z)dz =dP(z).

Evidently this change of measure is reversible. Thus let

T−1µ (z) =

1

Tµ(z)= exp

[1

2µ2 − µz

](8.48)

ThenT−1µ (z)P (z) = P (z) (8.49)

• We can write this action as

Tµ(z) =dP(z)

dP(z)(8.50)

where d/dP(z) is the Radon–Nikodym derivative with respect to themeasure dP(z).

98

Page 101: Course notes on Financial Mathematics

Girsanov’s theorem is a simple extension of this action to random processes.Thus, if W (t) is a Wiener–Bachelier process with measure dP(z) = P (z)dzthen dP(z) is the measure of a new Wiener–Bachelier process W (t) such that

dW = µdt+ dW (8.51)

• Consider the stochastic differential equation

dx = µdt+ σdW (8.52)

with

P (x) =1√

2πσ2texp

[−(x− µt)2

2σ2t

](8.53)

It is easy to see that

P (x) = T−1µ/σ(x)P (x) =

1√2πσ2t

exp

[− x2

2σ2t

](8.54)

and we can rewrite eqn. 8.52 as

dx = σdW (8.55)

• It follows from eqns. 8.33 and 8.34 that

S(t) = S0 exp

[(µ− 1

2σ2)t+ σ(W (t) −W0)

](8.56)

is a solution of the equation

dS = µSdt+ σSdW (8.57)

• Girsanov’s theorem now indicates that we can replace the drift termµSdt in eqn. 8.57 with another term (which need not be zero), forexample with the term rSdt, whence:

dS = µSdt+ σSdW = rSdt+ σSdW (8.58)

It follows that

dW =µ− r

σdt + dW = λdt+ dW (8.59)

and that

S(t) = S0 exp

[(r − 1

2σ2)t+ σ(W (t) − W0)

](8.60)

is a solution of eqn. 8.58.

99

Page 102: Course notes on Financial Mathematics

8.4 Solving the Black–Scholes equation

To solve eqn. 8.23 we first transform it into a form suitable for the applicationof the Feynman–Kac formula.Let

x = logS (8.61)

then

S∂

∂S=

∂x, S2 ∂

2

∂S2=

∂2

∂x2− ∂

∂x(8.62)

Eqn. 8.23 then becomes

− ∂

∂tC(x, t) = (r − 1

2σ2)

∂C

∂x+

1

2σ2∂

2C

∂x2− rC (8.63)

with final conditionC(x, te) = max(0, ex − E) (8.64)

What about boundary conditions? Evidently if S = 0 then uS = dS = 0 soS ≡ 0 and a call option is worthless, whence

C(0, t) = 0

If S → ∞, max(0, S − E) → S so

limS→∞

C(x, t) = S

These conditions translate into

C(−∞, t) = 0, limx→∞

C(x, t) = ex (8.65)

Before proceeding further we non–dimensionalize eqn. 8.63 via

τ =1

2σ2t, κ = r/

1

2σ2 (8.66)

This leads to the equation:

− ∂

∂τC(x, τ) = (κ− 1)

∂C

∂x+∂2C

∂x2− κC (8.67)

We are now able to apply the Feynman–Kac formula in the form of eqn. 7.10,i.e.

C(x.τ) = exp[−κτ ]⟨C(x + (κ− 1)τ + z

√2τ)⟩

(8.68)

100

Page 103: Course notes on Financial Mathematics

where τ = τ0 − τ ≥ 0.

Before evaluating this expression we rederive it by solving the partial differ-ential equation 8.67 directly.

We note that since dτ = −dτ we may rewrite this equation as a forward

equation in τ i.e. as:

∂τC(x, τ ) = (κ− 1)

∂C

∂x+∂2C

∂x2− κC (8.69)

We next remove the term −κC using the transformation

C(x, τ ) = exp[−κτ ]Q(x, τ ) (8.70)

as in eqn. 5.26. The result is

∂τQ(x, τ ) = (κ− 1)

∂Q

∂x+∂2Q

∂x2(8.71)

with Q(x, τ0) = C(x, τ0) = C(x, 0).

We now remove the drift term using the coordinate shift

y = x + (κ− 1)τ , τ = τ (8.72)

as in eqn. 5.18. The result is the equation

∂τQ(y, τ) =

∂2Q

∂y2(8.73)

with Q(y, 0) = Q(x, 0). Its solution on −∞ < y <∞ is just

Q(y, τ) =1

2√πτ

∞∫

−∞

exp

[−(y − s)2

]Q(s, 0)ds (8.74)

whereQ(y, 0) = max(ey − E, 0) (8.75)

101

Page 104: Course notes on Financial Mathematics

Thus

Q(y, τ) =1√2π

∞∫

−∞

exp

[−1

2z2

]Q(y +

√2τ z, 0

)dz

=1√2π

∞∫

−∞

exp

[−1

2z2

]max

(ey+

√2τ z − E, 0

)dz

=1√2π

∞∫

−(y−logE)/√

exp

[y +

√2τ z − 1

2z2

]dz

− E√2π

∞∫

−(y−logE)/√

exp

[−1

2z2

]dz (8.76)

The first integral reduces to

ey√2π

∞∫

−(y−logE)/√

exp

[−1

2z2 −

√2τ z

]dz

=ey+τ√

∞∫

−(y−logE)/√

exp

[−(z√2−

√τ

)2]dz

=ey+τ√

∞∫

−(y+2τ−logE)/√

exp

[−1

2z2

]dz

= ey+τΦ

(y + 2τ − logE√

)(8.77)

using eqn. 2.6.

Similarly the second integral reduces to

(y − logE√

)(8.78)

i.e.

Q(y, τ) = ey+τΦ

(y + 2τ − logE√

)− EΦ

(y − logE√

)(8.79)

102

Page 105: Course notes on Financial Mathematics

Using eqns. 8.72 and 8.70 this expression reduces to:

C(x, τ) = exΦ

(x+ (κ+ 1)τ − logE√

)− Ee−κτΦ

(x + (κ− 1)τ − logE√

)

(8.80)

Finally from eqn. 8.66 we obtain:

C(S, T ) = SΦ(y) − Ee−rTΦ(y − σ√T ) (8.81)

where T = te − t and:

y =log[S/Ee−rT

]

σ√T

+1

2σ√T (8.82)

Eqn. 8.81 is the Black–Scholes formula for C as a function of S(T ) where Tis the time to expiry of the option. Note that

•C(S, 0) = (S − E)Φ(∞) = S − E, S ≥ E (8.83)

•C(0, T ) = (S − Ee−rT )Φ(−∞) = 0 (8.84)

•limS→∞

C(S, T ) = SΦ(log S) = S (8.85)

• e−rT is the $ amount that would have to be paid at time t in order toobtain $ 1 with certainty at time te. So if payment of the strike priceE will be made at time te its value at time t is Ee−rT .

A formula similar to that in eqn. 8.81 can also be derived for the Europeanput option. A quick way to derive it is to use put–call parity, eqn. 1.1 in theform:

C − P = S − E exp[−rT ] (8.86)

whence from eqn. 8.81

P (S, T ) = −S(1 − Φ(y)) + Ee−rT (1 − Φ(y − σ√T ))

103

Page 106: Course notes on Financial Mathematics

butΦ(y) + Φ(−y) = 1 (8.87)

soP (S, T ) = Ee−rTΦ(σ

√T − y) − SΦ(−y) (8.88)

This is the Black–Scholes formula for the European put.

8.5 Another derivation of the Black–Scholes

equation

Consider an option V (S, t), either a European call or put. If the return onS is lognormally distributed it follows from Ito’s lemma, eqn. 3.16, that

dV =

(∂V

∂t+ µS

∂V

∂S+

1

2σ2S2∂

2V

∂S2

)dt+ σS

∂V

∂SdW (8.89)

Now build a portfolio Π comprising the option V and −∆S shares of stock:

Π = V − ∆S (8.90)

It follows that

dΠ =

(∂V

∂t+ µS

∂V

∂S+

1

2σ2S2∂

2V

∂S2

)dt

+σS∂V

∂SdW − µ∆Sdt− σ∆SdW

=

(∂V

∂t+ µS

∂V

∂S+

1

2σ2S2∂

2V

∂S2− µ∆S

)dt

+σS(∂V

∂S− ∆)dW (8.91)

We can now eliminate the random component in this expression by choosing

∆ =∂V

∂S(8.92)

Eqn. 8.91 then reduces to:

dΠ =

(∂V

∂t+

1

2σ2S2∂

2V

∂S2

)dt (8.93)

104

Page 107: Course notes on Financial Mathematics

Now suppose we invest $ Π in riskless assets such as US Treasury bonds. Intime dt it follows that

dΠ = rΠdt (8.94)

If there are no arbitrage opportunities it follows that

rΠdt = r(V − ∆S)dt = r(V − S∂V

∂S)dt =

(∂V

∂t+

1

2σ2S2∂

2V

∂S2

)dt

i.e.

−∂V∂t

= rS∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (8.95)

the Black–Scholes equation for V(S,t).

Note the following:

• Vt + 12σ2S2VSS is the return on the portfolio V − SVS, and r(V −

SVS) is the return on a bank deposit of $ V − SVS. The no arbitragecondition implies that the two returns are equal, hence the Black–Scholes equation.

• We can also re-interpret eqn. 8.81 through eqns. 8.7 and 8.90 as:

C(S, T ) = ∆S + Π = Φ(y)S − Ee−rTΦ(y − σ√T )

so Φ(y)S is the $ amount invested in S and Ee−rTΦ(y − σ√T ) is the

$ amount borrowed to make an equivalent portfolio consisting of along position (holding a stock until expiry) in less than one share of Sfinanced partly by borrowing.

8.6 The Greeks

• Not surprisingly∆ = VS (8.96)

is called the delta of an option. It follows from eqn. 8.81 that thedelta for a European call is just Φ(y). The corresponding delta for aEuropean put is Φ(y) − 1.

105

Page 108: Course notes on Financial Mathematics

• The quantity

Y ≡ S

V= S

VSV

(8.97)

is called the elasticity of an option.

• The quantityΓ ≡ ∆S = VSS (8.98)

is called the gamma of an option.

• The quantityΘ ≡ −VT (8.99)

is called the theta of an option.

• Finally Vσ is called the vega and Vr the rho of an option.

For calls it follows that:

∆ = CS = Φ(y) > 0 (8.100)

CE = −e−rTΦ(y − σ√T ) < 0 (8.101)

Θ = −CT < 0 (8.102)

Cσ > 0, Cr > 0 (8.103)

There are similar expressions for puts.

8.7 Hedging

We introduced several examples of hedges in §1 in pricing calls and puts,and in this chapter we used so–called delta hedging in continuous tradingto obtain the Black–Scholes equation. The earlier hedges were one–timepurchases of shares of a stock S, the selling of puts, and the borrowing of $Ee−rT and the writing or one–time sale of calls. The second is dynamic–itmust be continuously monitored (and can incur losses due to transactionscosts).

106

Page 109: Course notes on Financial Mathematics

8.7.1 Delta hedging

One can use delta hedging to cover a position when writing an option, if onecan obtain a premium slightly above its the fair value, then one can trade inthe underlying to maintain a delta–neutral position until expiry. Recall thatΠ = V − ∆S so ΠS = VS − ∆ = 0 if ∆ = VS. More generally the delta of aportfolio Π is

∆Π =∂Π

∂S(8.104)

As we have seen in delta–hedging, the number of stocks purchased equals thedelta of the option

∆V ≡ ∂V

∂S

whence ∆Π = 0. As we saw earlier this choice eliminates the random com-ponent in the variation of the portfolio Π.

One can also carry out Γ–hedging, Θ–hedging, σ–hedging, and ρ–hedging bybalancing a portfolio Π with respect to the correct number of shares of theunderling S. Hedging can thus eliminate the short–term dependence of theportfolio on variations in S, T, σ, and r.

8.8 Implied volatility

Despite our assumption to the contrary, volatility is not constant over longperiods of time–so a direct measurement of σ is difficult. However, optionprices are quoted in the market, so that there exists an implied volatility.This is obtained by substituting in the Black–Scholes formula, eqn. 8.88, the(easily obtained) quoted values of r, S, E and T . We are left with the optionprice V as a function only of the volatility σ. But the vega of a call optionis positive hence V increases monotonically with σ. So the function V (σ) isinvertible. Thus given a quoted value of the option V one can deduce a valuefor the volatility σ. This is the implied volatility.

Interestingly σ also varies with E for fixed r,S, and T as shown in fig. 8.6.Evidently σ is not independent of E but increases with increasing |S − E|tracing out a curve called a volatility smile. This tells us that there is some-thing wrong with the Black–Scholes model which assumes that volatility isconstant.

107

Page 110: Course notes on Financial Mathematics

2.6 2.7 2.8 2.9 3.0

0.0

0.1

0.2

0.3

S

E.10-3

σ

Figure 8.6: Implied volatility σ as a function of the of the strike price E.

8.9 Dividends

Suppose now that shares of stock in a company continuously pay a constantdividend. Thus in a short time dt the stock pays out a dividend

D0Sdt

where D0 is a constant.

Let S− be the stock price just before it pays out and S+ the price just after.Suppose that S+ > S− − D0S

−dt, then we could buy S just before it paysout and then sell it just after, receiving S+ + D0S

−dt > S−. Conversely ifS+ < S− −D0S

−dt we could short S and then buy it back to earn a profitwithout risk. Arbitrage arguments therefore indicate that:

S+ = S− −D0S−dt (8.105)

It follows that the basic lognormal stochastic differential equation needs tobe modified to express the exponential decay of S, i.e.

dS

S= (µ−D0)dt+ σdW (8.106)

108

Page 111: Course notes on Financial Mathematics

and the associated portfolio becomes

Π = V − ∆S exp[−D0T ] (8.107)

so that

dΠ = dV − ∆(dS exp[−D0T ] +D0 exp[−D0T ]dt) (8.108)

These equations can then be used to derive the Black–Scholes equation withdividends as:

−∂V∂t

= (r −D0)S∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (8.109)

This equation also applies to short dated options on foreign currencies, inwhich case D0 = rf .

8.9.1 Effects of dividends on boundary conditions

• For a call option:C(S, 0) = max(0, S − E) (8.110)

C(0, T ) = 0 (8.111)

as before

• However, as S → ∞C(S, T ) → Se−D0T (8.112)

i.e. C becomes equivalent to S but without its dividend income.

One can also show that the Black–Scholes formula with dividends is:

C(S, T ) = Se−D0TΦ(y′) − Ee−rTΦ(y′ − σ√T ) (8.113)

where :

y′ =log[Se−D0T/Ee−rT

]

σ√T

+1

2σ√T (8.114)

109

Page 112: Course notes on Financial Mathematics

Chapter 9

American options

An American option differs from a European one in that exercise of theoption is permitted at any time during the lifetime of the option. It shouldtherefore have a higher value. Consider the value of P (S, T ) for a European

S0

P

T=0

T=1.5

1

Figure 9.1: P (S, T ) computed from the Black–Scholes formula at T = 0 andT = 1.5, for r = 0.1, σ = 0.2 and E = 1.

put, computed from the Black–Scholes formula, eqn. 8.88 for r = 0.1, σ =0.2, E = 1 as shown in fig. 9.1. Evidently there is a range of values of S forwhich

P (S, T ) ∼ Ee−rT − S < E − S = 1 − S (9.1)

110

Page 113: Course notes on Financial Mathematics

Assume S lies in this range, so that S < 1 and suppose that one couldexercise this option before expiry. Then arbitrage opportunities exist–onecould buy the put and exercise it by selling the stock for the strike price of$ 1 per share, and then buy the stock for $ S per share, to realize a netprofit of $ 1 − P − S > 0 without risk. In order to preclude such arbitrageopportunities we therefore require that

P (S, T ) ≥ max(0, E − S) (9.2)

i.e. eqn. 8.4. In fact eqn. 8.4 is true for American options, but not true forEuropean ones. Thus

PA(S, T ) 6= PE(S, T ) for S < E (9.3)

Consider now a call option on a dividend paying asset. For large enough Sit follows from eqn. 8.113 that

C(S, T ) → e−D0TS < S − E if D0 > 0 (9.4)

So once again arbitrage opportunities exist unless

C(S, T ) ≥ max(0, S − E) (9.5)

i.e. eqn. 8.2 andCA(S, T ) 6= CE(S, T ) for S > E (9.6)

In general if V (S, T ) is the value of an option at time T > 0 before expiry,then

VA(S, T ) > VE(S, T ) (9.7)

9.1 Boundary conditions for American options

It is clear from an examination of fig. 9.1 that for some value of S = SF < Eat each time to expiry T , the inequality in eqn. 9.2 is violated. More exactly,there exists a value SF (T ) such that

P (S, T ) > P (S, 0) = max(0, S(0) − E), S(T ) > SF

P (S, T ) < P (S, 0) = max(0, S(0) − E), S(T ) < SF (9.8)

Evidently the option should be held when S > SF and it should be exercisedwhen S < SF . The problem is that we do not know the value of SF a priori.It therefore generates a free boundary. Thus the pricing of American optionsrequires us to solve a free boundary problem for the Black–Scholes equation.

111

Page 114: Course notes on Financial Mathematics

9.2 The obstacle problem

To motivate the calculation we first consider a simple example of the freeboundary problem–the so–called obstacle problem. An elastic string is stretchedover a curved object between the endpoints a and b. We need to calculateits position–in particular where it first makes contact with the surface of theobject. Fig. 9.2 shows the problem.

elastic string

a b

obstacle

Figure 9.2: An elastic string in contact with the surface of an object.

There are several important points to note:

• the string must lie above or on the obstacle

• the string must have negative or zero curvature

• the string must be continuous

• the string slope must be continuous

When the string is in contact with the obstacle its position is known. Whenit is not in contact, it must satisfy an equation of motion which tells usthat it is straight. Since the string must lie on or above the obstacle itmust have zero or negative curvature (the obstacle can push but not pull).Finally the string and its slope must be continuous, even when the stringfirst loses contact with the object (justified by a local force balance–a lateralforce would be needed to create a kink). Thus even if the region of contactis not known a priori the problem has a unique solution–the string and itsslope are continuous, but its curvature, i.e. its second spatial derivative isdiscontinuous.

112

Page 115: Course notes on Financial Mathematics

The American option problem is similar–it can be uniquely specified by thefollowing set of constraints.

• the option value ≥ the payoff function

• the equality in the Black–Scholes equation is replaced by an inequality

• V is a continuous function of S

• VS is a continuous function of S

The first of these constraints is equivalent to saying that the arbitrage profitfrom early exercise of the option is never positive. The second implies thateither the value of V equals the payoff function, or else if it exceeds it, itsatisfies the Black–Scholes equation. These two statements can be combinedinto one inequality associated with the Black–Scholes formulation. The thirdconstraint follows from simple arbitrage. The fourth requires that VS be acontinuous function of S.

As an example consider the American put. We found earlier that thereexists an exercise boundary SF (T ) < E, so the slope of the payoff functionmax(0, E − S) at SF is just (E − S)S = −1. There are three possibilities forthe value of PS at S = SF :

• (a) PS < −1, (b) PS > −1, and (c) PS = −1

Evidently in case (a) for S > SF , P < E − S which contradicts eqn. 9.2. Infact case (a) cannot occur for a European put since PS = Φ(y)− 1 which liesbetween −1 and 0. However case (b) can occur. Recall that P (S, T ) satisfiesthe Black–Scholes equation with the condition at early exercise:

P (SF (T ), T ) = max(E − SF (T )) (9.9)

so the choice of SF (T ) affects the value of P (S, T ) for all S > SF (T ). Thusif PS > −1 at S = SF the value of P at S near SF can be increased bychoosing a smaller SF in which case PS decreases. Eventually PS → −1, theno arbitrage condition.

Another argument: Suppose S ∼ SF and consider the simple portfolio, longone share S and one put P .

Π = P + S

113

Page 116: Course notes on Financial Mathematics

then over a small interval dt

dΠ = dP + dS

Since P = E − S for S < SF , dP = −S and dΠ = 0, i.e. for a downward

move of S. Conversely, for an upward move, it follows from Ito’s lemma that

dΠ = σS(PS + 1)dW+ +O(dt) (9.10)

and therefore

〈dΠ〉 = σS(PS + 1)1

2〈|dW |〉+O(dt) (9.11)

However from eqn. 3.1 we have:

dW = z√dt

so it follows that

1

2〈|dW |〉 =

1

2〈|z|〉

√dt =

1

2

∫ ∞

−∞|z| exp

[−z

2

2

]dz√2π

√dt

where |z| = z for z > 0 and |z| = −z for z < 0.

But

1

2

∫ ∞

−∞|z| exp

[−z

2

2

]dz√2π

=

∫ ∞

0

z exp

[−z

2

2

]dz√2π

= −∫ ∞

0

d

dzexp

[−z

2

2

]dz√2π

=1√2π

whence1

2〈|dW |〉 =

√dt

so that

〈dΠ〉 = σS(PS + 1)1

2

√dt

2π+O(dt) (9.12)

We now use the fact that over a small interval dt the return on a risklessportfolio is rΠdt = O(dt), so the no arbitrage condition requires

〈dΠ〉 = O(dt)

114

Page 117: Course notes on Financial Mathematics

whencePS + 1 = 0 (9.13)

Now let us reconsider the second constraint–that no arbitrage leads leads toan inequality, i.e., the return on a portfolio is less than or equal to the returnon a bank deposit. Thus for an American put

∂P

∂t+ rS

∂P

∂S+

1

2σ2∂

2P

∂S2− rP ≤ 0 (9.14)

When it is optimal to hold the option the equality obtains and

P (S, T ) > max(0, E − S) (9.15)

Otherwise it is optimal to exercise the option so only the inequality ineqn. 9.14 obtains and

P (SF , T ) = max(0, E − SF ) (9.16)

(the obstacle is the solution). Fig. 9.3 shows the results of imposing suchconstraints.

S0

P

1

1

(a)

(b)

SF

Figure 9.3: (a) European put (b) American put; r = 0.1, σ = 0.4, E = 1, T = 6months.

Evidently we can divide the S–axis into two regions. Let

L ≡ ∂

∂t+ rS

∂S+

1

2σ2 ∂

2

∂S2− r (9.17)

Then

115

Page 118: Course notes on Financial Mathematics

• In region 1, S > SF , LP = 0 and it is optimal not to exercise the put

• In region 2, S < SF , LP < 0 and early exercise of the put is optimal. Tosee this suppose that P = E−S > 0, then since L(E−S) = −rE < 0,the return from the portfolio Pt +

12σ2PSS is less than the return on a

bank deposit of rP − rSPS.

The location of the free boundary, SF is determined by eqn. 9.9 and by aconsequence of eqn. 9.13, namely

∂SP (SF (T ), T ) = −1 (9.18)

Note that

• PS = −1 if P (SF (T ), T ) = max(E−SF (T )) is not implied by eqn. 9.18.We do not know a priori where SF (T ) is located. We need the extracondition that PS is a continuous function of S.

9.3 Dividends

We can analyze the case with dividends in the same way. Consider an Amer-ican call option–which satisfies eqn. 8.109 as long as exercise is not optimal:

−∂C∂t

= (r −D0)S∂C

∂S+

1

2σ2S2∂

2C

∂S2− rC

(8.109)

withC(S, 0) = max(0, S − E), C(S, T ) ≥ max(0, S − E) (9.19)

If there exists an optimal exercise boundary then for S = SF (T )

C(SF (T ), T ) = SF (T ) − E and∂

∂SC(SF (T ), T ) = 1 (9.20)

and eqn. 8.109 is valid only for C(S, T ) > max(0, S − E), otherwise theinequality holds.

To analyze this case we proceed as follows. Assume that

r > D0 > 0 (9.21)

116

Page 119: Course notes on Financial Mathematics

and letS

E= ex, te − t =

τ12σ2

(9.22)

andC = S − E + Ec(x, τ) (9.23)

The result is that eqn. 8.109 transforms into the forward equation:

∂c

∂τ= (κ2 − 1)

∂c

∂x+∂2c

∂x2− κ1c+ f(x) (9.24)

on τ > 0, −∞ < x <∞, with initial condition

c(x, 0) = max(0, 1 − ex) =

1 − ex if x < 00 otherwise

(9.25)

where

κ1 =r

12σ2, κ2 =

r −D0

12σ2

(9.26)

andf(x) = (κ2 − κ1)e

x + κ1 (9.27)

Figure 9.4 shows the details of these functions.

c(x,0)

x

(a)

x

f(x)

x0

(b)

Figure 9.4: (a) The function c(x, 0) (b) The function f(x). See text for details.

Suppose a free boundary exists at x = xF (τ). Then

c(xF (τ), τ) =∂

∂xc(xF (τ), τ) = 0

117

Page 120: Course notes on Financial Mathematics

andc(x(τ), τ) ≥ max(0, 1 − ex)

But f(x) > 0 for x < x0 where

x0 = logκ1

κ1 − κ2= log

r

D0> 0

and f(x) ≤ 0 for x ≥ x0.

Suppose there were no constraints and no free boundary. Look at the initialdata c(x, 0) for x > 0. Then c(x, 0) = cx(x, 0) = cxx(x, 0) = 0, so at expiryτ = 0, cτ = f(x). It follows that for 0 < x < x0, f(x) > 0 so c increases;whereas for x > x0, f(x) < 0 so c decreases. But this violates the constraintc ≥ max(0, 1 − ex) > 0 for x > 0. Thus if we were to hold the option inx > x0 the constraint would be violated and c < max(0, 1 − ex). This isimpossible for an American call. Thus there must exist an optimal exerciseboundary. In fact such a boundary must be located at x0, i.e.

xF (0+) = x0

since this is the only point consistent with the condition c(xF (0+), 0+) = 0.In terms of the original variables this corresponds to the condition

SF (0+) =r

D0E (9.28)

which is independent of the variance parameter σ. Thus immediately beforeexpiry the option should be exercised at values of S such that

D0S > rE (9.29)

• It follows from this condition that if D0 = 0 then SF (0) = ∞ and thereis no free boundary. Thus without dividends it is always optimal tohold an American call to expiry.

• The point

S =r

D0E

is such thatLmax(0, S − E) = 0

118

Page 121: Course notes on Financial Mathematics

9.4 A local analysis of the free boundary

How does the free boundary x = xF (τ) move away from the location xF (0) =x0? We cannot solve this problem exactly, but we can find an asymptoticsolution which is valid close to τ = 0. Consider eqn. 9.24 near x = x0, τ = 0.In such a case

f(x) = f(x0) + f ′(x0)(x− x0) +O((x− x0)2)

∼ f ′(x0)(x− x0) = −k1(x− x0)

We note that in a region where c changes rapidly, cxx cx, c so we can usethe approximate equation:

cτ = cxx − k1(x− x0) (9.30)

with c = cx = 0 on x = xF (τ) and xF (0) = x0. We can obtain an exactsolution to this local problem using a similarity solution in terms of:

c = τ 3/2c∗(ξ); ξ =x− x0√

τ(9.31)

withxF (τ) = x0 + ξ0

√τ (9.32)

Thus we need only find ξ0 to solve for xF (τ).

We first substitute eqns. 9.31 and 9.32 into 9.30 to obtain:

√τ

(3

2c∗ − 1

2ξc∗ξ

)=

√τc∗ξξ − k1(x− x0)

which reduces to:

c∗ξξ +1

2ξc∗ξ −

3

2c∗ = k1ξ (9.33)

with

c∗(ξ0) = c∗ξ(ξ0) = 0 (9.34)

How does c∗(ξ) behave as ξ → −∞? As x → −∞, both c and c changeslowly, not rapidly; in fact cxx → 0 as x → −∞, whence cτ ∼ −k1x soc ∼ −k1xτ and c∗ ∼ −k1(x− x0)τ

−1/2 = −k1ξ, as ξ → −∞.

119

Page 122: Course notes on Financial Mathematics

We first solve eqn. 9.33 for the homogeneous case, i.e.

c∗ξξ +1

2ξc∗ξ −

3

2c∗ = 0 (9.35)

One exact solution can be found by trial and error, namely:

c∗1(ξ) = ξ3 + 6ξ (9.36)

A second independent solution may be obtained by reduction of order. Let

c∗2(ξ) = c∗1(ξ)a(ξ) (9.37)

and solve the resulting first order ordinary differential equation for a(ξ).

To determine c∗2(ξ) we proceed as follows. We know from Abel’s theoremthat the Wronskian associated with eqn. 9.35 is

W (c∗1, c∗2)(ξ) ≡

∣∣∣∣c∗1 c∗2c∗′1 c∗′2

∣∣∣∣ = C exp

[−1

4ξ2

]

i.e.

c∗1c∗′2 − c∗2c

∗′1 = C exp

[−1

4ξ2

]

or

(ξ3 + 6ξ)c∗′2 − c∗2(3ξ2 + 6) = C exp

[−1

4ξ2

](9.38)

This is a first order linear ordinary differential equation with variable coeffi-cients. Because of its linearity we try a solution of the form:

c∗2 = A(ξ) exp

[−1

4ξ2

]+B(ξ)

∫ ξ

−∞exp

[−1

4s2

]ds (9.39)

where A(ξ) and B(ξ) are polynomials in ξ.

On substituting eqn. 9.39 into 9.38 we obtain

(ξ3 + 6ξ)

(A′ − 1

2ξA+B

)− A(3ξ2 + 6) = C

(ξ3 + 6ξ)B′ − B(3ξ2 + 6) = 0 (9.40)

The solution to the second equation is

B(ξ) = D(ξ3 + 6ξ)

120

Page 123: Course notes on Financial Mathematics

whence(A′ − 1

2ξA

)(ξ3 + 6ξ) − (3ξ2 + 6)A = C −D(ξ3 + 6ξ)2 (9.41)

Note that when ξ = 0 this reduces to A(0) = C/6. Inspection of eqn. 9.41suggests we try

A(ξ) = ξ2 +1

6C

whence

2ξ(ξ3 + 6ξ) −(

1

2ξ4 + 6ξ2 + 6

)(ξ2 +

1

6C) = −C −D(ξ6 + 12ξ4 + 36ξ2)

Evidently D = 1/2 and therefore C = 24, so that:

c∗2(ξ) = (ξ2 + 4) exp

[−1

4ξ2

]+

1

2(ξ3 + 6ξ)

∫ ξ

−∞exp

[−1

4s2

]ds (9.42)

• There is a misprint on p. 117 of Wilmott et.al.

• The Wronskian W (c∗1, c∗2)(ξ) = exp

[−∫

12ξdξ]

= exp[−1

4ξ2]

[see Boyce& Di Prima: Elementary Differential Equations and Boundary Value

Problems Chapter 3] whence

(c∗2c∗1

)′= C

exp[−1

4ξ2]

(c∗1)2

(9.43)

for c∗1 and c∗2 given by eqns. 9.36 and 9.42 where C = −24.

It follows that

c∗h(ξ) = A(ξ3 + 6ξ)

+B

((ξ2 + 4) exp

[−1

4ξ2

]+

1

2(ξ3 + 6ξ)

∫ ξ

−∞exp

[−1

4s2

]ds

)

(9.44)

solves the homogeneous equation with arbitrary constants A and B.

121

Page 124: Course notes on Financial Mathematics

To solve the inhomogeneous problem, eqn. 9.33 we note that

c∗p(ξ) = −k1ξ (9.45)

is a particular solution, so the general solution of eqn. 9.33 is:

c∗(ξ) = −k1ξ + A(ξ3 + 6ξ)

+B

((ξ2 + 4) exp

[−1

4ξ2

]+

1

2(ξ3 + 6ξ)

∫ ξ

−∞exp

[−1

4s2

]ds

)

(9.46)

We can determine the value of the coefficients A and B as follows. We firstnote that as ξ → −∞, c∗2(ξ) → 0, but c∗1(ξ) → ∞ like ξ3. Since c∗(ξ) → −k1ξas ξ → −∞, we need A = 0.

In addition it follows from eqn. 9.34 that:

c∗(ξ0) = 0 = −k1ξ0 +Bc∗2(ξ0)

i.e.Bc∗2(ξ0) = k1ξ0 (9.47)

and

c∗ξ(ξ0) = 0 = −k1 +Bd

dξc∗2(ξ0)

whence

Bd

dξc∗2(ξ0) = k1 (9.48)

It follows from eqns. 9.47 and 9.48 that

ξ0d

dξc∗2(ξ0) = c∗2(ξ0) (9.49)

whence from eqn. 9.42:

3ξ20 exp

[−1

4ξ20

]+

1

2(3ξ3

0 + 6ξ0)

∫ ξ0

−∞exp

[−1

4s2

]ds

= (ξ20 + 4) exp

[−1

4ξ20

]+

1

2(ξ3

0 + 6ξ0)

∫ ξ0

−∞exp

[−1

4s2

]ds

122

Page 125: Course notes on Financial Mathematics

which reduces to:

ξ30 exp

[1

4ξ20

]∫ ξ0

−∞exp

[−1

4s2

]ds = 2(2 − ξ2

0) (9.50)

Eqn. 9.50 is not algebraic but transcendental. It can be solved numericallyfor ξ0. The result is:

ξ0 = 0.9034 · · · (9.51)

then eqn. 9.47 determines B, hence c∗(ξ) and c(x, τ) etc. Thus we haveobtained a local solution to the value of an American call near the exerciseboundary and near expiry.

The final step is to note that:

SF (0+) =r

D0E

(9.28)

Thus the local analysis indicates that as τ → 0

SF (τ) ∼ r

D0E

(1 + ξ0

√1

2σ2τ + · · ·

)(9.52)

The effect of the rapid changes in SF (τ) extends non–locally, as shown below:

c(x,τ)

x

(a)

0 S0 E

C(S,τ)

(b)

Figure 9.5: (a) The function c(x, 0) (b) The function f(x). See text for details.

123

Page 126: Course notes on Financial Mathematics

Chapter 10

Dividends revisited

We now return to a consideration of dividends. We consider only determin-istic dividend payments, either continuous or discrete in time. Continuouspayments are associated with indices comprising a large number of shares,e.g. the FT-SE index; whereas discrete dividend payments are associatedwith a single equity.

Suppose that in time dt the underlying pays out a dividend

D(S, t)dt

then the basic random walk for S:

dS = µSdt+ σSdW (3.19)

is modified to:dS = (µS −D(S, t))dt+ σSdW (10.1)

It can be shown that this leads to the modified Black–Scholes equation for aEuropean option:

−∂V∂t

= (rS −D)∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (10.2)

10.1 Fixed dividend payments

Consider first fixed dividend payments.

D(S, t) = D(t)S (10.3)

124

Page 127: Course notes on Financial Mathematics

and let

S = S exp

[∫ te

t

D(s)ds

](10.4)

Then Ito’s lemma applied to eqns. 10.1 and 10.3 gives the stochastic differ-ential equation:

dS

S= µdt+ σdW (10.5)

Thus we recover the lognormal form for S by discounting the asset price Sby the factor

exp

[∫ te

t

D(s)ds

]

• If D(t) = D0 the asset price is discounted by

exp

[∫ te

t

D0ds

]= exp [D0T ]

The asset price therefore drops continuously, i.e., it decays as

S(t) exp [−D0T ]

as expected

• Of course the holder of the asset receives dividend payments of theform:

S(t) (1 − exp [−D0T ])

so the total value of the asset to the holder is S(t) otherwise therewould be arbitrage opportunities.

• When the dividend payment is discrete of the form:

Dyδ δ(t− td)S

then

S = S exp

[∫ te

t

Dyδ δ(s− td)ds

]

= S exp [DyδH(t− td)] (10.6)

and the asset is discounted by

exp [DyδH(t− td)] (10.7)

125

Page 128: Course notes on Financial Mathematics

• If a company pays out 50% of S at time td this discretely paid dividendyield gives gives exp[−Dy

δ ] = 0.5, Dyδ = log 2.

• Let t−d be the time just before the time dividend payment, and t+d thetime just after. Then the asset jumps from a value S(t−d ) to

S(t+d ) = S(t−d ) exp[−Dyd] = 0.5S(t−d ) (10.8)

if Dyδ = log 2.

10.2 Jump conditions

It follows from the above that S(t) follows a random walk with intermittentdownward jumps whenever there is a discrete dividend payment. Jump con-ditions relate S(t−d ) to S(t+d ). What happens to the value of an option on Sthrough a jump? Let this value be V (S, t). Evidently in order to eliminatearbitrage opportunities

V (S(t−d ), t−d ) = V (S(t+d ), t+d ) (10.9)

i.e. the option value should be continuous in time, for any realization of theasset’s random walk. Combining this with eqn. 10.8 we see that:

V (S, t−d ) = V (S exp[−Dyδ ], t

+d ) (10.10)

Thus V is continuous across a dividend payment date even though S isdiscontinuous.

• Since the holder of the option doesn’t receive any benefit from thedividend payment, this must (somehow) be reflected in its price. Infact the effect of dividend payments and the jumps in asset prices ispropagated by the Black–Scholes equation throughout the life of theoption.

Suppose the discrete dividend payment takes the form Dδ(S) then eqn. 10.1gives:

dS = (µS −Dδ(S)δ(t− td))dt+ σSdW (10.11)

126

Page 129: Course notes on Financial Mathematics

Thus across a dividend date:

∫ S+

S−

dS

Dδ(S)=

∫ t+d

t−d

[(µS −Dδ(S)δ(t− td)) + σSdW ]Dδ(S)−1dt

Since t+d − t−d = O(ε) the only term in the right hand side of the aboveexpression which survives is:

−∫ t+

d

t−d

δ(t− td)dt = −1

whence ∫ S+

S−

dS

Dδ(S)= −1 (10.12)

• Note that S+ < S−

• Once again eqn. 10.9 applies across the dividend date.

Now suppose thatD(S, t) = Dp

δδ(t− td)

Then across t = td we have:

S+ = S− −Dpδ (10.13)

In such a case S can go negative. This is to be avoided, for example bychoosing

D(S, t) = DyδSδ(t− td)

as before, or in general:

D(S, t) = Dδ(S)δ(t− td)

in which case the drop in S is given by

∫ S+

S−

dS

Dδ(S)= −1

and we need a condition on Dδ(S) to guarantee that S+ ≥ 0. Since

127

Page 130: Course notes on Financial Mathematics

∫ S−

S+

dS

Dδ(S)= +1

it follows that: ∫ S−

0

dS

Dδ(S)= 1 +

∫ S+

0

dS

Dδ(S)(10.14)

Now if S+ = 0 the integral on the right hand side vanishes, so:

∫ S−

0

dS

Dδ(S)= 1

But if, for example Dδ(S) → O(S) as S → 0 then

∫ S−

0

dS

Dδ(S)∼∫ S−

0

dS

S= ∞

which leads to a contradiction, so S+ 6= 0. In fact S+ must be > 0. So asufficient condition for S+ to be positive is that

∫ S−

0

dS

Dδ(S)

be unbounded for all S− > 0, for example:

D(S, t) → O(S), S → 0

10.3 An alternative derivation of the jump

condition

With discrete dividends the modified Black–Scholes equation becomes:

−∂V∂t

= (rS −Dδ(S)δ(t− td))∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (10.15)

• Evidently when t 6= td this is just the Black–Scholes equation withoutdividends

128

Page 131: Course notes on Financial Mathematics

• However when t = Td the term

−Dδ(S)δ(t− td)∂V

∂S

becomes large and must be balanced by another term in the equation–infact by the term Vt.

Thus near t = td eqn. 10.15 reduces to the first order partial differentialequation:

∂V

∂t−Dδ(S)δ(t− td)

∂V

∂S= 0 (10.16)

which can be rewritten in the form:

(1,−Dδ(S)δ(t− td)) (Vt, VS) = 0

i.e. as~u ∇V = 0

where ~u is the vector (1,−Dδ(S)δ(t − td)) and ∇V ≡ gradV . But ~u ∇ isthe directional derivative in the direction ~u. Thus

dV = Vtdt+ VSdS = Vt −Dδ(S)δ(t− td)VS = 0

So V is constant in the direction given by the equation:

dS

dt= −Dδ(S)δ(t− td) (10.17)

This equation can be integrated to give

∫ S(td)

S(t)

dS

Dδ(S)=

∫ td

t

δ(s− td)ds = H(t− td) (10.18)

Eqn. 10.18 is closely related to eqn. 10.12 which defines the downward jumpin S. Thus V (S, t) is constant across the jump in S.

• As S follows a continuous random walk in time, the value of the optionV (S, t) meanders smoothly on the surface V above the (S, t) plane. Ata discontinuity in S from S(t−) to S(t+), V remains constant otherwisethere exist arbitrage opportunities; however eqn. 10.10 implies that Vis only a piecewise continuous function of S.

129

Page 132: Course notes on Financial Mathematics

10.4 The meaning of jump conditions

Consider the ordinary differential equation

d

dtv(t) = I(t)

Evidently

v(t) =

∫ t

te

I(s)ds

so if I(t) has a finite jump, for example:

I(t) = H(t− td)

then

v(t) =

∫ t

te

H(s− td)ds = R(t− td)

where

R(t) =

0, t < tdt, t > td

Note that v is continuous at t = td, i.e.

v(t−d ) = v(t+d )

for t−d , t+d within ε–neighborhoods of td.

We now note that eqn. 10.15 can be written in the form:

−∂V∂t

= L∗BS (10.19)

and that the change in S at td can be solved as:∫ S(t)

S(td)

dS

Dδ(S)= −H(t− td) (10.20)

to give a finite jump in S. It follows from eqn. 10.19 that

V (t) =

∫ t

te

L∗BSV (s)ds (10.21)

so that for a finite jump in S at td

V (S−, t−d ) = V (S+, t+d ) (10.22)

i.e. even though V changes in response to jumps in S it still remains contin-uous at td. Such a smoothing of V in response to jumps in S is characteristicof a differential equation like the Black–Scholes equation.

130

Page 133: Course notes on Financial Mathematics

Chapter 11

A generalization

As we have seen in case D(S, t) = D0S the Black–Scholes equation can bewritten as:

−∂V∂t

= (r −D0)S∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV

(8.109)

with final condition for a call

C(S, T ) = max(0, S − E)(8.110)

and boundary conditionsC(0, T ) = 0

(8.111)

andC(S, T ) → Se−D0T

(8.112)

as S → ∞, where T = te − t.

Now letS = S exp [D0T ] (11.1)

Then eqn. 8.109 reduces to

−∂V∂t

= rS∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (11.2)

which is just the original form of the Black–Scholes equation, eqn. 8.23.

131

Page 134: Course notes on Financial Mathematics

• It follows that the formulas for European puts and calls can be eas-ily derived from the zero dividend case via the substitution given ineqn. 11.1.

• The discounting in eqn. 11.1 reflects the fact that the holder of S re-ceives the dividend income, whereas the holder of V does not.

In similar fashion ifD(S, t) = D(t)S (11.3)

Then

C(S, T ) → S exp

[−∫ te

t

D(s)ds

](11.4)

as S → ∞. So now we use

S = S exp

[−∫ te

t

D(s)ds

](11.5)

in the formulas.

11.1 Interest rate and volatility known func-

tions of time

In case r and σ are known functions of time eqn. 10.2 becomes:

−∂V∂t

= (r(t) − D(t))S∂V

∂S+

1

2σ2(t)S2∂

2V

∂S2− r(t)V (11.6)

LetS = S exp[α(t)], V = V exp[β(t)], t = γ(t) (11.7)

On performing these substitutions in eqn, 11.6 we obtain:

γ(t)∂V

∂t+

1

2σ(t)2S2 ∂V

∂S2

+ (r(t) − D(t) + α(t))S∂V

∂S− (r(t) + β(t))V = 0 (11.8)

where ˙≡ d/dt. We can eliminate the coefficients of VS and V via the choices:

132

Page 135: Course notes on Financial Mathematics

α(t) =

∫ te

t

(r(s) − D(s))ds, β(t) =

∫ te

t

r(s)ds (11.9)

and the remaining t–dependence via

γ(t) =

∫ te

t

σ2(s)ds (11.10)

The result is the simple diffusion equation

∂V

∂t=

1

2S2∂

2V

∂S2(11.11)

If V (S, T ) is any solution of this time–independent equation (which containsno reference to σ, r, or D) then the corresponding solution of eqn. 11.8 inthe original variables is:

V = exp[−β(t)]V (S exp[α(t)], γ(t)) (11.12)

Now let VBS be any solution of eqn. 8.23 with constant r and σ and zerodividend payments. It now follows that:

VBS = exp[−rT ]VBS(S exp[rT ], σ2T

)(11.13)

for some function VBS.

Thus we can immediately write down a solution for r(t), σ2(t) and D(t) 6= 0in terms of VBS via the replacements:

r →< r(t) >=1

T

∫ te

t

r(s)ds, σ2 →< σ2(t) >=1

T

∫ te

t

σ2(s)ds (11.14)

and the discounting

S → S exp

[−∫ te

t

D(s)ds

](11.15)

11.2 Trading volatility

In practice σ is neither constant nor predictable for T larger than a fewmonths. So one trading strategy is to use implied volatility–i.e. givenS,E, r, T and the quoted value of V we can deduce σ from the Black–Scholes

133

Page 136: Course notes on Financial Mathematics

formula, or we can calculate σs from the prices of all options on S at thesame T and then buy that one with the lowest σ, and sell that one with thehighest σ. The expectation is that prices move so that implied volatilitesbecome comparable, and the portfolio will make a profit. More sophisticatedmodelling involves using stochastic volatility. However it is difficult to hedgethe randomness due to stochastic changes in the volatility σ.

134

Page 137: Course notes on Financial Mathematics

Chapter 12

Exotic options

An exotic option is something more complicated than a “vanilla” call or put.

• A path–dependent option is such that its payoff at exercise or expirationdepends on the past history of S, as well as on S(0) and E. Example:An American call or put.

• A simple exotic option is the binary or digital option—the payoff de-pends upon whether S(0) > E (for a call) or S(0) < E (for a put).

• Another simple exotic option is the barrier option in which the rightto exercise the option is forfeited if S crosses a barrier SB (an “out”barrier option) or else it is enabled if S crosses SB (an “in” barrieroption). Examples include the “down–and–out” barrier option, or the“up–and–in” barrier option.

Usually exotic options are not quoted on an exchange, but are traded “over–the–counter” by options brokers. Table 12.1 shows a list of common exoticoptions, some of which are path–dependent. There are many other exoticoptions.

135

Page 138: Course notes on Financial Mathematics

Option Exotic Path–Dep

binary •compound •chooser •barrier • •Asian • •barrier • •

Table 12.1: Common exotic or path–dependent options

• Binary options: A binary option differs from a vanilla option in thatthe payoff at expiry (or exercise if it is an American option) can be anarbitrary non–negative function of S.

Let the payoff function be Λ(S). Then from eqn 7.16 we have:

V (S, T ) =exp[−rT ]

σ√

2πT

∫ ∞

0

Λ(S ′) exp

[−(log S ′/S − (r − 1

2σ2)T )2

2σ2T

]dS ′

S ′

(12.1)from the Feynman–Kac formula, for the value of a European binaryoption. This includes vanilla puts and calls as special cases. Thisassumes r and σ constant and D = 0, but we can easily extend it usingeqns. 11.14 and 11.15, if r(t) and σ(t) are known, and D(t) 6= 0.

Two popular binary options are (a) the “cash–or–nothing call”:

Λ(S) = BH(S − E) (12.2)

and (b) the “supershare”:

Λd(S) =1

d[H(S − E) −H(S − E − d)] (12.3)

Fig. 12.1 shows the details of these payoff functions. Evidently

limd→0

Λd(S) = Λ0(S) = δ(S − E)

These options are easy to value, but difficult to hedge near T = 0 sincethey are discontinuous. Since the option price V depends on the payofffunction Λ(S) it follows that:

∆ =∂

∂SV (Λ(S)) = VΛΛS

136

Page 139: Course notes on Financial Mathematics

E S

B

Λ

E E + d

1 / d

S

Λ

(a) (b)

Figure 12.1: (a) Cash–or–nothing call (b) Supershare

and therefore for a cash–or–nothing call we have:

∆ = BVΛδ(S − E) (12.4)

So the delta is zero if S 6= E and one need not hedge the portfolio nearT = 0. However if S ∼ E near T = 0 then there is a high probabilitythat S will cross E, perhaps many times, so that ∆ will vary fromzero to unbounded many times near T = 0. It is not possible to hedgein such a situation, and therefore to value such an option with theBlack–Scholes formula.

In some cases American binary options are simple to value. Supposethe payoff function is that given in eqn. 12.2, and that the payoff forearly exercise is also Λ(S). In such a case one would exercise as soonas S > E. There is a clear disadvantage in not doing so:

– S might fall back to a value less than E

– There is a loss of potential interest on the payoff which is notcompensated for by the prospect of a larger payoff

Thus the free boundary is always at S = E.

To value V (S, T ) is straightforward. If S > E, V = B. If S < E,its value is given by the Black–Scholes formula with V (E, T ) = B

137

Page 140: Course notes on Financial Mathematics

and V (0, T ) = 0 and V (S, 0) = 0. The solution is shown in fig. 12.2.Evidently ∆ = VS is discontinuous for the American binary option.

E S

B

Λ

0

American

European

Figure 12.2: Cash or nothing call: E = 10, B = 5, r = 0.1, σ = 0.4, D0 = 0.2

This is not really an exception to the usual smoothness condition–thepayoff is itself discontinuous, i.e. the equivalent obstacle itself has adiscontinuity, so arbitrage arguments break down.

• Compound options: A compound option is an option on an option, e.g.a call on a call, a call on a put etc. Let T1 be the time at which wecan exercise the compound option (if we wish) to buy the underlyingvanilla option for E1. This vanilla option may exercised at T2 for anamount E2 in return for an asset worth S. Such a compound option issimple to value via the Black–Scholes formula: (a) given expiry at T2

we find the value of the call we buy at T1. This has strike price E2 andexpires at T2. Let its value be C1(S, T1). (b) If C1(S, T1) > E1 then wewould exercise our compound option and buy the underlying call. IfC1(S, T1) < E1 then we would not buy the underlying call. Thus thepayoff for the compound option at T1 is given by the function:

max(0, C1(S, T1) − E1) (12.5)

Because the compound option’s value is governed only by the randomwalk of S it satisfies the Black–Scholes equation with final condition:

C(S, 0) = max(0, C1(S, T1) − E1) (12.6)

138

Page 141: Course notes on Financial Mathematics

Thus we can determine C(S, T ) for times before T1, i.e for T = T1−t >0.

• Chooser options: A chooser or “as–you–like–it” option gives its holderthe right to buy for an amount E1 at T1 either a call or a put withstrike price E2 at T2. It can be valued similarly to a compound option.(a) Solve the underlying option problems, one for the call, the otherfor the put, to give C1(S, T ) and P1(S, T ) and use these values at T1 asfinal data for the first option problem. (b) Evidently if C1(S, T1) > E1

or P1(S, T1) > E1 then we would exercise the chooser option and buythe more valuable one. If neither exceeds E1 then we would not buyeither option. Thus the payoff function for the chooser option at T1 isjust

max (0, C1(S, T1) − E1, P1(S, T1) − E1) (12.7)

All we need do is solve the usual Black–Scholes equation with finaldata:

C(S, 0) = max (0, C1(S, T1) − E1, P1(S, T1) − E1) (12.8)

• Barrier options: There are four kinds (a) “Up–and–in”, (b) “Down–and–in”, (c) “Up–and–out”, and (d) “Down–and–out”. Such optionsactivate or expire when a threshold or barrier S = STh is reached. Theycan be either puts or calls. In some cases a rebate is provided, usuallya fixed amount to the holder, if the barrier is reached.

• Asian options: These are the first fully path–dependent exotic optionsto consider. They have payoffs which depend on the past history ofthe random walk in S via some sort of average. For example, one suchoption is the average strike call, whose payoff is:

max(0, S(0)− < S >) (12.9)

where

< S >=1

τ

∫ te

t0

S(s)ds (12.10)

where τ = te−t0 is the trading period Several factors affect this average:

– the period

139

Page 142: Course notes on Financial Mathematics

– arithmetic or geometric averaging–< S > or exp < logS >

– weighted or unweighted averaging–

1

τ

∫ te

t0

S(s)ds or

∫ te

t0

w(t− s)S(s)ds

– discrete or continuous sampling of S–it is easier to average over asmaller number of samples

• Lookback options: Such an option has a payoff which depends not onlyon S(te) but on Smax and Smin over the trading period τ , e.g.

max(0, Smax − S) (12.11)

12.1 A unifying framework

Consider first a fairly general class of European options with payoff dependingon S and on: ∫ te

t0

f(S(s), s)ds (12.12)

where f is a given function of S and t. The integration is taken over thepath of S from s = t0 to s = te.

Example: the average strike call option. The payoff function is

max

(0, S − 1

τ

∫ te

t0

S(s)ds

)(12.13)

Let

I =

∫ t

t0

f(S(s), s)ds (12.14)

Since S is Markovian we can treat I, S and t as independent variables so thatdiffering realizations of the random walk lead to different values of I. Thusthe value of an exotic path–dependent option can be written as V (S, I, t)with I as a second random variable. It follows that:

I(t+ dt) = I + dI =

∫ t+dt

t0

f(S(s), s)ds

140

Page 143: Course notes on Financial Mathematics

so to O(dt)

I + dI =

∫ t

t0

f(S(s), s)ds+ f(S(t), t)dt

whencedI = f(S, t)dt (12.15)

Note that there is no term involving dW , i.e. there is no random componentin this equation.

We now apply Ito’s lemma to V (S, I, t) to give:

dV =

(∂V

∂t+ µS

∂V

∂S+

1

2σ2S2∂

2V

∂S2+ f(S, t)

∂V

∂I

)dt+ σS

∂V

∂SdW (12.16)

This option is European, so we can set up a risk–free portfolio comprisingone such option and a short position with −∆ shares of S, with ∆ = VS. Asusual arbitrage considerations lead to a modified Black–Scholes equation ofthe form:

−∂V∂t

= (rS −D(S, I, t))∂V

∂S+

1

2σ2S2∂

2V

∂S2+ f(S, t)

∂V

∂I− rV (12.17)

where D(S, I.t) is a path–dependent dividend payment.

Boundary conditions depend on the particular form of the option, but thefinal condition is more general: At expiry we know that:

V (S(0), I) = Λ(S(0), I)

where Λ is the given payoff function. Thus for the average strike call

I =

∫ t

t0

S(s)ds

and therefore:

Λ(S(0), I) = max(0, S(0) − I(te)

τ) (12.18)

This analysis can now be extended to American options–e.g. an Americanversion of the average strike option. In such a contract the payoff on earlyexercise is specified in advance. Suppose it takes the form:

max(0, S − I

t− t0)

141

Page 144: Course notes on Financial Mathematics

Note thatI

t− t0=< f(S, t) > (12.19)

is the running average of f .

Let

LEX ≡ ∂

∂t+ (rS −D(S, I, t))

∂S+

1

2σ2S2 ∂

2

∂S2+ f(S, t)

∂I− r (12.20)

This operator is a generalization of the Black–Scholes operator:

LBS ≡ ∂

∂t+ rS

∂S+

1

2σ2S2 ∂

2

∂S2− r (12.21)

and measures the difference between the rates of return onn a risk free ∆–hedged portfolio and a bank deposit of equivalent value. Such a differencecannot be positive, i.e.

LEXV ≤ 0 (12.22)

In addition arbitrage restrictions give:

V (S, I, t) ≥ Λ(S, I, 0) (12.23)

• If LEXV = 0, then it is not optimal to exercise the option. This canonly be the case if it is more valuable held than exercised, i.e. V > Λ.

• Conversely if LEX < 0 it is optimal to exercise the option. This canonly be the case if V = Λ

12.2 Discrete sampling

We saw earlier that discrete dividends players lead to jump conditions. Thesame outcome holds for discrete sampling. Let ti, i = 1, 2, · · · , N denote Ndiscrete sampling times. VEX will now depend on the discrete sum

I =N∑

i=1

fi(S(ti)) (12.24)

for some function f .

142

Page 145: Course notes on Financial Mathematics

Example: An Asian option with a payoff function that depends on

1

N

N∑

i=1

S(ti)

Thus V depends on

I =

j(t)∑

i=1

S(ti)

where j(t) is the largest integer such that tj(t) ≤ t. With this definition of I,across any sampling date, V must satisfy the jump condition:

V (S, I, t−i ) = V (S, I + S, t+i ) (12.25)

This follows either from arbitrage considerations or from a mathematicalargument similar to that in § 10.3. This latter argument follows from theequivalence:

N∑

i=1

S(ti) =

∫ t

t0

S(s)N∑

i=1

δ(s− ti)ds (12.26)

so thatN∑

i=1

fi(S(ti)) =

∫ t

t0

f(S(s), s)N∑

i=1

δ(s− ti)ds (12.27)

12.3 Barrier options

Barrier features may be applied to any option. Consider for example theEuropean down–and–out call shown in fig. 12.3: The payoff is just max(0, S−E) and C = 0 if S < X. We consider only the case E > X. As long asS > X, C follows the Black–Scholes equation with final data C(S, 0) =max(0, S−E). As S increases the likelihood of a barrier crossing diminishes,so that C(S, T ) ∼ S as S → ∞. However the second boundary condition isapplied at S = X, where:

C(X, T ) = 0 (12.28)

To solve this problem we make the usual change of variables:

S

E= exp x, T = te − t =

τ12σ2

(12.29)

143

Page 146: Course notes on Financial Mathematics

S

t

X

Figure 12.3: The setup for a down–and–out European call. C = 0 when S < X.

andC = E exp[αx+ γτ ]u(x, τ) (12.30)

where

α = −1

2(κ1 − 1), γ = −1

4(κ1 + 1)2, κ1 =

r12σ2

(12.31)

Eqn. 8.23 then reduces to the simple forward diffusion equation:

∂u

∂τ=∂2u

∂x2(12.32)

with initial data:

u(x, 0) = max

(0, exp

[1

2(κ1 + 1)x

]− exp

[1

2(κ1 − 1)x

])

= u0(x), x ≥ logX

E(12.33)

and boundary conditions:

u(x, τ) ∼ exp[(1 − α)x− γτ ] as x → ∞ (12.34)

and

u

(log

X

E, τ

)= 0 (12.35)

We recognize this last boundary condition as an example of the absorbingboundary condition introduced in § 4.1.

144

Page 147: Course notes on Financial Mathematics

To solve this problem we use the method of images introduced by Kelvin innineteenth century electrostatics. The problem is also equivalent to the flowof heat in a semi–infinite bar extending from x = 0 to x = ∞. We thereforefirst consider the problem on the domain D = (0,∞) for τ > 0 with initialdata u(x, 0) = φ(x) and boundary condition u(0, τ) = 0. We introduce theodd function:

ψ(−x) ≡ −ψ(x), ψ(0) = 0 (12.36)

shown in fig. 12.4 and let φODD(x) be the (unique) extension of the initial

x- x

ψ ( x )

Figure 12.4: The odd function ψ(x).

data u(x, 0) = φ(x) to the domain (−∞,∞), i.e.

φODD(x) =

φ(x), x > 0−φ(−x), x < 00, x = 0

(12.37)

Now let u(x, τ) be the solution of

∂u

∂τ= κ

∂2u

∂x2, u(x, 0) = φODD(x) (12.38)

on (−∞,∞), 0 < t <∞, i.e.

u(x, τ) =

∞∫

−∞

S(x− y, τ)φODD(y)dy (12.39)

145

Page 148: Course notes on Financial Mathematics

where

S(x, τ) =1

2√πκτ

exp

[− x2

4κτ

](12.40)

Its restriction:v(x, τ) = u(x, τ), x > 0 (12.41)

is the unique solution of the half–line problem. To see this note that u(x, τ)is odd since S is even and φODD is odd. So u(−x, τ) = −u(x, τ) whenceu(0, τ) = 0. Its restriction v(x, τ) automatically solves the half–line problemsince it equals u(x, τ) for x > 0 and therefore satisfies eqn. 12.38 and therequired initial and boundary conditions.

It follows that

u(x, τ) =

∫ ∞

0

S(x− y, τ)φ(y)dy−∫ 0

−∞S(x− y, τ)φ(−y)dy

=

∫ ∞

0

[S(x− y, τ) − S(x + y, τ)]φ(y)dy (12.42)

so for 0 < x <∞, 0 < τ <∞

v(x, τ) =1

2√πκτ

∫ ∞

0

[exp

[−(x− y)2

4κτ

]− exp

[−(x + y)2

4κτ

]]φ(y)dy (12.43)

This is known as Kelvin’s method of images since the Green’s function S(x, τ)is reflected about the axis x = 0. We use this method to solve eqn. 12.32with the given domain and absorbing boundary condition. We first note thatthe equation is invariant under translations and reflections of the coordinateframe, so that if u(x, τ) is a solution, then so are u(x±x0, τ) and u(−x±x0, τ)for any constant x0. We use this property to satisfy the absorbing barriercondition, eqn. 12.35, by first solving eqn. 12.32 on D = −∞ < x < ∞ interms of two infinite problems, with equal and opposite initial data, so thatat x = 0 (the reflection point) the net effect of initial and later data cancels.In the present case the reflection point is at x = x0 = logX/E. Thus wehave to solve eqn. 12.32 not on

logX

E< x <∞

146

Page 149: Course notes on Financial Mathematics

but on −∞ < x <∞ with:

u0(x) = max(0, e(κ1+1)x/2 − e(κ1−1)x/2

), x > logX/E

−max(0, e(κ1+1)(logX/E−(1/2)x)/2 − e(κ1−1)(logX/E−(1/2)x/2)

),

x < logX/E

(12.44)

It now follows from eqn. 12.30 that the solution of the European call withno barriers is:

C(S, T ) = E exp[αx + γτ ]u1(x, τ) (12.45)

where u1(x, τ) solves the unrestricted problem on −∞ < x <∞. Thus:

u1(x, τ) = exp[−(αx + γτ)]C(S, T )

E(12.46)

and let the solution of the down–and–out barrier option problem be:

VDAO(S, T ) = E exp[αx+ γτ ](u1 + u2)(x, τ) (12.47)

where u2(x, τ) solves the unrestricted problem on −∞ < x < ∞ with ini-tial data antisymmetric to that generating u1. We now use the invariance

properties of eqn. 12.32. Since we require u1 + u2 = 0 at the reflection pointx0 = logX/E it follows that:

u2(x, τ) = −u1(2 logX

E− x, τ) = − exp[−(α(2 log

X

E− x) + γτ)]

C(S, T )

E

= − exp[−(α(2 logX

E− log

S

E) + γτ)]

C(X2

S, T )

E(12.48)

since

x→ 2 logX

E− x ≡ S → X2

S(12.49)

It follows that

VDAO(S, T ) = C(S, T ) −(S

X

)−(κ1−1)

C(X2

S, T ) (12.50)

EvidentlyVDAO(X, T ) = 0 (12.51)

147

Page 150: Course notes on Financial Mathematics

S

t

X

Figure 12.5: The setup for a down–and–in European call. C = 0 when S > X.

and one can also show that the expression given above solves eqn. 8.23 withthe given final data.

As a second example we consider the European down–and–in call. Theoption price still satisfies the basic Black–Scholes equation:

LBSVDAI(S, T ) = 0 (12.52)

All we have to do is determine the correct final and boundary conditions.Evidently:

VDAI(S, T ) = 0 as S → ∞If S > X right up to expiry, then the option expires as worthless, so the finalcondition is:

VDAI(S, 0) = 0

If S = X before expiry the option becomes an ordinary vanilla call, i.e:

VDAI(X, T ) = C(X, T )

If S < X then:VDAI(S, T ) = C(S, T )

which (by assumption) we know. So we only have to solve the problem incase S > X with the above conditions.

LetVDAI(S, T ) = C(S, T ) − V (S, T ) (12.53)

148

Page 151: Course notes on Financial Mathematics

Since the Black–Scholes equation is linear V (S, T ) must be a solution subjectto the final condition:

V (S, 0) = C(S, 0) − VDAI(S, 0) = max(0, S − E)

and boundary conditions:

V (S, T ) = C(S, T ) − VDAI(S, T ) ∼ S − 0 = S as S → ∞

V (X, T ) = C(X, T ) − VDAI(X, T ) = C(X, T ) − C(X, T ) = 0

But this is the problem for the down–and–out option. So

VDAI(S, T ) + VDAO(S, T ) = C(S, T ) (12.54)

This is obvious from a financial as well as a mathematical one.

The American versions of such barrier options exist and can be solved nu-merically.

12.4 Asian options

An Asian option is a contract giving the holder the right to buy an assetfor its average price over some prescribed period, e.g. buying a commodityat fixed times, and selling continually; or continually selling in one currencyand buying raw materials in another–the underlying being the exchange rate.Such options eliminate the need for continuous rehedging.

12.4.1 Continuously sampled averages

As we saw in § 12.1 the general equation for a path dependent option de-pending on the functional:

I =

∫ t

t0

f(S(s), s)ds (12.14)

is given in eqn. 12.17.

In case f = S we obtain a form of the Black–Scholes equation:

−∂V∂t

= (rS −D(S, I, t))∂V

∂S+

1

2σ2S2∂

2V

∂S2+ S

∂V

∂I− rV (12.55)

149

Page 152: Course notes on Financial Mathematics

In this case the continuously sampled average is just:

< S >=1

t− t0I =

1

t− t0

∫ t

t0

S(s)ds (12.56)

and the discrete one is:1

N

N∑

i=1

S(ti) (12.57)

12.4.2 Geometric averaging

The geometric average is defined as:

< S >G= exp < log S >= exp

(1

t− t0

∫ t

t0

logS(s)ds

)(12.58)

This is the limit as N → ∞ of

(N∏

i=1

S(ti)

)1/N

In such a case

I =

∫ t

t0

log S(s)ds (12.59)

and the Black–Scholes equation becomes:

−∂V∂t

= (rS −D(S, I, t))∂V

∂S+

1

2σ2S2∂

2V

∂S2+ logS

∂V

∂I− rV (12.60)

Geometric averages are fairly similar to arithmetic ones.

12.4.3 Discretely sampled averages and jump condi-

tions

Discretely sampled averages naturally give rise to jump conditions. Considerthe discretely sampled arithmetical running sum:

I =

j(t)∏

i=1

S(ti) (12.61)

150

Page 153: Course notes on Financial Mathematics

and let Ii be the value of I for ti < t < ti+1 so that:

Ii = Ii−1 + S(ti) (12.62)

So Ii is constant from t+i to t−i−1, i.e. it is (effectively) a parameter for V .During this period only S is (randomly) varying, so V satisfies the Black–Scholes equation. Since I is discontinuous at a sample ti we have a jumpcondition at ti:

V(S(t+i ), Ii, t

+i

)= V

(S(t−i ), Ii−1, t

−i

)(12.63)

This analysis can be applied to any option which depends on a discretelyupdated parameter. For example, if:

I =

j(t)∑

i=1

log S(ti) (12.64)

thenV(S(t+i ), Ii−1 + log S(t−i ), t+i

)= V

(S(t−i ), Ii−1, t

−i

)(12.65)

To price an Asian option (or a vanilla option with discrete dividend pay-ments), first solve

−∂V∂t

= (rS −D(S, I, t))∂V

∂S+

1

2σ2S2∂

2V

∂S2− rV (12.66)

between sampling dates using V (t−i+1), the value of V immediately beforethe next sampling date ti+1 as final data. Then apply the appropriate jumpcondition, eqn. 12.63 or 12.65 across the current sampling date ti to obtainV (t−i ). Finally repeat this procedure as necessary to arrive at the currentoption price.

The running average

1

j(t)

j(t)∑

i=1

S(ti) =I

j(t)(12.67)

can also be written as:∫ tt0

∑Ni=1 S(s)δ(s− ti)ds∫ t

t0

∑Ni=1 δ(s− ti)ds

(12.68)

151

Page 154: Course notes on Financial Mathematics

It follows that the option price satisfies eqn. 12.17 modified to:

−∂V∂t

= (rS−D(S, I, t))∂V

∂S+

1

2σ2S2∂

2V

∂S2+

N∑

i=1

δ(t− ti)S∂V

∂I− rV (12.69)

Away from the sampling dates ti this equation reduces to eqn. 12.66, soI appears only as a parameter. Across the sampling date ti the equationreduces to:

−∂V∂t

= δ(t− ti)S∂V

∂I(12.70)

so that V is constant along characteristics given by:

dI

dt= δ(t− ti)S (12.71)

i.e. byI − SH(t− ti) = constant (12.72)

This is equivalent to eqn. 12.62, i.e., I increases by S at ti. Thus

V(S(t+i ), Ii, t

+i

)= V

(S(t−i ), Ii−1, t

−i

)(12.63)

as before.

In similar fashion one can write:

exp

1

j(t)

j(t)∑

i=1

logS(ti)

as

exp

(∫ tt0

∑Ni=1 log S(s)δ(s− ti)ds∫ tt0

∑Ni=1 δ(s− ti)ds

)(12.73)

So if

I =

∫ t

t0

N∑

i=1

logS(s)δ(s− ti)ds (12.74)

across a sampling date, the Black–Scholes equation reduces to:

−∂V∂t

= δ(t− ti) logS∂V

∂I(12.75)

152

Page 155: Course notes on Financial Mathematics

so that V is constant along characteristics given by:

dI

dt= δ(t− ti) logS (12.76)

i.e. byI − log SH(t− ti) = constant (12.77)

whenceV(S(t+i ), Ii−1 + log S(t−i ), t+i

)= V

(S(t−i ), Ii−1, t

−i

)(12.65)

for geometric Asian options.

12.4.4 Similarity reductions for arithmetic Asian op-

tions

A similarity reduction of the Black–Scholes equation is possible in case thepayoff function has the form:

SαF

(I

S, t

)

for some constant α and some function F , where, in this case

I =

∫ t

t0

S(s)ds

So assuming continuous averaging, let

ξ =I

S, V = SαW (ξ, t) (12.78)

then the Black–Scholes equation becomes:

−∂W∂t

= [1+((1−α)σ2−r)ξ]∂W∂ξ

+1

2σ2ξ2∂

2W

∂ξ2−(1−α)(r+

1

2ασ2)W (12.79)

with payoff functionW (ξ, 0) = F (ξ) (12.80)

This equation can be solved in terms of special functions but it is morepractical to solve it numerically.

153

Page 156: Course notes on Financial Mathematics

12.4.5 The continuously sampled average strike option

In this case the payoff at expiry is

max

(0, S − 1

τ

∫ te

t0

S(s)ds

)

for a call, and

max

(0,

1

τ

∫ te

t0

S(s)ds− S

)

for a put.

Consider the average strike call, for both European and American versions.For the latter, the payoff for early exercise is

max

(0, S − 1

t− t0

∫ t

t0

S(s)ds

)= Smax

(0, 1 − R

t− t0

)

where

R =1

S

∫ t

t0

S(s)ds (12.81)

So the payoffs for early exercise and at expiry are, respectively:

Smax

(0, 1 − R

t− t0

), and Smax

(0, 1 − R

τ

)

A stochastic differential equation for R(t) can be derived as follows: Fort→ t+ dt we have:

R → R + dR =

∫ t+dtt0

S(s)ds

S + dS(12.82)

Since S satisfies the basic lognormal stochastic differential equation, eqn. 3.19,we can expand eqn. 12.82 to O(dt) to obtain:

dR = [1 − (µ− σ2)R]dt− σRdW (12.83)

Note that this stochastic differential equation does not depend explicitly onS.

154

Page 157: Course notes on Financial Mathematics

Given the structure of the payoff functions introduced above we now postu-late that

V (S,R, t) = SW (R, t) (12.84)

On substituting this postulate into the Black–Scholes equation, in the case ofno dividend payments, and looking for a solution independent of S we find:

∆ = W − R∂W

∂R(12.85)

and∂W

∂t+ (1 − rR)

∂W

∂R+

1

2σ2R2∂

2W

∂R2≤ 0 (12.86)

for a portfolio of one option and −∆ shares of stock S.

Note that in the case of a European option the equality holds, and in thecase of an American option the inequality holds together with the arbitrageconstraint:

W (R, t) ≥ max

(0, 1 − R

t− t0

)≡ Λ(R, t) (12.87)

and the free boundary conditions W and WR continuous at the early exerciseboundary.

In the case of the European option we impose boundary conditions at R = 0and at R = ∞. To determine W (∞, t) is straightforward. Since S is boundedfor finite t the only way that R → ∞ is for S → 0, but then the option willnot be exercised, so

W (∞, t) = 0 (12.88)

To determine W (0, t) is more difficult. However we can see that as R → 0eqn. 12.86 reduces to:

∂W

∂t+∂W

∂R= 0 (12.89)

The set of equations can then be solved numerically, as shown in fig. 12.6below.

155

Page 158: Course notes on Financial Mathematics

W

0.6

0.4

0.2

0.0

0 0.2 0.4 0.6 0.8

R

Figure 12.6: The solution W (R, t) for r = 0.1, σ = 0.4 for a European option.

12.4.6 Put–call parity for the European average strike

option

The payoff at expiry for a portfolio of one European average strike call heldlong and one European average strike option held short is:

Smax(0, 1 − R

τ) − Smax(0,

R

τ− 1) = S(1 − R

τ) (12.90)

The value of this portfolio is identical to one comprising one asset S and afinancial product whose payoff is −SR/τ .To price this product we look for a solution of eqn. 12.86 as:

W (R, t) = a(t) + b(t)R (12.91)

with a(te) = 0 and b(te) = −1/τ . On substituting eqn. 12.91 into 12.86 weobtain:

a(t) = − 1

rτ(1 − exp[−r(te − t)]), b(t) = −1

τexp[−r(te − t)] (12.92)

It follows from the above that:

C(S, T ) − P (S, T ) = S − V (S,R, T ) = S[1 −W (R, T )]

= S[1 − a(T ) − b(T )R] (12.93)

156

Page 159: Course notes on Financial Mathematics

whence

C(S, T ) − P (S, T ) = S[1 − 1

rτ(1 − exp[−rT ])] − 1

τexp[−rT ]

∫ t

t0

S(s)ds

(12.94)where C and P are the prices of the European arithmetic average strikeoption. This is the put–call parity relation for the European average strikeoption.

Note that at expiry this reduces to:

C(S, 0) − P (S, 0) = S

[1 − R(te)

τ

](12.95)

as expected from eqn. 12.90

12.4.7 The American average strike option

The constraint in the case of the American average strike option is:

Λ(R, t) = max(0, 1 − R

t) (12.96)

By determining where and when the function Λ(R, t) satisfies eqn. 12.86 wecan make qualitative statements about the position of the free boundary.

Evidently

∂Λ

∂t+ (1 − rR)

∂Λ

∂R+

1

2σ2R2 ∂

∂R2= −1

t+rR

t+R

t2≡ F (R, t) < 0 (12.97)

Since the American average strike option satisfies eqn. 12.86 with inequality,the free boundary R = RF (t) must be in the region where F ≤ 0. By lookingat the American vanilla option we expect that the free boundary originatesat the point R = RF (tF ) where F (R, tF ) = 0, i.e. where:

R − tF (1 − rR) = 0

i.e. at:

RF (tF ) =tF

1 + rtF(12.98)

157

Page 160: Course notes on Financial Mathematics

We can now perform a local analysis of the free boundary near t = tF andR = RF . Let

W = W − (1 − R

t) (12.99)

For tF − t tF and |R− RF | 1 eqn. 12.86 can be approximated by

∂W

∂t+

1

2σ2RF (tF )2∂

2W

∂R2= (R −RF (tF ))

1 + rtFt2F

(12.100)

This local problem is close to that of the American vanilla option. We canconclude therefore, that:

RF (t) = RF (tF )

(1 + ξ0

√1

2σ2(tF − t) + · · ·

)

∼ tF1 + rtF

(1 + ξ0

√1

2σ2(tF − t) + · · ·

)(12.101)

with ξ = 0.9034. Figure 12.7 shows the result of a numerical computation ofW (R, t) at three months to expiry with three months of averaging alreadycarried out.

R0

W

1

RF

0.5

0.5

Figure 12.7: The solution W (R, t) for r = 0.1, σ = 0.4 for an American option.

158

Page 161: Course notes on Financial Mathematics

12.4.8 Average strike foreign exchange options

These options give the right to by one unit of a foreign currency for theaverage domestic price over some period. Let rE be the exchange rate. Thenthe payoff (measured in foreign currency) is:

max

(0, 1 − 1

rE(te)τ

∫ te

t0

rE(s)ds

)(12.102)

Let

R(t) =

∫ tt0rE(s)ds

rE(t)(12.103)

Then the payoff at expiry can be rewritten as:

max

(0, 1 − R

τ

)

and that at early exercise as:

max

(0, 1 − R

t

)

Given that any option based on such an exchange rate is a function of R andt we can derive a Black–Scholes equation for its value, in the form:

∂V

∂t+ (σ2 − r + rF )R

∂V

∂R+

1

2σ2R2∂

2V

∂R2+∂V

∂R− rV ≤ 0 (12.104)

with strict equality for a European exchange rate option. rF is the continuousinterest rate on the foreign currency.

The delta of this option is then:

∆ = − R

rE

∂V

∂R(12.105)

What are the final and boundary conditions? Evidently

V (R, te) = max

(0, 1 − R

τ

)(12.106)

159

Page 162: Course notes on Financial Mathematics

and as R → ∞, rE → 0, so the option will not be exercised, i.e.

limR→∞

V (R, t) = 0 (12.107)

However as R → 0 eqn. 12.104 reduces to:

∂V

∂t+∂V

∂R− rV = 0 (12.108)

on R = 0.

If the option is American we require the constraint:

V (R, t) ≥ max

(0, 1 − R

t

)(12.109)

Such a problem generates a free boundary R = RF (t) on which V and VRmust be continuous.

12.4.9 Average rate options

Here the asset price is replaced by an average in the payoff function. E.g.An arithmetic average rate call has payoff:

max

(0,I

τ− E

)

at expiry. Such options are more difficult to value since there is no similarityreduction. However the geometric average rate call can be reduced.

12.4.10 Geometric averaging and discrete sampling

Consider a European option with payoff:

V(S, I, 0) = Λ(I)   (12.110)

Eqn. 12.17 with

f(S, t) = log S ∑_{i=1}^{N} δ(t − t_i)

gives:

−∂V/∂t = rS ∂V/∂S + (1/2)σ²S² ∂²V/∂S² + log S ∑_{i=1}^{N} δ(t − t_i) ∂V/∂I − rV   (12.111)

We look for a solution of the form:

V(S, I, t) = F(y, t),   y = (1/N)(I + l(t) log S)   (12.112)

where l(t) is to be determined.

On substituting this into eqn. 12.111 we obtain:

−∂F/∂t = (1/N)(l′(t) log S + l(r − (1/2)σ²)) ∂F/∂y + (1/2)σ²(l²/N²) ∂²F/∂y² + (1/N) ∑_{i=1}^{N} δ(t − t_i) log S ∂F/∂y − rF   (12.113)

The choice:

l(t) = ∫_t^{t_e} ∑_{i=1}^{N} δ(s − t_i) ds   (12.114)

eliminates all the log S terms, leaving:

−∂F/∂t = (l/N)(r − (1/2)σ²) ∂F/∂y + (1/2)σ²(l²/N²) ∂²F/∂y² − rF   (12.115)

The function l(t) is just a staircase set up so that l = 0 at t = t_e, whence y = I/N. Eqn. 12.115 is just a simple Black–Scholes equation with time-dependent volatility and interest rate, and a non-zero time-dependent dividend yield, i.e. eqn. 11.6 with:

σ(t) = σ l(t)/N,   r(t) = r l(t)/N,   D(t) = (1/2)σ² l(t)/N   (12.116)

The following rules show how to price the geometric average rate option (a short numerical sketch is given after the list):

• Apply the Black–Scholes formula for a vanilla option having the same payoff as the geometric average rate option, with S = I/N, e.g. max(0, exp(I/N) − E) ≡ max(0, S − E). Call this formula V_BS(S, r, σ, t).

• Let

σ² → (1/(N²T)) ∫_t^{t_e} σ² l²(s) ds

• Let

r → (1/T) ∫_t^{t_e} [ (r − (1/2)σ²) l(s)/N + (1/2)(σ²/N²) l²(s) ] ds

• Multiply the resulting formula by

exp(−∫_t^{t_e} [ (r − (1/2)σ²) l(s)/N + (1/2)(σ²/N²) l²(s) ] ds)

• Finally let

log S → I/N + (l(t)/N) log S
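As a rough illustration of these rules, here is a minimal numerical sketch (not taken from the notes) that prices a discretely sampled geometric average rate call by plugging an adjusted volatility, rate and discount factor into the vanilla Black–Scholes formula. The staircase l(s) simply counts the sampling dates remaining before expiry; all function names and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, E, r, sigma, T):
    """Vanilla Black-Scholes call value."""
    if T <= 0:
        return max(S - E, 0.0)
    y = (np.log(S / (E * np.exp(-r * T))) / (sigma * np.sqrt(T))
         + 0.5 * sigma * np.sqrt(T))
    return S * norm.cdf(y) - E * np.exp(-r * T) * norm.cdf(y - sigma * np.sqrt(T))

def geometric_average_rate_call(S, I, E, r, sigma, t, te, sample_times):
    """Sketch of the recipe above: adjust sigma^2, r and the discount factor
    using the staircase l(s), then evaluate the vanilla formula at the
    'effective' underlying exp(I/N + l(t)/N * log S)."""
    T = te - t
    N = len(sample_times)
    s = np.linspace(t, te, 2001)
    # l(s) = number of sampling dates in (s, te]
    l = np.array([np.sum(np.asarray(sample_times) > si) for si in s])
    sigma2_eff = np.trapz(sigma**2 * l**2, s) / (N**2 * T)
    drift_int = np.trapz((r - 0.5 * sigma**2) * l / N
                         + 0.5 * sigma**2 * l**2 / N**2, s)
    r_eff = drift_int / T
    S_eff = np.exp(I / N + l[0] / N * np.log(S))
    return np.exp(-drift_int) * bs_call(S_eff, E, r_eff, np.sqrt(sigma2_eff), T)

# illustrative parameters (assumptions, not from the notes)
print(geometric_average_rate_call(S=100.0, I=0.0, E=100.0, r=0.05,
                                  sigma=0.2, t=0.0, te=1.0,
                                  sample_times=np.linspace(0.1, 1.0, 10)))
```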

Chapter 13

Bond pricing

A bond (paid upfront) yields a known amount on a known date in the future–the maturity date t = t_m, at which the time to maturity T = t_m − t is zero. The bond may also pay a known cash dividend (the coupon) at fixed times during the life of the contract. If there is no dividend the bond is called zero coupon. Bonds may be issued by governments or by companies to raise capital. The upfront premium is equivalent to a loan to the government or company. How much should such a bond be worth? That is, how much should one pay now to obtain a guaranteed $1 in ten years time?

In order to motivate the formulation of the equations needed to answer this question we consider the following scenario [see Cox & Rubinstein]: Suppose a holding company–Hyde Park Holdings (HPH)–has all its funds invested in Apple common stock: 10000 shares, each worth $12.70 on a certain date t. HPH also has two kinds of corporate securities outstanding: (a) 10000 shares of its own common stock, and (b) 120 zero coupon bonds, each to pay $1000 on the maturity date t_m > t. At this time HPH's stockbrokers plan to pay off the bonds by floating a new debt issue, but if the Apple stock is worth less than $12 per share they will be unable to do so–no one is going to pay $120000 for a new debt issue if the underlying asset is not worth at least that amount. For the same reason HPH cannot raise enough to pay off the bondholders by selling more of its own common stock. Since HPH has limited liability its stockholders are under no obligation to keep it solvent–they will let it go bankrupt–and the ownership of the Apple stock will then pass to the bondholders. Conversely, if the Apple stock is worth more than $12 per share at maturity, the stockholders can sell enough of the common stock to pay off the bondholders. Let S denote the $ value of a share of Apple stock, S_HPH that of HPH's common stock and B_HPH that of the zero coupon bond. At maturity the following table obtains:

            S ≤ 12     S > 12
  S_HPH       0        S − 12
  B_HPH       S          12

Table 13.1: Payoff table at maturity

Table 13.1: Payoff table at maturity

How much should S_HPH and B_HPH be worth at some time T = t_m − t before maturity? Evidently:

120 B_HPH(T) + 10000 S_HPH(T) = 10000 S(T) = 127000   (13.1)

However it is clear from Table 13.1 that S_HPH(T) is the value of a call option on one share of Apple stock with strike price $12, expiring at t_m, with payoff max(0, S(t_m) − 12). Similarly 120 B_HPH(T) equals the position of a covered writer who owns 10000 shares of Apple stock and has written 10000 such calls against them.

Now consider the following closing call option quote at time T:

  stock    strike price    call price    stock price
  Apple        12            2 1/10         12.70

Table 13.2: Call option quotes on Apple stock at time T

So the closing price of S_HPH(T) should equal the market price of the call option on S, i.e. $2.10 per share, whence from eqn. 13.1:

120 B_HPH(T) = 10000 S(T) − 10000 S_HPH(T) = 127000 − 21000 = 106000   (13.2)

whence

B_HPH(T) = 883.33   (13.3)

In fact there is more financial data on closing call prices on Apple stock at time T:

  stock    strike price    call price    stock price
  Apple        13            1 3/5          12.70
  Apple        14            1 3/8          12.70
  Apple        15            0 3/4          12.70

Table 13.3: Further call option quotes on Apple stock at time T

So repeating the above calculation we obtain:

  Bond payments    Bonds     Common stocks    Apple stocks
     120000        106000        21000           127000
     130000        111000        16000           127000
     140000        115625        11375           127000
     150000        119500         7500           127000

Table 13.4: Total payments and values at time T

The bondholder payments can also be thought of as equivalent to those received by someone who owns a default-free zero coupon bond worth $111830 (the bond issued by HPH is not default free), and who has written 10000 European put options on S with the following financial data:

  stock    strike price    put price    stock price
  Apple        12             7/12         12.70

Table 13.5: Put option quotes on Apple stock at time T

whence

120 B_HPH(T) = 111830 − 5830 = 106000   (13.4)

Let B be the promised payment to the bondholders, n the number of shares outstanding, and B^T_HPH the total value of the bonds. The following table summarizes the calculations above:

              S(t_m) ≤ B    S(t_m) > B    Call representation
  n S_HPH         0         S(t_m) − B    C(S(t_m), B)
  B^T_HPH      S(t_m)           B         S(t_m) − C(S(t_m), B)

Table 13.6: Summary table

The payoffs can also be represented in graphical form, as shown in fig. 13.1.

Figure 13.1: (a) Common stock payoff, (b) bond payoff. See text for details.

By inspection of these payoff functions it is evident that:

n S_HPH(0) + B^T_HPH(0) = S(0)   (13.5)

where S(0) is the market price of the Apple stock owned by HPH, and therefore a measure of the market value of the holding company. [If HPH were a manufacturing company, its market value would be equal to the market value of all its securities.] So we can think of HPH's common stock as a call on S, the total value of the company's securities, i.e. on the company's value. If S varies continuously with constant volatility σ and constant interest rate r, then the Black–Scholes formula gives the correct current value of the common stock S_HPH(T) and also of the bond value B^T_HPH(T), i.e.

n S_HPH(T) = S Φ(y) − B e^{−rT} Φ(y − σ√T)
B^T_HPH(T) = S Φ(−y) + B e^{−rT} Φ(y − σ√T)   (13.6)

with

y = log[S/(B e^{−rT})]/(σ√T) + (1/2)σ√T   (13.7)

Evidently n S_HPH(T) + B^T_HPH(T) = S[Φ(y) + Φ(−y)] = S(T), as expected.

A comparison of the above formula with the Black–Scholes formula (see eqn. 8.81) indicates that there is a direct correspondence between C(S, T) and B^T_HPH(T), and therefore the latter satisfies the Black–Scholes equation:

−∂B^T_HPH/∂t = rS ∂B^T_HPH/∂S + (1/2)σ²S² ∂²B^T_HPH/∂S² − rB^T_HPH   (13.8)
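As a quick check of eqn. 13.6, the following sketch (illustrative only; the asset value and promised payment are the HPH figures used above, while the volatility, rate and time to maturity are assumptions) evaluates the stock and bond values and verifies that they add up to the asset value:

```python
import numpy as np
from scipy.stats import norm

def firm_stock_and_bond(S, B, r, sigma, T):
    """Eqn. 13.6: equity as a call on the firm value S, debt as the remainder."""
    y = np.log(S / (B * np.exp(-r * T))) / (sigma * np.sqrt(T)) + 0.5 * sigma * np.sqrt(T)
    equity = S * norm.cdf(y) - B * np.exp(-r * T) * norm.cdf(y - sigma * np.sqrt(T))
    debt = S * norm.cdf(-y) + B * np.exp(-r * T) * norm.cdf(y - sigma * np.sqrt(T))
    return equity, debt

# HPH example: 10000 Apple shares at $12.70, promised payment B = $120000.
# sigma, r and T below are illustrative assumptions, not values given in the notes.
S, B = 127000.0, 120000.0
equity, debt = firm_stock_and_bond(S, B, r=0.05, sigma=0.25, T=0.5)
print(equity, debt, equity + debt)   # equity + debt should equal S
```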

13.1 Another derivation of the bond equation

We can also derive the bond equation more directly from arbitrage arguments and stochastic calculus. Let V be the price of a bond. Suppose r and the coupon payment B are known functions of t; then V will also be a function of t (V also depends on t_m, but we ignore this for the present). Suppose V(t_m) = V_m and we hold one bond. Then the change in the bond price from t to t + dt is:

dV = (dV/dt) dt   (13.9)

If we also receive a coupon payment B(t)dt during this period, the bond price changes by the amount:

(dV/dt + B(t)) dt   (13.10)

Arbitrage considerations imply that this change equals

r(t)V(t) dt   (13.11)

the return from a bank deposit receiving interest at a rate r(t), i.e.

dV/dt + B(t) = r(t)V   (13.12)

The solution of this ordinary differential equation is easily obtained via the integrating factor:

exp[−∫^t r(s) ds]   (13.13)

and is:

V(t) = exp[−∫_t^{t_m} r(s) ds] (V_m + ∫_t^{t_m} B(s) exp[∫_s^{t_m} r(u) du] ds)   (13.14)

Evidently V(t_m) = V_m, as expected.

Now suppose that there exist zero coupon bonds with all possible maturity dates, so B(t) = 0 and therefore:

V(t) ≡ V(t, t_m) = V_m exp[−∫_t^{t_m} r(s) ds]   (13.15)

If bond prices are quoted today for all maturity dates, we know V(t, t_m), so we can invert the above equation to give:

−∫_t^{t_m} r(s) ds = log [V(t, t_m)/V_m]   (13.16)

If V(t, t_m) varies continuously with t_m, then we can differentiate the above equation to give:

r(t_m) = −(1/V) ∂V/∂t_m   (13.17)

So if market prices of the zero coupon bonds reflect a known deterministic interest rate, then that rate at future dates is given from the bond prices by eqn. 13.17. Since both r and V are positive:

∂V/∂t_m < 0   (13.18)

i.e. the longer the bond has to run to maturity, the less it is worth.

13.1.1 The yield curve

There is another estimator of future interest rates, namely the yield curve. Given data on V(t, t_m), let:

Y(t) ≡ Y(t, t_m) = −(1/T) log [V(t, t_m)/V_m]   (13.19)

This is the yield curve Y as a function of the maturity date t_m. The dependence of Y on T is called the term structure of interest rates.

• This definition has some advantages over that of eqn. 13.17, since (a) a continuous distribution of bonds with all maturity dates is not required, and (b) V(t, t_m) doesn't have to be differentiable.

• Y(t, t_m) has the dimensions of an interest rate, i.e. inverse time. If r is constant the two definitions coincide.

Fig. 13.2 shows term structures for three yield curves. Type (a) is the most common–future interest rates are higher than the short term rate, since it should be more rewarding to tie up money for a long time than for a short time. Type (b) is typical of periods when the short term rate is high but expected to fall. Type (c) is similar to type (b).

Figure 13.2: Term structures of yield curves.
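A tiny sketch of eqn. 13.19 (purely illustrative; the quoted bond prices below are made-up numbers, not market data) computes the yield curve from zero coupon bond prices:

```python
import numpy as np

def yield_curve(prices, maturities, Vm=1.0, t=0.0):
    """Eqn. 13.19: Y(t, tm) = -(1/T) * log(V(t, tm)/Vm), with T = tm - t."""
    T = np.asarray(maturities) - t
    return -np.log(np.asarray(prices) / Vm) / T

# made-up quotes for zero coupon bonds paying Vm = 1 at maturities of 1..5 years
maturities = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([0.95, 0.90, 0.85, 0.80, 0.76])
print(yield_curve(prices, maturities))   # annualized yields as a function of T
```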

13.1.2 Stochastic interest rates

Let r be the interest rate on the shortest possible deposit–the so-called spot rate. We model this as a random process given by the stochastic differential equation:

dr = u(r, t) dt + ω(r, t) dW   (13.20)

Given such an interest rate, a bond has the price V(r, t, t_m).

13.1.3 The bond equation

To price a bond we set up a hedge using bonds of differing maturity dates (since there is no underlying asset with which to hedge). Thus:

Π = V₁(t, t_{m1}) − Δ V₂(t, t_{m2})   (13.21)

The change in this portfolio in time dt is given by Ito's formula, and is:

dΠ = (∂V₁/∂t + (1/2)ω² ∂²V₁/∂r²) dt + (∂V₁/∂r) dr − Δ(∂V₂/∂t + (1/2)ω² ∂²V₂/∂r²) dt − Δ(∂V₂/∂r) dr   (13.22)

Evidently the choice:

Δ = (∂V₁/∂r)/(∂V₂/∂r)   (13.23)

eliminates the random component in dΠ. Eqn. 13.22 then becomes:

dΠ = [∂V₁/∂t + (1/2)ω² ∂²V₁/∂r² − (∂V₁/∂r)/(∂V₂/∂r) (∂V₂/∂t + (1/2)ω² ∂²V₂/∂r²)] dt   (13.24)

However, arbitrage arguments imply that

dΠ = rΠ dt = r[V₁ − ΔV₂] dt = r[V₁ − (∂V₁/∂r)/(∂V₂/∂r) V₂] dt   (13.25)

It follows from these two equations that:

(∂V₁/∂t + (1/2)ω² ∂²V₁/∂r² − rV₁)/(∂V₁/∂r) = (∂V₂/∂t + (1/2)ω² ∂²V₂/∂r² − rV₂)/(∂V₂/∂r)   (13.26)

But the left hand side of this equation is, in general, a function of t_{m1} and the right hand side of t_{m2}. The only way for this to be possible is if neither side depends on t_m, i.e. if:

∂V/∂t + (1/2)ω² ∂²V/∂r² − rV = a(r, t) ∂V/∂r   (13.27)

for some function a(r, t).

We now assume that a(r, t) can be expanded as:

a(r, t) = λω(r, t) − u(r, t)   (13.28)

for given ω ≠ 0 and u, with λ ≡ λ(r, t) yet to be determined.

It follows that:

−∂V/∂t = (u − λω) ∂V/∂r + (1/2)ω² ∂²V/∂r² − rV   (13.29)

To solve eqn. 13.29 we need one final and two boundary conditions. The final condition is simply

V(r, 0) = V_m   (13.30)

(the argument 0 denoting zero time to maturity). The boundary conditions depend on the form of u and ω [see later].

If there are coupon payments, eqn. 13.29 becomes:

−∂V/∂t = (u − λω) ∂V/∂r + (1/2)ω² ∂²V/∂r² − rV + B   (13.31)
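Eqn. 13.29 can be solved numerically once u, ω and λ are specified. Below is a minimal explicit finite-difference sketch (an illustration under stated assumptions, not the notes' method): a zero coupon bond under a Vasicek-type model with constant risk-adjusted drift u − λω = δ − γr and constant volatility ω, marched in time-to-maturity from the final condition V = V_m.

```python
import numpy as np

def vasicek_zero_coupon_fd(Vm=1.0, T=5.0, delta=0.05, gamma=0.5, omega=0.02,
                           r_max=0.30, nr=300, nt=20000):
    """Explicit FD sketch for eqn. 13.29 with risk-adjusted drift
    u - lambda*omega = delta - gamma*r and constant volatility omega.
    Returns the grid of spot rates and V(r) at time-to-maturity T."""
    r = np.linspace(0.0, r_max, nr + 1)
    dr = r[1] - r[0]
    dt = T / nt                       # small enough for explicit stability here
    V = np.full_like(r, Vm)           # final condition at maturity
    drift = delta - gamma * r
    for _ in range(nt):               # march in time-to-maturity
        Vr = np.gradient(V, dr)
        Vrr = np.zeros_like(V)
        Vrr[1:-1] = (V[2:] - 2 * V[1:-1] + V[:-2]) / dr**2   # crude boundaries
        V = V + dt * (drift * Vr + 0.5 * omega**2 * Vrr - r * V)
    return r, V

r, V = vasicek_zero_coupon_fd()
print(np.interp(0.05, r, V))   # price of the ten-year-style bond at r = 5%
```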

13.1.4 The market price of risk

Instead of the hedged portfolio constructed above, suppose we hold just one bond with maturity date t_m. In time dt this bond changes in value by an amount:

dV = (∂V/∂t + (1/2)ω² ∂²V/∂r²) dt + (∂V/∂r) dr
   = (∂V/∂t + (1/2)ω² ∂²V/∂r²) dt + (∂V/∂r)(u dt + ω dW)
   = (∂V/∂t + (1/2)ω² ∂²V/∂r² + u ∂V/∂r) dt + ω (∂V/∂r) dW
   = (λω ∂V/∂r + rV) dt + ω (∂V/∂r) dW   (13.32)

using eqn. 13.29.

We can rewrite this as:

dV − rV dt = ω (∂V/∂r)(λ dt + dW)   (13.33)

Because of the term ωV_r dW this portfolio is not riskless. The right hand side may be interpreted as the excess return above the risk free rate, given a certain level of risk. In return the portfolio profits by an extra amount λ dt per unit of extra risk dW. So λ is called the market price of risk.

Now consider the hedged portfolio:

Π = V₁ − ΔV₂   (13.34)

where V ≡ V(S, t) and

dS = µS dt + σS dW

From the above analysis it follows that:

−∂V/∂t = (µ − λ_S σ)S ∂V/∂S + (1/2)σ²S² ∂²V/∂S² − rV   (13.35)

with λ_S in place of λ.

But hedging options is easier than hedging bonds because of the existence of an underlying asset. It therefore follows that V = S must be a solution of eqn. 13.35, hence:

(µ − λ_S σ)S − rS = 0

i.e.

λ_S = (µ − r)/σ   (13.36)

This is the market price of risk. If we substitute this relation into eqn. 13.35 we obtain:

−∂V/∂t = rS ∂V/∂S + (1/2)σ²S² ∂²V/∂S² − rV

the Black–Scholes equation!

We also note that this choice of λ_S is precisely what we needed in eqn. 8.59 to change the drift term µS dt into rS dt in the lognormal stochastic differential equation for S. Once again this indicates the deep connection between the pricing of options, and now of bonds, and fundamental aspects of stochastic calculus.

13.2 Solving the bond pricing equation

What form do u and ω take? Suppose we assume that:

u = −γr + δ + λω,   ω² = αr − β   (13.37)

where α, β, γ and δ are functions of t which can be used to fit empirical data. By suitable choices of these functions we can ensure that the random walk in r defined in eqn. 13.20 has the following properties:

• r ≥ β/α > 0, so the spot rate cannot become negative;

• for large r the (risk-adjusted) drift δ − γr is negative, so r is mean reverting;

• if r = β/α the volatility vanishes and the drift pushes r back up, so the lower bound cannot be crossed.

These conditions require the additional constraint:

δ ≥ βγ/α + (1/2)α

Eqns. 13.20 and 13.37 cover the models of:

• Vasicek (1977): α = 0; β, γ, δ time-independent

• Cox et al. (1985): β = 0; α, γ, δ time-independent

• Hull & White (1990): α or β = 0, with the remaining parameters time-dependent

Given eqn. 13.37, the boundary conditions for eqn. 13.29 are:

V → 0 as r → ∞;   V(β/α, t) < ∞   (13.38)

All these models lead to a solution of eqn. 13.29 in the form:

V(r, T) = V_m A(T) exp[−rB(T)]   (13.39)

with

(1/A) dA/dt = δB + (1/2)βB²;   dB/dt = (1/2)αB² + γB − 1   (13.40)

and final conditions A(0) = 1, B(0) = 0, to satisfy V(r, 0) = V_m.

In the case of constant coefficients α, β, γ and δ, exact solutions can be obtained as:

B(T) = 2(exp[ψ₁T] − 1) / [(γ + ψ₁)(exp[ψ₁T] − 1) + 2ψ₁]   (13.41)

and

(2/α) log A(T) = aψ₂ log(a − B) + (ψ₁ − (1/2)β) b log(1 + B/b) + (1/2)βB − aψ₂ log a   (13.42)

where

b, a = (±γ + √(γ² + 2α))/α

and

ψ₁ = √(γ² + 2α),   ψ₂ = (δ + (1/2)αβ)/(a + b)

Note that:

• If all parameters are constant, both A and B depend only on T = t_m − t, the time to maturity.

• As T → ∞, B → 2/(γ + ψ₁).

• As T → ∞, the yield curve Y(t, t_m) = Y(T) tends to:

Y(∞) = (2/(γ + ψ₁)²)[δ(γ + ψ₁) + β]

which is independent of Y(0), the spot rate.
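To see these formulas in action one can integrate the ordinary differential equations 13.40 directly, rather than relying on the closed forms 13.41–13.42. The sketch below is illustrative (the parameter values are arbitrary assumptions, and the sign flip when switching from calendar time to time-to-maturity is my interpretation of eqn. 13.40's conventions); it integrates for A(T) and B(T) with constant coefficients and prints the bond price V = V_m A exp(−rB) and the corresponding yield.

```python
import numpy as np
from scipy.integrate import solve_ivp

def bond_price_affine(r, T, alpha, beta, gamma, delta, Vm=1.0):
    """Integrate eqn. 13.40 for A(T), B(T) with constant coefficients and
    return V = Vm * A * exp(-r*B).  We integrate in time-to-maturity
    T = tm - t, which flips the signs relative to the calendar-time form
    (an assumption about the intended convention)."""
    def rhs(T_, y):
        logA, B = y
        dB = 1.0 - gamma * B - 0.5 * alpha * B**2
        dlogA = -(delta * B + 0.5 * beta * B**2)
        return [dlogA, dB]
    sol = solve_ivp(rhs, [0.0, T], [0.0, 0.0], rtol=1e-8, atol=1e-10)
    logA, B = sol.y[:, -1]
    return Vm * np.exp(logA) * np.exp(-r * B)

# arbitrary CIR-like parameters (beta = 0) -- assumptions, not fitted values
alpha, beta, gamma, delta = 0.02, 0.0, 0.5, 0.025
r, T = 0.05, 10.0
V = bond_price_affine(r, T, alpha, beta, gamma, delta)
print(V, -np.log(V) / T)   # price and the corresponding yield Y(T)
```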

13.2.1 Fitting the parameters

The market's expectation about future interest rates is time varying. So let us keep the parameters α, β and γ constant, but let δ ≡ δ(t). This allows us to fit the yield curve exactly. We first determine α and β from historical data, using the smallest observed value of the stochastic interest rate r(t) and its volatility. We then find γ using the correlation between the spot rate and the slope of the yield curve. Finally we choose δ(t) to fit the full yield curve exactly, via the solution of a certain integral equation.

As an example consider the pricing of a ten year bond.

• Suppose the minimum value of r has been estimated (possibly from historical data)–so the ratio of α to β is known.

• The spot rate volatility is √(αr − β), which is easy to estimate provided it is stationary.

• Eqn. 13.40 can be solved with δ = δ(T) for small T. In such a case:

Y(T) = (1/T)(rB − log A)

which can be approximated for small T as:

Y(0+) ∼ r − (1/2)(γr − δ(0))T + ···

so that the slope of the yield curve at the origin is

s = dY/dT |₀ = (1/2)(δ(0) − γr)   (13.43)

• So if γ > 0, an increase in r lowers s and a decrease in r raises s, i.e. the spot rate is mean reverting.

• Since r follows a random walk, so does s; in fact:

ds = −(1/2)γ dr

• From data we can therefore estimate γ from the relation:

⟨ds dr⟩ / ⟨dr²⟩ = −(1/2)γ   (13.44)

However the value of γ estimated this way may be negative, in which case the spot rate may not be mean reverting.

• Eqn. 13.40 for A can be integrated directly to give

log A(T) = −(1/2)β ∫_t^{t_m} B²(t_m − s) ds − ∫_t^{t_m} δ(s)B(t_m − s) ds   (13.45)

where the only unknown is δ(s). If we try to fit Y at some time t*, with the spot rate r* and known yield curve Y*, the four parameters are α*, β*, γ* and δ*, and

Y* = (1/T)(r*B − log A)

whence

∫_{t*}^{t_m} δ*(s)B(t_m − s) ds = (r*B − Y*)T* − (1/2)β* ∫_{t*}^{t_m} B²(t_m − s) ds   (13.46)

for t* ≤ t_m < ∞. This equation can be solved numerically for δ*(t) (or even analytically, since it is a Volterra equation of convolution type; however B does not have a simple Laplace transform).

• So now α, β, γ and δ are known, whence B and A can be determined numerically.

• Finally we arrive at the bond price formula:

V_m A(T) exp[−rB(T)]

• The model is strictly parameter dependent. If the parameters are non-stationary it becomes invalid.

• The extended Vasicek model (Hull & White 1990) has α = 0, β < 0, whence

B(T) = (1/γ)(1 − exp[−γT])

and

∫_{t*}^{t_m} δ*(s)(1 − exp[−γ(t_m − s)]) ds = F*(t_m)   (13.47)

where F*(t_m) is given by the right hand side of eqn. 13.46. For eqn. 13.47 to have a solution we require F*(0) = 0. Differentiating eqn. 13.47 twice with respect to t_m we obtain:

δ*(t_m) = F*′(t_m) + (1/γ)F*″(t_m)   (13.48)

13.3 Interest rate products

13.3.1 Bond options

A bond option is identical to an equity option except that the underlying asset is a bond. As an example, consider a call option with strike price E and expiration date t_e on a zero coupon bond with maturity date t_m ≥ t_e. Let V_B(r, t, t_m) be the value of such a bond, satisfying eqn. 13.29 with V_B(t_m) = V_m. Let C_B(r, t) be the value of the call option on this bond. Since C_B depends on the random variable r, it must also satisfy eqn. 13.29, with final condition:

C_B(r, 0) = max(0, V_B − E)

13.3.2 Swaps and caps

An interest rate swap is an agreement between two parties to exchange the interest rate payments on a certain amount–the principal–for a certain duration. One party pays the other a fixed rate of interest in return for a variable interest rate payment. For example, A pays 9% of $1000000 per annum to B, and B pays r of $1000000 per annum to A, for three years. What is the value of such a swap? Suppose A pays r*Z to B and B pays rZ to A; r* is fixed, and r varies. Let the value of the swap to A be ZV(r, t). In a time interval dt, A receives Z(r − r*)dt net. If we think of this payment as similar to a coupon payment on a simple bond, then V satisfies the equation:

−∂V/∂t = (u − λω) ∂V/∂r + (1/2)ω² ∂²V/∂r² − rV + r − r*   (13.49)

with final data V(r, 0) = 0. Since r − r* can be negative as well as positive, V need not be positive. So a swap is not necessarily an asset; depending on the yield curve it can be a liability.

An interest rate cap is a loan on which the floating interest rate paid is capped at r* (the cap), i.e. the rate paid is at most r*. The loan Z is repayable at time t_m. The value of the capped loan is ZV(r, t), where V(r, t) satisfies:

−∂V/∂t = (u − λω) ∂V/∂r + (1/2)ω² ∂²V/∂r² − rV + min(r, r*)   (13.50)

with final condition V(r, 0) = 1.

An interest rate floor is similar to a cap, except that the interest rate satisfies r ≥ r* (the floor). To price this contract, simply replace min by max in eqn. 13.50.

13.3.3 Swaptions, captions, and floortions

It is easy to value options on these financial instruments—called swaptions, captions and floortions. Suppose that our swap, cap or floor, which expires at time t_s, has the value V_s(r, t) for t ≤ t_s. An option to buy this swap (a call swaption) for an amount E at t < t_s has value V(r, t) satisfying eqn. 13.29 with final condition

V(r, 0) = max(0, V_s(r, 0) − E)

Thus we first solve for the value of the swap itself, V_s(r, t), and then use this as final data for the value V(r, t) of the swaption.

Similar arguments obtain for captions and floortions.

13.4 Convertible bonds

A convertible bond is like an ordinary bond, except that it may be exchanged for a specified asset at any time of the owner's choosing. This exchange is called conversion. The convertible bond on an underlying asset (value S) returns V_m at maturity unless at a time t < t_m the owner has converted the bond into n units of the asset. The bond may also pay a coupon to its holder.

It follows that V = V(S, t). Repeating the Black–Scholes analysis with a portfolio comprising one convertible bond and −Δ assets:

Π = V − ΔS   (13.51)

we find:

dΠ = (∂V/∂t + [B(S, t) − ΔD(S, t)] + (1/2)σ²S² ∂²V/∂S²) dt + (∂V/∂S − Δ) dS

where B dt is a coupon payment on the bond and D dt is a dividend on the asset.

Evidently

Δ = ∂V/∂S

eliminates the risk from the portfolio. Since the return on such a portfolio is at most that from a bank deposit, it follows that:

∂V/∂t + (rS − D) ∂V/∂S + (1/2)σ²S² ∂²V/∂S² − rV + B ≤ 0   (13.52)

for the bond price V. This is just the Black–Scholes inequality with the addition of a coupon payment term. The final condition is:

V(S, 0) = V_m

In addition, since the bond may be converted into n assets,

V ≥ nS   (13.53)

Finally, both V and V_S must be continuous functions of S.

Thus a convertible bond is similar to an American option. However the final condition does not satisfy the pricing constraint. Thus although V(S, 0) = V_m,

V(S, 0⁻) = max(nS, V_m)   (13.54)

The relevant boundary conditions are:

lim_{S→∞} V(S, t) = nS;   V(0, t) = V_m exp[−r(t_m − t)]   (13.55)

The second condition assumes that it is not optimal to convert when S = 0.

Figure 13.3: The value of a convertible bond. V_m = 1, n = 1, r = 0.1, σ = 0.25, T = 1 year, B = 0; (a) D = 0, (b) D = 0.05.

Eqn. 13.52 can be solved numerically in the manner of the American option problem, by solving a free boundary problem. It can be shown that dD > 0 makes early exercise more likely, whereas dB > 0 makes it less likely. When D = B = 0 the constraint in eqn. 13.52 only comes into play at expiration, and the convertible bond can be priced explicitly as '$ plus a European call option'. Fig. 13.3 shows an example of such a calculation. Note that in fig. 13.3(a) D = 0, but in fig. 13.3(b) D ≠ 0, so there is a free boundary, and therefore for S ≥ S_F the bond should be converted. Note also that sometimes bond conversion is permitted only during specified periods.

13.4.1 Call and put aspects of convertible bonds

Some bonds give the issuing company the right to re-purchase them at any time (or during specified periods) for a fixed amount. Such a bond is worth less than one without such a call feature. Let such a bond be repurchasable for $M₁. The no-arbitrage condition implies that:

V(S, T) ≤ M₁   (13.56)

Since eqn. 13.53 also holds for such a convertible bond, an obstacle problem exists.

One can easily incorporate intermittent conversion, and put instead of call features, allowing the bond holder to return the bond to the company for an amount $M₂. Again, arbitrage arguments require that:

V(S, T) ≥ M₂   (13.57)

13.4.2 Convertible bonds with random interest rates

When interest rates are random, a convertible bond has a value

V ≡ V(S, r, t)

with S and r random variables satisfying, respectively, the stochastic differential equations:

dS/S = µ dt + σ dW₁   (3.19)

and

dr = u(r, t) dt + ω dW₂   (13.20)

where dW₁ and dW₂ are drawn from Gaussian distributions with zero mean and variance dt, and are correlated such that:

⟨dW₁ dW₂⟩ = ρ dt   (13.58)

with |ρ(S, r, t)| ≤ 1.

In order to analyze this situation we need to extend Ito's lemma, eqn. 3.16, to the case of two correlated random variables. We use the following heuristics:

dW₁² = dt,   dW₂² = dt,   dW₁ dW₂ = ρ dt   (13.59)

and apply Taylor's theorem to V(S, r, t) to obtain:

dV = ∂V/∂t dt + ∂V/∂S dS + ∂V/∂r dr + (1/2) ∂²V/∂S² dS² + (1/2) ∂²V/∂r² dr² + ∂²V/∂S∂r dS dr + O(dt²)

To leading order,

dS² = σ²S² dW₁² = σ²S² dt
dr² = ω² dW₂² = ω² dt
dS dr = σSω dW₁ dW₂ = σSωρ dt   (13.60)

whence:

dV = (∂V/∂t + (1/2)σ²S² ∂²V/∂S² + ρσSω ∂²V/∂S∂r + (1/2)ω² ∂²V/∂r²) dt + ∂V/∂S dS + ∂V/∂r dr   (13.61)

This is the bivariate generalization of Ito's lemma. We use this result to price a convertible bond by constructing the portfolio:

Π = V₁ − Δ₂V₂ − ΔS   (13.62)

The first bond matures at T₁ and the second at T₂.

The analysis is as before–the choices:

Δ₂ = (∂V₁/∂r)/(∂V₂/∂r)

and

Δ = ∂V₁/∂S − Δ₂ ∂V₂/∂S

eliminate risk from such a portfolio. Eventually one obtains the price equation:

∂V/∂t + rS ∂V/∂S + (u − λω) ∂V/∂r − rV + (1/2)σ²S² ∂²V/∂S² + ρσSω ∂²V/∂S∂r + (1/2)ω² ∂²V/∂r² = 0   (13.63)

with λ(S, r, t) the market price of risk.

This is the required bond pricing equation. It generalizes both the Black–Scholes equation (u = ω = 0) and the simple bond pricing equation (∂/∂S = 0). If there are dividends and coupon payments the equation becomes:

∂V/∂t + (rS − D) ∂V/∂S + (u − λω) ∂V/∂r − rV + B + (1/2)σ²S² ∂²V/∂S² + ρσSω ∂²V/∂S∂r + (1/2)ω² ∂²V/∂r² = 0   (13.64)

The final condition and American option constraints are as before. The boundary conditions are a bit more complicated. For a convertible bond with no call feature, V(S, r, t) → nS as S → ∞, while V(0, r, t) is given by the solution of a simple bond equation; V(S, r, t) → 0 as r → ∞; and finally V(S, r, t) < ∞ for r = r_min.

13.4.3 The issue of new shares

Converting a bond into n shares requires the company to issue n new shares (unlike a stock option). Let S be the value of the asset and N the number of shares before conversion. The convertible bond pricing equation (with known or stochastic r) is to be solved subject to the constraints:

nS/(n + N) ≤ V ≤ S,   V(S, 0) = V_m   (13.65)

The constraint in eqn. 13.65 bounds V below by its value on conversion, and above by the value of the asset–thus allowing the company to declare bankruptcy if the bond becomes too valuable. The factor:

n/(n + N) ≡ d   (13.66)

is called the dilution. Fig. 13.4 shows the price of such a convertible bond for a 50% dilution. In the limit n/N → 0 this model becomes identical to the previous one. Note that the total value of the company issuing the bond is S − V and the share price is (S − V)/N.

Figure 13.4: The value of a convertible bond following dilution. V_m = 1, r = 0.1, σ = 0.25, D = 0.05, T = 1 year, d = 0.5.

Chapter 14

Transaction costs

Since the Black–Scholes equation applies to a portfolio which is re-hedged continuously, if the costs (e.g. the bid–ask spread on S–the difference between price quotes to buy or sell S) are independent of the time-scale of rehedging, then the total transaction costs may become unbounded. In addition, such costs can vary from individual to individual, depending on the size of their respective portfolios. So, contrary to the Black–Scholes theory, the value of an option is not unique, but depends upon the investor.

14.0.4 A modified Black–Scholes equation

We follow the analysis of Leland (1985) to incorporate transaction costs into the Black–Scholes formulation:

• The portfolio is revised every Δt, where Δt is a fixed time step.

• In discrete time we have the random walk

ΔS = µSΔt + σSz√Δt   (14.1)

• Transaction costs in buying and selling the asset S are proportional to S–so if ν shares are bought (ν > 0) or sold (ν < 0) at a price S, then the transaction costs are

κ|ν|S

with κ depending on the individual investor.

• The hedged portfolio has an expected return equal to that from a bank deposit.

Let Π be the value of the hedged portfolio; then by the usual analysis we have:

ΔΠ = (∂V/∂t + µS ∂V/∂S + (1/2)σ²z²S² ∂²V/∂S² − µΔS) Δt + σS(∂V/∂S − Δ) z√Δt − κ|ν|S   (14.2)

Note that we cannot take the usual limit Δt → 0; however we can hedge, so let:

Δ = ∂V(S, t)/∂S   (14.3)

and then re-hedge after a finite interval Δt, so that:

Δ = ∂V(S + ΔS, t + Δt)/∂S   (14.4)

Evidently the difference

∂V(S + ΔS, t + Δt)/∂S − ∂V(S, t)/∂S

gives the number of assets traded, i.e. ν.

But

∂V(S + ΔS, t + Δt)/∂S = ∂V/∂S + ΔS ∂²V/∂S² + Δt ∂²V/∂S∂t + ···
                      = ∂V/∂S + (µSΔt + σzS√Δt) ∂²V/∂S² + ···
                      = ∂V/∂S + σzS (∂²V/∂S²)√Δt + O(Δt)   (14.5)

so:

ν = σzS (∂²V/∂S²)√Δt   (14.6)

It follows that

⟨κ|ν|S⟩ = ⟨κσ|z| |∂²V/∂S²| S²√Δt⟩ = ⟨|z|⟩ κσS² |∂²V/∂S²| √Δt = √(2/π) κσS² |∂²V/∂S²| √Δt   (14.7)

using an earlier result that ⟨|z|⟩ = √(2/π).

It follows that

⟨ΔΠ⟩ = (∂V/∂t + (1/2)σ²S² ∂²V/∂S² − κσS²√(2/(πΔt)) |∂²V/∂S²|) Δt   (14.8)

But the expected return on a bank deposit is:

r(V − S ∂V/∂S) Δt

thus:

(∂V/∂t + (1/2)σ²S² ∂²V/∂S² − κσS²√(2/(πΔt)) |∂²V/∂S²|) Δt = r(V − S ∂V/∂S) Δt

whence:

−∂V/∂t = rS ∂V/∂S + (1/2)σ²S² ∂²V/∂S² − κσS²√(2/(πΔt)) |∂²V/∂S²| − rV   (14.9)

Now recall that:

∂²V/∂S² = Γ   (8.98)

Thus, after re-hedging, what is left is a correction proportional to the degree of mishedging of the hedged portfolio. This involves the gamma of V and the time interval Δt.

Eqn. 14.9 is a nonlinear partial differential equation that is also valid for a portfolio of options–it is the presence of transaction costs that destroys the linearity of the Black–Scholes equation.

Consider the following example. Suppose a trader holds two calls, one long, the other short, on an asset S with the same expiration date t_e and strike price E. The net position is therefore zero. If these are embedded in a large book of options he may overlook the cancellation effect and hedge them separately. Because of transaction costs he will lose money at each re-hedge on both options, and at expiry will have a negative net balance, since the two payoffs cancel but the transaction costs remain. On the other hand, if the trader recognizes the cancellation effect he never rehedges and will have a zero balance at expiry.

Consider next the effect of costs on a single option held long. It follows from the Black–Scholes formulas for calls and puts that:

Γ > 0

for a single call or put held long in the absence of transaction costs. Suppose this is also true in the presence of transaction costs, i.e.

|∂²V/∂S²| = ∂²V/∂S²   (14.10)

in eqn. 14.9, and let

σ̃² = σ² − 2κσ√(2/(πΔt))   (14.11)

Eqn. 14.9 can then be rewritten as:

−∂V/∂t = rS ∂V/∂S + (1/2)σ̃²S² ∂²V/∂S² − rV   (14.12)

But this is just the Black–Scholes equation with σ̃ instead of σ.

In similar fashion, for a short option position the Black–Scholes equation again obtains, except that all the signs change except for those involving transaction costs, i.e.

σ̃² = σ² + 2κσ√(2/(πΔt))   (14.13)

These results indicate that a long position in a single option is such that:

σ̃² < σ²   (14.14)

This occurs because when S increases the holder of the option must sell some assets to remain Δ-hedged, but the effect of the bid–ask spread on S is to lower its selling price, so the effective increase in S is smaller. The converse is true for a short option position.

Consider now the total transaction costs for a single option. Look at

V(S, t) − Ṽ(S, t)   (14.15)

where Ṽ is the value computed with the adjusted volatility σ̃. Assuming that κ is small, we expand about σ to obtain

(σ − σ̃) ∂V/∂σ

whence from eqn. 8.81, the Black–Scholes formula for a European call, we find the expected spread to be:

2κSΦ′(y)√(t_e − t) / √(2πΔt)   (14.16)

where y is given by eqn. 8.82.

Let

K = κ/(σ√Δt)   (14.17)

If K ≫ 1 the transaction cost term swamps the basic volatility, so the transaction costs are too large and Δt is too small–the portfolio is being re-hedged too often. Conversely, if K ≪ 1 the transaction costs have little effect on the volatility and are too small, so Δt is too large and should be decreased to minimize risk–the portfolio is not being re-hedged often enough.
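The Leland adjustment is easy to compute. The sketch below (illustrative parameter values only; the vanilla `bs_call` helper is the standard Black–Scholes formula, redefined here so the snippet stands alone) prices a long and a short call with the adjusted volatilities of eqns. 14.11 and 14.13 and prints the dimensionless number K of eqn. 14.17:

```python
import numpy as np
from scipy.stats import norm

def bs_call(S, E, r, sigma, T):
    y = np.log(S / (E * np.exp(-r * T))) / (sigma * np.sqrt(T)) + 0.5 * sigma * np.sqrt(T)
    return S * norm.cdf(y) - E * np.exp(-r * T) * norm.cdf(y - sigma * np.sqrt(T))

def leland_adjusted_vol(sigma, kappa, dt, long_position=True):
    """Eqns. 14.11 / 14.13: adjusted volatility for a long (minus) or short (plus)
    single-option position re-hedged every dt with proportional costs kappa."""
    sign = -1.0 if long_position else +1.0
    var = sigma**2 + sign * 2.0 * kappa * sigma * np.sqrt(2.0 / (np.pi * dt))
    return np.sqrt(var)

# illustrative assumptions: small proportional cost, weekly re-hedging
S, E, r, sigma, T = 100.0, 100.0, 0.05, 0.4, 0.5
kappa, dt = 0.002, 1.0 / 52.0
K = kappa / (sigma * np.sqrt(dt))                    # eqn. 14.17, here << 1
sigma_long = leland_adjusted_vol(sigma, kappa, dt, long_position=True)
sigma_short = leland_adjusted_vol(sigma, kappa, dt, long_position=False)
print(K, bs_call(S, E, r, sigma_long, T), bs_call(S, E, r, sigma_short, T))
```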

14.0.5 Portfolios of options

For a general portfolio Π the gamma Γ is not of one sign, so

|∂²V/∂S²| ≠ ∂²V/∂S²   (14.18)

In general one needs to solve the nonlinear equation 14.9 numerically. As an example we show the computation of the value of a so-called bullish vertical spread. This is bullish since the portfolio increases in value if S increases. It is vertical because it comprises a call held long with strike price E₁ and a call held short with a different strike price E₂, and it is a spread since it comprises more than one call. The payoff for such a spread is:

max(0, S − E₁) − max(0, S − E₂)

Fig. 14.1 shows the result of such a computation.

Figure 14.1: The value of a bullish vertical spread. One call held long, E₁ = 45; one call held short, E₂ = 55; σ² = 0.4, r = 0.1, T = 6 months, K = 0.25. Solid curve: with transaction costs; dashed line: without transaction costs.

Chapter 15

Time Series

15.1 Linear systems

Consider a linear system that transforms the input sequence x_t into the output sequence y_t. A system is linear if:

λ₁x₁(t) + λ₂x₂(t) → λ₁y₁(t) + λ₂y₂(t)   (15.1)

and stationary if:

x(t) → y(t) implies x(t − τ) → y(t − τ)   (15.2)

15.1.1 Time domain analysis

A stationary linear system may be written as:

y(t) = ∫_{−∞}^{∞} h(u)x(t − u) du   (15.3)

in the continuous case, and as

y_t = ∑_{k=−∞}^{∞} h_k x_{t−k}   (15.4)

in the discrete case. The weight function h(u) or h_k is the Green's function or impulse response of the system. Evidently eqns. 15.3 and 15.4 are linear. The system is said to be physically realizable or causal if:

h(u) = 0 for u < 0;   h_k = 0 for k < 0   (15.5)

Figure 15.1: A linear system with impulse response h_t.

and stable if

∑_k |h_k| < C < ∞   (15.6)

Examples

1.

y_t = (1/3)x_{t−1} + (1/3)x_t + (1/3)x_{t+1}   (15.7)

This is a non-causal moving average or MA, with impulse response:

h_k = 1/3 for k = −1, 0, 1;   0 otherwise   (15.8)

2.

τ dy(t)/dt + y(t) = x(t)   (15.9)

with impulse response:

h(u) = e^{−u/τ} for u ≥ 0;   0 for u < 0   (15.10)

so that:

y(t) = y(0)e^{−t/τ} + ∫_0^{∞} e^{−u/τ} x(t − u) du   (15.11)

3.

y_t + α₁∇y_t + α₂∇²y_t + ··· = β₀x_t + β₁∇x_t + β₂∇²x_t + ···   (15.12)

where

∇y_t = y_t − y_{t−1}   (15.13)

Eqn. 15.12 can be rewritten in the form of eqn. 15.4 by successive substitution. Thus:

y_t = a₁y_{t−1} + a₂y_{t−2} + ··· + b₀x_t + b₁x_{t−1} + ···   (15.14)

So if

y_t = (1/2)y_{t−1} + x_t   (15.15)

then

y_t = x_t + (1/2)x_{t−1} + (1/4)x_{t−2} + ···   (15.16)

with impulse response:

h_k = 2^{−k} for k = 0, 1, ···;   0 for k < 0   (15.17)

4. In the case of a simple delay we have

y_t = x_{t−d}   (15.18)

with

h_k = 1 for k = d;   0 otherwise   (15.19)

and for a simple gain:

y_t = g x_t   (15.20)

with

h_k = g for k = 0;   0 otherwise   (15.21)

In continuous time these reduce to:

y(t) = x(t − τ), h(u) = δ(u − τ);   y(t) = g x(t), h(u) = g δ(u)   (15.22)

5. An important class of causal systems has the impulse response (fig. 15.2):

h(u) = τ^{−1} g e^{−(u−Δ)/τ} for u > Δ;   0 for u < Δ   (15.23)

Figure 15.2: The impulse response h_t.

In general the impulse response has the property that if

x_t = 1 for t = t₀;   0 otherwise

then

y_t = ∑_k h_k x_{t−k} = h_{t−t₀}

In similar fashion the step response, defined by:

S(t) = ∫_{−∞}^{t} h(u) du   (15.24)

in the continuous case, and by

S_t = ∑_{k≤t} h_k   (15.25)

in the discrete case, has the property that if:

x_t = 1 for t ≥ t₀;   0 for t < t₀

then

y_t = ∑_k h_k x_{t−k} = ∑_{k≤t−t₀} h_k = S_{t−t₀}

Figure 15.3: The step response S_t.

For the delayed exponential of eqn. 15.23,

S(t) = g(1 − e^{−(t−Δ)/τ}),   t > Δ   (15.26)

15.1.2 Frequency domain analysis

Let

h(ω) = ∫_{−∞}^{∞} h(u) e^{−iωu} du,   (0 < ω < ∞)   (15.27)

and

h(ω) = ∑_k h_k e^{−iωk},   (0 < ω < π)   (15.28)

h(ω) is called the frequency response or transfer function.

Theorem: A sinusoidal input to a linear system gives rise, in the steady state, to a sinusoidal output at the same frequency as the input.

Proof: Let x(t) = cos ωt. Then

y(t) = ∫_{−∞}^{∞} h(u) cos[ω(t − u)] du
     = cos[ωt] ∫_{−∞}^{∞} h(u) cos[ωu] du + sin[ωt] ∫_{−∞}^{∞} h(u) sin[ωu] du
     = a(ω) cos[ωt] + b(ω) sin[ωt]
     = g(ω) cos[ωt + φ(ω)]   (15.29)

where

g(ω) = √(a²(ω) + b²(ω)),   φ(ω) = tan⁻¹[−b(ω)/a(ω)]   (15.30)

g(ω) is called the gain of the system and φ(ω) the phase shift. Similarly for x(t) = sin ωt, and for x(t) = exp[iωt], when

y(t) = g(ω)e^{iφ(ω)} x(t)   (15.31)

i.e.

y(t) = h(ω)x(t)   (15.32)

Thus we have shown that for sinusoidal steady state behavior

y(t) = h(t) ⋆ x(t) = h(ω)x(t)   (15.33)

where

h(t) ⋆ x(t) ≡ ∫_{−∞}^{∞} h(u)x(t − u) du   (15.34)

is the convolution of h with x. More generally we have, from the faltung (convolution) theorem:

y(ω) = ∫_{−∞}^{∞} y(u)e^{−iωu} du = ∫_{−∞}^{∞} h(u) ⋆ x(u) e^{−iωu} du = h(ω)x(ω)   (15.35)

However, in the sinusoidal steady state case we can use eqn. 15.33, so:

x(t) = ∑_j a_j(ω_j) e^{iω_j t}  →  y(t) = ∑_j a_j(ω_j) h(ω_j) e^{iω_j t}.

15.1.3 Gain and phase diagrams

h(ω) = g(ω)e^{iφ(ω)}   (15.36)

Figure 15.4: Filter gain characteristics (low pass and high pass).

Figure 15.5: The gain characteristic of a moving average.

Figure 15.6: The phase characteristic of a causal low pass filter.

In general φ is not uniquely determined.

Examples.

• MA:

y_t = (1/3)(x_{t−1} + x_t + x_{t+1})

h_k = 1/3 for k = −1, 0, +1;   0 otherwise

then

h(ω) = (1/3)(e^{−iω} + 1 + e^{iω}) = (1/3)(1 + 2 cos ω),   (0 < ω < π)

so

g(ω) = |h(ω)| = |(1/3)(1 + 2 cos ω)|
     = (1/3) + (2/3) cos ω for 0 < ω < 2π/3;   −[(1/3) + (2/3) cos ω] for 2π/3 < ω < π

(a low pass filter), and

φ(ω) = 0 for 0 < ω < 2π/3;   π for 2π/3 < ω < π

• Exponential filter:

h(u) = gτ^{−1}e^{−u/τ},   u > 0

h(ω) = g(1 + iωτ)^{−1} = g(1 − iωτ)(1 + ω²τ²)^{−1},   ω > 0

whence

g(ω) = |h(ω)| = √(h(ω)h*(ω)) = g(1 + ω²τ²)^{−1/2},   φ(ω) = tan⁻¹(−ωτ).

Figure 15.7: Gain and phase characteristics of h(u) = (g/τ) exp(−u/τ).

• Delay:

y(t) = x(t − Δ),   h(u) = δ(u − Δ)

h(ω) = ∫_{−∞}^{∞} δ(u − Δ)e^{−iωu} du = e^{−iωΔ},

or, with x = e^{iωt}, y = e^{iω(t−Δ)} = e^{−iωΔ}x. It follows that

g(ω) = 1

and since

h(ω) = cos ωΔ − i sin ωΔ = a(ω) − ib(ω)

tan φ(ω) = −b/a = −tan ωΔ = tan(−ωΔ)

so

φ(ω) = −ωΔ.

• More general inputs:

y(t) = ∫_{−∞}^{∞} h(u)x(t − u) du   (15.37)

So:

y(ω) = ∫_{−∞}^{∞} y(t)e^{−iωt} dt
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(u)x(t − u)e^{−iωt} du dt
     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(u)e^{−iωu} x(t − u)e^{−iω(t−u)} du dt.

But

∫_{−∞}^{∞} x(t − u)e^{−iω(t−u)} dt = ∫_{−∞}^{∞} x(t)e^{−iωt} dt = x(ω)

and

∫_{−∞}^{∞} h(u)e^{−iωu} du = h(ω),

so

y(ω) = h(ω)x(ω)   (15.38)

15.2 Stochastic processes

Consider a random or stochastic process x(t) such that:

⟨x(t)⟩ = µ(t),   ⟨(x(t) − µ(t))²⟩ = σ²,   ⟨(x(t₁) − µ(t₁))(x(t₂) − µ(t₂))⟩ = γ(t₁, t₂)   (15.39)

µ, σ² and γ are, respectively, the mean, variance and autocovariance of x(t).

15.2.1 Stationary processes

If

p(x(t₁), x(t₂), ··· , x(t_n)) = p(x(t₁ + Δ), x(t₂ + Δ), ··· , x(t_n + Δ))   (15.40)

for all t_i, i = 1, ··· , n, and all Δ, then x is said to be stationary.

If n = 1 then

p(x(t)) = p(x(t + Δ)) = p(x)

and µ(t) = µ, σ²(t) = σ².

If n = 2 and if p(x(t₁), x(t₂)) depends only on t₁ − t₂ = τ, then the autocovariance ⟨(x(t) − µ)(x(t + τ) − µ)⟩ = cov(x(t), x(t + τ)) = γ(τ). Thus the autocovariance depends only on τ. Such a process is called weakly stationary.

Let

ρ(τ) = γ(τ)/γ(0) = γ(τ)/σ²   (15.41)

be the autocorrelation function of the process. A weakly stationary stochastic process has mean µ and autocorrelation function ρ(τ).

15.2.2 Spectral analysis

A discrete stationary stochastic process x_t has a representation:

x_t = ∫_0^π cos ωt du(ω) + ∫_0^π sin ωt dv(ω)   (15.42)

where u and v are uncorrelated continuous processes and 0 < ω < π. This is called the spectral representation of x_t.

It can be shown that

γ(τ) = ∫_0^π cos ωτ dF(ω)   (15.43)

F(ω) is called the (power) spectral distribution function. It is such that:

F(ω) = 0 for ω < 0;   F(π) = ∫_0^π dF(ω) = γ(0) = σ²_x   (15.44)

For many stationary stochastic processes F(ω) is a continuous function of ω. So we can define:

f(ω) = dF(ω)/dω   (15.45)

to be the (power) spectral density function, or power density spectrum.

It follows that:

γ(τ) = ∫_0^π cos ωτ f(ω) dω   (15.46)

and conversely,

f(ω) = (1/π) ∑_{τ=−∞}^{∞} γ(τ)e^{−iωτ}   (15.47)

or, since γ(τ) is an even function:

f(ω) = (1/π)[γ(0) + 2∑_{τ=1}^{∞} γ(τ) cos ωτ]   (15.48)

Eqns. 15.46–15.48 comprise one form of the Wiener–Khinchine theorem.

For a continuous process, γ(τ) and f(ω) form a Fourier cosine pair.

We can now prove the following theorem:

Let x(t) be a (weakly) stationary stochastic process with power density spectrum f_x(ω), and let g(ω) be the gain of a stable linear system. Let y(t) be the response of such a system to x(t). Then y(t) has the power density spectrum g²(ω)f_x(ω).

To prove this, first assume that µ_x = µ_y = 0. Then

γ_y(τ) = ⟨y(t)y(t + τ)⟩
       = ⟨∫ h(u)x(t − u) du ∫ h(u′)x(t + τ − u′) du′⟩
       = ∫∫ h(u)h(u′)⟨x(t − u)x(t + τ − u′)⟩ du du′
       = ∫∫ h(u)h(u′)γ_x(τ − u′ + u) du du′   (15.49)

(all integrals running from −∞ to ∞). We now take Fourier transforms of both sides of this equation, whence from eqn. 15.47 we obtain:

f_y(ω) = (1/π) ∫ dτ e^{−iωτ} [∫∫ h(u)h(u′)γ_x(τ − u′ + u) du du′]
       = ∫∫ h(u)h(u′) [(1/π) ∫ dτ e^{−iωτ}γ_x(τ − u′ + u)] du du′
       = ∫∫ h(u)e^{iωu} h(u′)e^{−iωu′} [(1/π) ∫ dτ e^{−iω(τ−u′+u)}γ_x(τ − u′ + u)] du du′
       = h(ω)h*(ω) f_x(ω)
       = g(ω)e^{iφ(ω)} g(ω)e^{−iφ(ω)} f_x(ω)
       = g²(ω) f_x(ω)   (15.50)

Examples

• MA:

x_t = β₀Z_t + β₁Z_{t−1} + ··· + β_q Z_{t−q}   (15.51)

where Z_t is a purely random process with variance σ²_z. Then

h(ω) = ∑_{j=0}^{q} β_j e^{−iωj}   (15.52)

Since Z_t is purely random, its autocovariance function is:

γ(τ) = σ²_z for τ = 0;   0 otherwise   (15.53)

So using eqns. 15.46–15.48 we obtain:

f_z(ω) = σ²_z/π   (15.54)

It follows that:

f_x(ω) = |∑_{j=0}^{q} β_j e^{−iωj}|² (σ²_z/π)   (15.55)

• First-order MA:

x_t = z_t + βz_{t−1}   (15.56)

h(ω) = 1 + βe^{−iω}

so

g²(ω) = h(ω)h*(ω) = (1 + βe^{−iω})(1 + βe^{iω}) = 1 + 2β cos ω + β²

whence

f_x(ω) = (1 + 2β cos ω + β²) σ²_z/π   (15.57)

• An autoregressive (AR) process

A (linear) autoregression is of the form:

x_t = α₁x_{t−1} + ··· + α_q x_{t−q} + z_t   (15.58)

Thus a first-order AR process is:

x_t = αx_{t−1} + z_t   (15.59)

which may be rewritten as the linear system:

z_t = x_t − αx_{t−1}   (15.60)

with

h(ω) = 1 − αe^{−iω}   (15.61)

whence

g²(ω) = 1 − 2α cos ω + α²

and

f_z(ω) = σ²_z/π = (1 − 2α cos ω + α²) f_x(ω)

so that

f_x(ω) = (σ²_z/π)(1 − 2α cos ω + α²)^{−1}   (15.62)
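A quick numerical check of eqn. 15.62 (a sketch with arbitrary α and σ_z; the periodogram scaling is chosen to match the f(ω) = (1/π)∑γ(τ)e^{−iωτ} convention used here) simulates an AR(1) series and compares its band-averaged periodogram with the theoretical spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma_z, N = 0.6, 1.0, 100_000

# simulate a first-order AR process x_t = alpha*x_{t-1} + z_t
z = rng.normal(0.0, sigma_z, N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = alpha * x[t - 1] + z[t]

# crude periodogram, scaled so that its mean estimates f(omega) as defined above
freqs = np.fft.rfftfreq(N, d=1.0) * 2 * np.pi          # angular frequencies in (0, pi)
periodogram = np.abs(np.fft.rfft(x))**2 / (np.pi * N)

# compare a band average of the periodogram with eqn. 15.62 at a few frequencies
for w in (0.5, 1.5, 2.5):
    band = (freqs > w - 0.05) & (freqs < w + 0.05)
    theory = (sigma_z**2 / np.pi) / (1.0 - 2.0 * alpha * np.cos(w) + alpha**2)
    print(w, periodogram[band].mean(), theory)
```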

• Differentiation:

y(t) = dx(t)/dt

It follows that if x(t) = e^{iωt} then y(t) = iωe^{iωt}, so

h(ω) = iω   (15.63)

and therefore

f_y(ω) = |iω|² f_x(ω)   (15.64)

provided the system is stable, i.e. provided its output has finite variance.

Now for a continuous stationary random process the Wiener–Khinchine theorem tells us that:

γ_y(τ) = ∫_0^∞ f_y(ω) cos ωτ dω   (15.65)

whence

σ²_y = ∫_0^∞ f_y(ω) dω = ∫_0^∞ ω² f_x(ω) dω

It follows from eqn. 15.65 that:

γ″_x(τ) = −∫_0^∞ ω² f_x(ω) cos ωτ dω

and therefore that

σ²_y = −γ″_x(τ)|_{τ=0}   (15.66)

So y(t) has finite variance iff γ_x(τ) is twice differentiable at τ = 0, in which case eqn. 15.64 holds.

15.3 Linear systems identification

15.3.1 Frequency domain methods

Suppose we have the system described by the equation:

y(t) = ∫_0^∞ h(u)x(t − u) du + n(t)   (15.67)

where h(t) is unknown and n(t) is delta-correlated noise. The problem is to determine h(t).

Once again we suppose that µ_x = µ_y = 0. It follows that

⟨y(t)x(t − Δ)⟩ = ∫_0^∞ h(u)⟨x(t − u)x(t − Δ)⟩ du + ⟨n(t)x(t − Δ)⟩

i.e.

γ_xy(Δ) = ∫_0^∞ h(u)γ_xx(Δ − u) du   (15.68)

where γ_xy(Δ) is the cross-covariance function of x(t) and y(t), and n(t) and x(t) are uncorrelated.

In discrete time eqn. 15.68 becomes:

γ_xy(Δ) = ∑_{k=0}^{∞} h_k γ_xx(Δ − k)   (15.69)

Eqns. 15.68 and 15.69 are Wiener–Hopf equations of the first kind, in which the unknown function is the impulse response h(t) or h_k. It might appear that these equations can be solved by an application of the Fourier transform. However, if we require that the impulse response h(t) be causal, and therefore that h(t) ≡ 0 for t < 0, then eqn. 15.68 need only hold for Δ ≥ 0. This precludes the use of Fourier transform methods. To see this, multiply both sides of eqn. 15.68 by exp(iωΔ) and integrate over Δ > 0. The result is:

∫_0^∞ γ_xy(Δ)e^{iωΔ} dΔ = ∫_0^∞ e^{iωΔ} dΔ ∫_0^∞ h(u)γ_xx(Δ − u) du
                        = ∫_0^∞ h(u) du ∫_0^∞ e^{iωΔ}γ_xx(Δ − u) dΔ
                        = ∫_0^∞ h(u)e^{iωu} du ∫_{−u}^∞ e^{iωs}γ_xx(s) ds

where Δ − u = s. We see that the last integral has a lower limit −u that is not fixed. This precludes the direct use of the Fourier transform. However, since u ≥ 0, the last integral would not involve u if γ_xx(Δ) = 0 for Δ < 0. In general this is not possible, since γ_xx(Δ) is a correlation function. But we can replace γ_xx(Δ) by a function which vanishes for negative Δ. We therefore introduce functions ψ₁(Δ) and ψ₂(Δ) and a function α(Δ) such that:

ψ₁(Δ) = 0 for Δ < 0,   ψ₂(Δ) = 0 for Δ > 0   (15.70)

γ_xx(Δ) = ∫_{−∞}^{∞} ψ₂(u)ψ₁(Δ − u) du   (15.71)

and

γ_xy(Δ) = ∫_{−∞}^{0} α(Δ − u)ψ₂(u) du   (15.72)

One can then show that

α(Δ) = ∫_0^∞ ψ₁(Δ − u)h(u) du,   Δ > 0   (15.73)

Thus we have only to solve eqn. 15.73 to obtain the causal impulse response h(t). But this can now be done via the Fourier transform, since eqn. 15.73 has the same form as eqn. 15.68 except that ψ₁(Δ) = 0 for Δ < 0. Of course the functions α(Δ) and ψ₁(Δ) remain to be determined. These were found originally by Wiener in terms of the correlation functions γ_xx(Δ) and γ_xy(Δ).

15.3.2 Time domain methods

We consider the discrete time version of eqn. 15.67 in the form:

y_t = ∑_{k=0}^{∞} h_k x_{t−k} + n_t   (15.74)

where x_t and y_t are mean-corrected time series, and the sequence h_k is to be determined.

To solve this equation we follow Box and Jenkins and consider first a time series model that combines MA and AR processes into an autoregressive moving average or ARMA model of x_t:

x_t = α₁x_{t−1} + ··· + α_p x_{t−p} + z_t + β₁z_{t−1} + ··· + β_q z_{t−q}   (15.75)

This can be written in the form:

φ(B)x_t = θ(B)z_t   (15.76)

where

φ(B) = 1 − α₁B − ··· − α_p B^p,   θ(B) = 1 + β₁B + ··· + β_q B^q   (15.77)

and

Bx_t = x_{t−1}   (15.78)

is the backward shift operator.

• For a stationary process the α_i are such that the roots of φ(B) = 0 lie outside the unit circle, and if the process is also invertible the β_i are such that the roots of θ(B) = 0 also lie outside the unit circle.

• ARMA processes are useful in that a stationary time series may often be described by such a process or model with fewer parameters than either a pure AR or MA process.

It follows that

θ^{−1}(B)φ(B)x_t = z_t   (15.79)

This transforms the input x_t into the purely random process z_t, i.e. into white noise.

In similar fashion,

θ^{−1}(B)φ(B)y_t = ỹ_t   (15.80)

so the output process is also prewhitened (ỹ_t denotes the prewhitened output).

Let

h(B) = ∑_{k=0}^{∞} h_k B^k   (15.81)

Then eqn. 15.74 can be written in the form:

y_t = h(B)x_t + n_t   (15.82)

so that

ỹ_t = θ^{−1}(B)φ(B)[h(B)x_t + n_t] = h(B)z_t + θ^{−1}(B)φ(B)n_t   (15.83)

It follows that

⟨ỹ_t z_{t−m}⟩ = γ_{ỹz}(m) = h(B)γ_{zz}(m) + θ^{−1}(B)φ(B)γ_{zn}(m)   (15.84)

which can be rewritten as:

γ_{ỹz}(m) = h_m σ²_z   (15.85)

since

γ_{zz}(m) = σ²_z δ(m),   γ_{zn}(m) = 0   (15.86)

The result is that:

h_m = γ_{ỹz}(m)/σ²_z   (15.87)

This gives an estimate of the impulse response in terms of the sampled covariance of the prewhitened inputs and outputs of the system, and of the sampled variance of the prewhitened inputs.

In practice most time series are nonstationary. If this is nothing more than a simple trend, replacing x_t by the difference ∇x_t = x_t − x_{t−1} = (1 − B)x_t will remove the trend; or, in general, we replace x_t by (1 − B)^d x_t = w_t, in which case:

φ(B)w_t = θ(B)z_t   (15.88)

i.e.

φ(B)(1 − B)^d x_t = θ(B)z_t   (15.89)

This is called an integrated model, since eqn. 15.88 specifies an ARMA model for w_t which has to be integrated to provide a model for x_t. Such a model is then referred to as an integrated ARMA, or ARIMA, model with parameters p, d, and q. It is non-stationary since

φ(B)(1 − B)^d = 0

has roots on the unit circle.
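The prewhitening recipe of eqn. 15.87 is easy to try on synthetic data. The sketch below is an illustration under simplifying assumptions (the input is modelled as an AR(1) with known coefficient, so θ(B) = 1 and φ(B) = 1 − αB): it generates data from a known impulse response and recovers it from the cross-covariance of the prewhitened series.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, sigma_z = 50_000, 0.7, 1.0
h_true = np.array([0.0, 2.0, 1.0, 0.5])          # true impulse response h_0..h_3

# input: AR(1) driven by white noise z; output: y = sum_k h_k x_{t-k} + noise
z = rng.normal(0, sigma_z, N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = alpha * x[t - 1] + z[t]
y = np.convolve(x, h_true)[:N] + rng.normal(0, 0.5, N)

# prewhiten with the (assumed known) AR(1) model: phi(B) = 1 - alpha*B, theta(B) = 1
z_hat = x[1:] - alpha * x[:-1]                   # recovers z_t
y_tilde = y[1:] - alpha * y[:-1]

# eqn. 15.87: h_m = gamma_{y~z}(m) / sigma_z^2, using sample moments
sigma2_z = z_hat.var()
for m in range(len(h_true)):
    gamma = np.mean(y_tilde[m:] * z_hat[:len(z_hat) - m])
    print(m, gamma / sigma2_z, h_true[m])
```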

15.4 State–space models

A typical measurement is usually contaminated by noise, i.e. observation equals signal plus noise, or symbolically:

x_t = (h, θ)_t + n_t   (15.90)

where h and θ are (m × 1)-dimensional vectors and

(h, θ) ≡ hᵀθ   (15.91)

The vector h is assumed to be known, and the vector θ is assumed to be an unknown state vector. n_t is the observation error. Usually θ satisfies an update equation of the form:

θ_t = G_t θ_{t−1} + w_t   (15.92)

where the matrix G_t is known, and where w_t is a vector of Gaussian deviations.

Eqns. 15.90 and 15.92 form the state space model. Eqn. 15.90 is called the measurement equation and 15.92 the system equation. n_t and w_t are uncorrelated with each other and serially:

⟨n_t⟩ = 0,   ⟨n_t n_{t−τ}⟩ = σ²δ(τ)
⟨w_t⟩ = 0,   γ_w(τ) = W_τ   (15.93)

where W_τ is the covariance matrix of the multivariate Gaussian process w_t.

Regression, ARMA, and ARIMA models etc. can be expressed as state space models.

Examples

• Linear regression.

x_t = a_t + b_t u_t + n_t
a_t = a_{t−1} + w_t
b_t = b_{t−1} + w_t   (15.94)

becomes:

x_t = (h, θ)_t + n_t
θ_t = θ_{t−1} + w_t   (15.95)

with

h = (1, u)ᵀ,   θ = (a, b)ᵀ   (15.96)

• The AR(2) model can be rewritten as:

x_t = φ₁x_{t−1} + φ₂x_{t−2} + z_t   (15.97)

Let

h_t = (1, 0)ᵀ,   θ_t = (x_t, x_{t−1})ᵀ   (15.98)

and

G_t = [ φ₁  φ₂ ; 1  0 ]   (15.99)

Then the measurement equation becomes:

x_t = (h_t, θ_t)   (15.100)

with ⟨n⟩ = ⟨n²⟩ = 0, and the system equation:

θ_t = G_t θ_{t−1} + h_t z_t   (15.101)

Thus

w_t = h_t z_t   (15.102)

15.4.1 Parameter estimation

Having selected some model of a time series we need to estimate its parameters. For example, in the case of linear regression we need to estimate θ_t. One way to do this is to minimize the quantity:

ε = ∑_t ((h, θ)_t − x_t)² = ∑_t ((∑_i a_i u_i)_t − x_t)²   (15.103)

Let the number of measurements be n, i.e. t = 0, 1, 2, ··· , n − 1, and the number of state variable components be m; e.g. m = 2 corresponds to θᵀ = (a, b).

We first assume that the a_i are constant, and that w_t = 0, so:

ε = ∑_{t=0}^{n−1} (∑_{i=1}^{m} a_i u_{it} − x_t)²   (15.104)

which can be rewritten in the form:

ε = ‖Θθ − x‖²   (15.105)

where x = (x₀, x₁, x₂, ··· , x_{n−1}) and

Θ = [1 u₀; 1 u₁; 1 u₂; ··· ; 1 u_{n−1}],   θ = (a, b)ᵀ   (15.106)

But

∇_θ ‖Θθ − x‖² = ∇_θ[(Θθ − x)ᵀ(Θθ − x)] = 2(Θθ − x)ᵀ∇_θ(Θθ − x) = 2(Θθ − x)ᵀΘ

It follows that at a minimum of ε (as a function of θ)

∇_θ ε = ∇_θ ‖Θθ − x‖² = 0

and therefore:

Θᵀ(Θθ) = (ΘᵀΘ)θ = Θᵀx   (15.107)

• ΘᵀΘ is a 2 × 2 square matrix and Θᵀx is a column vector with 2 components.

• In fact ΘᵀΘ is symmetric and non-singular, so we can rewrite eqn. 15.107 as:

θ = (ΘᵀΘ)^{−1}Θᵀx   (15.108)

This is the least squares solution of eqn. 15.95.

• The matrix (ΘᵀΘ)^{−1}Θᵀ is called the generalized or pseudo inverse of Θ, in the case m ≤ n.

• The inverse matrix C = (ΘᵀΘ)^{−1} is closely related to the uncertainties of the estimated parameters θ. In fact:

(C)_{ii} = σ²(a_i),   (C)_{ij} = cov(a_i, a_j)   (15.109)

so C is the covariance matrix of θ.

• It follows that an estimate or prediction of x is given by:

x̂ = Θθ̂   (15.110)

In practice we estimate θ and then use the equation:

x̂_t = (h_t, θ̂)   (15.111)

to predict x_t.

• Let x_t be the true value and x̂_t the estimate. Then the mean-square error in the estimate is:

⟨(x̂_t − x_t)²⟩ = ⟨x̂_t²⟩ − ⟨x̂_t⟩² + (⟨x̂_t⟩ − x_t)² = var[x̂_t] + [⟨x̂_t⟩ − x_t]²   (15.112)

Thus the mean-square error in the estimate equals the variance plus the squared bias of the estimate. Typically when the bias is low, the variance is high. Choosing estimators (models) often involves a tradeoff between bias and variance.
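A compact illustration of eqns. 15.106–15.110 (synthetic data; the "true" intercept, slope and noise level below are arbitrary assumptions) builds the design matrix Θ, solves the normal equations, and reports the parameter covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
u = np.linspace(0.0, 10.0, n)
a_true, b_true, sigma = 1.5, -0.7, 0.3            # assumed "true" parameters
x = a_true + b_true * u + rng.normal(0.0, sigma, n)

Theta = np.column_stack([np.ones(n), u])                       # eqn. 15.106
theta_hat = np.linalg.solve(Theta.T @ Theta, Theta.T @ x)      # eqn. 15.108
C = np.linalg.inv(Theta.T @ Theta)                             # eqn. 15.109
x_hat = Theta @ theta_hat                                      # eqn. 15.110

print("theta_hat =", theta_hat)
print("parameter covariance =", sigma**2 * C)
print("residual rms =", np.sqrt(np.mean((x - x_hat)**2)))
```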

15.4.2 The Kalman filter

We now drop the assumption that the a_i are constants. Suppose we wish to estimate θ_t. We can do this using the Kalman filter.

Suppose we have x_{t−1}, x_{t−2}, ··· , x_{t−n}, and suppose that θ̂_{t−1} is the least mean-squares or LMS estimator for θ_{t−1} based on information available up to time t − 1. Suppose also that the covariance matrix of the estimation error,

R_{t−1} = ⟨(θ_{t−1} − θ̂_{t−1})(θ_{t−1} − θ̂_{t−1})ᵀ⟩   (15.113)

has been evaluated. It follows from eqn. 15.92 that:

θ̂_{t|t−1} = G_t θ̂_{t−1}   (15.114)

and

R_{t|t−1} = G_t R_{t−1} G_tᵀ + W_t   (15.115)

Eqns. 15.114 and 15.115 are called prediction equations. When x_t becomes available, θ̂_t can be updated. The prediction error is given as:

ε_t = x_t − (h_t, θ̂_{t|t−1})   (15.116)

An LMS solution for θ̂_t then gives:

θ̂_t = θ̂_{t|t−1} + K_t ε_t
R_t = (Λ − K_t h_tᵀ) R_{t|t−1}   (15.117)

where

K_t = R_{t|t−1} h_t / (h_tᵀ R_{t|t−1} h_t + σ²)   (15.118)

K_t is the Kalman gain matrix (although in this univariate-measurement case it is a vector), and Λ is the identity matrix. Eqns. 15.117 comprise the Kalman update equations. Eqns. 15.114–15.117 are recursive, and converge quickly.
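A minimal sketch of the recursion 15.114–15.118 (all noise levels and the drifting "true" coefficients below are illustrative assumptions) tracks a regression with slowly varying coefficients, i.e. the state-space model of eqns. 15.94–15.96 with G = I:

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, sigma_n, sigma_w = 500, 0.5, 0.01

# simulate a regression with slowly drifting coefficients (illustrative)
theta_true = np.array([1.0, -0.5])
u = rng.uniform(0.0, 5.0, n_steps)
xs, thetas_true = [], []
for t in range(n_steps):
    theta_true = theta_true + rng.normal(0.0, sigma_w, 2)   # system eqn 15.92, G = I
    h = np.array([1.0, u[t]])
    xs.append(h @ theta_true + rng.normal(0.0, sigma_n))    # measurement eqn 15.90
    thetas_true.append(theta_true.copy())

# Kalman filter, eqns. 15.114 - 15.118
theta_hat = np.zeros(2)
R = np.eye(2) * 10.0                      # vague initial uncertainty
W = np.eye(2) * sigma_w**2
for t in range(n_steps):
    h = np.array([1.0, u[t]])
    theta_pred, R_pred = theta_hat, R + W                   # prediction, G = identity
    eps = xs[t] - h @ theta_pred                            # eqn 15.116
    K = R_pred @ h / (h @ R_pred @ h + sigma_n**2)          # eqn 15.118
    theta_hat = theta_pred + K * eps                        # eqn 15.117
    R = (np.eye(2) - np.outer(K, h)) @ R_pred

print("final estimate:", theta_hat, " true:", thetas_true[-1])
```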

Example

• Linear regression. Consider eqns. 15.94 and 15.95 with constant coefficients, so that w_t = 0, G_t = Λ, and:

θ̂_{t|t−1} = θ̂_{t−1},   R_{t|t−1} = R_{t−1}   (15.119)

In such a case eqns. 15.114–15.117 reduce to a recursive way to do least squares minimization, i.e.:

Δθ̂_t = (x_t − (h_t, θ̂_{t−1})) R_{t−1}h_t / (h_tᵀ R_{t−1}h_t + σ²) = (x_t − (h_t, θ̂_{t−1})) η   (15.120)

Chapter 16

Neural nets

16.1 Logic, computation, and McCulloch–Pitts nets

About 100 years ago, prompted by the mathematician Hilbert, Whitehead & Russell attempted to formalize mathematics as a logical system. Such a system comprises:

• axioms: propositions known (or assumed) to be TRUE

• a rule of inference: e.g. if A is TRUE, and if A implies B, then B is TRUE

• formulas: which can be proved by the rule of inference to be TRUE or else FALSE

An elementary example of a logical system is Boolean logic, in which all formulas or propositions are formed by composition using only AND, OR, and NOT. So if A and B are (binary) propositions (either TRUE (T) or else FALSE (F)), then A AND B, A OR B, and NOT A are also propositions.

16.1.1 Truth tables

One can establish the truth or falsity of propositions in the following way: A AND B is TRUE if and only if A and B are both TRUE, which corresponds to the truth table:

  A   B   AND
  T   T    T
  T   F    F
  F   T    F
  F   F    F

Table 16.1: Truth table for A AND B

Similarly, A OR B is TRUE if A is TRUE, or else B is TRUE, or both A and B are TRUE, which corresponds to the truth table:

  A   B   OR
  T   T    T
  T   F    T
  F   T    T
  F   F    F

Table 16.2: Truth table for A OR B

and NOT A is TRUE if and only if A is FALSE, which corresponds to the truth table:

  A   NOT
  T    F
  F    T

Table 16.3: Truth table for NOT A

16.1.2 Venn and Peirce diagrams

There exist other representations of such propositions, related to set theory. Fig. 16.1 shows the structure of the Venn diagrams corresponding to the above propositions. These diagrams can be simplified (see fig. 16.2); the result is the Peirce diagrams shown in fig. 16.3.

Figure 16.1: Venn diagrams.

Figure 16.2: Transformation from Venn to Peirce diagram.

Figure 16.3: Peirce diagrams.

The Peirce diagrams for three and four binary variables can also be drawn, as shown in fig. 16.4.

Figure 16.4: Peirce diagrams for three and four binary variables.

These diagrams allow us to compute how many logical propositions can be composed from n binary propositions—since any logical proposition corresponds to a pattern of dots in a Peirce diagram, the number is 2^{2^n}.

One can also do Boolean algebra directly with Peirce diagrams. The Peirce diagram for the proposition A AND B shown in fig. 16.3 corresponds to the rule: put a dot in the diagram if and only if the site has dots in the corresponding sites in both the univariate A and B diagrams.

16.1.3 The Hilbert–Ackerman theorem

Since there is an obvious one-to-one correspondence between truth tables and Peirce diagrams, it is clear that the basic diagram is essentially a proposition of the form:

(A AND B) OR (A AND NOT B) OR (NOT A AND B) OR (NOT A AND NOT B)

for two propositions, and so on.

Notation: Let A, B, ··· ≡ x₁, x₂, ···; x₁ AND x₂ ≡ x₁·x₂; x₁ OR x₂ ≡ x₁ + x₂; NOT x ≡ x̄.

Theorem (Hilbert & Ackerman): Any Boolean proposition of binary variables x_i (i = 1, 2, ··· , n) can be expanded as:

f(x₁, x₂, ··· , x_n) = f(x) = ∑_{j=1}^{2^n} ζ_j X_j   (16.1)

where ζ_j = 1 or 0, and X_j is one of the basic patterns in the truth table, indexed by j = 1, 2, ··· , 2^n. Evidently there are 2^{2^n} such propositions.

16.1.4 Godel’s theorem and Turing machines

Is Boolean logic powerful enough to express all mathematical propositions?The answer is no! One needs logical quantifiers such as: ∃ (THERE EXISTS)and ∀ (FOR ALL). For example:

(∃y) such that (∀x ⊂ y) (x is true)

Boolean logic plus quantifiers is called First order predicate logic. It wasclaimed by Hilberts and others early in the twentieth century that with such alogic one could establish the truth or falsity of all mathematical propositionsor predicates, using only the rules of logical inference, i.e. that the logicsystem is complete. In 1931 Godel proved this claim to be false. Any logicalsystem powerful enough to be able to express mathematical propositions orpredicates is either:

• complete but inconsistent—some predicates will be shown to be bothTRUE and FALSE

• consistent but incomplete—not all predicates can be proved to be eitherTRUE or else FALSE, by the rules of inference of the system.

Thus there can exist true predicates which are unprovable within the sys-

tem. These can be added to the system as axioms, but then new unprovablepredicates will exist.

In 1937 Turing formalized Godel’s theorem in terms of computation, viawhat is now called a Turing machine. A Turing machine comprises

218

Page 221: Course notes on Financial Mathematics

• A (potentially) infinitely long tape on which are fields containing amark or nothing.

• A set of instructions for

1. reading a field

2. erasing and/or writing on a field

3. moving to a new field

• propositions or predicates are encoded in terms of such dot patterns

• inference rules are embodied in the set of instructions

Turing showed that new propositions or predicates can be derived fromold ones within such a system, and that proving the validity of a predicatecorresponds to a finite sequence of actions on the tape. Problems generatinginfinite sequences correspond to unprovable or undecidable predicates withinthe system. Turing also proved the existence of a universal Turing machine,whose instruction set can mimic or simulate the action of any other Turingmachine. Since this can still generate infinite sequences of tape actions, thisshows the universality of Turing’s analysis and of Godel’s theorem.

16.1.5 McCulloch–Pitts nets

In 1943 McCulloch & Pitts made the connection between Turing machines,and a simplified model of a net of nerve cells or neurons. A McCulloch–Pittsor MP–neuron differs from a real one in that time is synchronized at discreteinstants, and all the complicated biophysics of the flows of ionic currentsacross the neuron membrane is summarized in two simple formulas. LetV (n) be the voltage across the input terminals of a neuron at the discreteinstant n where n = 0, 1, 2, · · · . Suppose that there are a number of differentinputs to the neuron, each with a conductance or weight wi, i = 1, 2, · · · , N .Then the voltage appearing across the input terminals can be written as:

v(n+ 1) =

N∑

i=1

wiδ(n− ni) (16.2)

We introduce the encoding:

δ(n− ni) ≡ (xi(n) = 1) (16.3)

219

Page 222: Course notes on Financial Mathematics

So that eqn. 16.2 can be rewritten as:

v(n+ 1) =N∑

i=1

wixi(n) (16.4)

with xi(n) a binary variable.

With this encoding we can now write the firing condition for an MP–neuronas:

xi(n+ 1) = Θ [vi − vTH ]

= Θ

[N∑

j=1

wijxj(n) − vTH

](16.5)

where Θ [v] is the Heaviside step function. This condition allows us to com-pute xi(n + 1), the (binary) output of any MP–neuron in a net, given theinputs xj(n). Fig. 16.5 shows a pictorial representation of such a neuronmodel.

x1

x2

w1

w2

vTH

y

Figure 16.5: A McCulloch–Pitts neuron.

220

Page 223: Course notes on Financial Mathematics

We modify this slightly by letting:

vTH = wi j+1xj+1 = wi j+1 · 1 (16.6)

i.e. the threshold neuron is always ON, whence

xi(n+ 1) = Θ

[N+1∑

j=1

wijxj(n)

](16.7)

where xN+1 = 1 and wi N+1 = −wTH . Fig. 16.6 shows the modified model.

x1

x2

w1

w2

y

xTH

THw-

Figure 16.6: The modified McCulloch–Pitts neuron.

We rewrite the voltage or excitation v(n+1) as w1x1+w2x2−wTH1. Evidentlyby suitable choice of the weights w1, w2, wTH we can express the Booleanfunctions AND, OR, and NOT in terms of McCulloch–Pitts neurons. Theresult is shown in fig. 16.7:

221

Page 224: Course notes on Financial Mathematics

x1

x2

1

1

-21

y 1= x AND x2

x1

x2

1

1

1

y 1 2= x OR x

-1

x1-1

10

y 1= NOT x

Figure 16.7: McCulloch–Pitts neurons implementing AND, OR, and NOT.

222

Page 225: Course notes on Financial Mathematics

The Hilbert–Ackerman theorem now tells us that any Boolean function canbe expressed by combinations of these elementary functions. For example

x1 OR ELSE x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)

The result is shown in fig. 16.8:

1

1

1 1

11

-1 -1

-1

-1

-11

y(n+2)

x1(n)

x2(n)

Figure 16.8: A McCulloch–Pitts net implementing x1 OR ELSE x2.

In fact [Kleene 1951] any Boolean function can be implemented in a feedfor-

ward net with at most a delay of 2 units.

223

Page 226: Course notes on Financial Mathematics

What about quantifiers? Here McCulloch and Pitts introduced a newconcept. Consider the circuit shown in fig. 16.9.

-11

1

1x(n)

y(n+1)

y(n) = 1

y(1) = 0

Figure 16.9: McCulloch–Pitts net implementing the quantifier ∃.

Given y(1) = 0 and y(n) = 1 there must have been some time m < n forwhich x(m) = 1, i.e. (∃m)(m < n)(x(m) = 1). Similarly, if wTH = −2,then x(m) = 1, ∀m < n, i.e. (∀m)(m < n)(x(m) = 1). Thus given theconcept of an internal state, MP nets with feedback or recurrent loops canimplement the quantifiers ∃ and ∀. It follows that all first order predicatescan be implemented in a recurrent MP net, and so such a net is equivalent toa Turing machine with a finite tape. A finite net of MP neurons has a finitenumber of states and can implement most, but not all, the computationsperformed by a Turing machine. Thus MP nets can represent or simulate

almost all the processes that can be computed by a universal Turing machine.McCulloch and Pitts’ work has had important consequences:

• It triggered von Neumann’s work on programmable digital computers

• It suggested that the brain is a kind of computer, and that MP neuralnets could be used to simulate and study computations executed in thebrain.

224

Page 227: Course notes on Financial Mathematics

16.2 Perceptrons and Adalines

It is not too difficult to find sets of rules or predicates governing simple be-haviors, and to work out the relevant net architecture and weights to achievethem. However it becomes too difficult to do this for more complex behav-ior, e.g. for any brain–like behavior involving the recognition of complexpatterns. It was not until 1959 that significant progress on this problem wasmade by Rosenblatt with his analysis of how to obtain the required weightsby training.

Rosenblatt introduced a machine called a Perceptron—in effect an MPnet whose weights could be modified after interaction with a trainer, so as toconverge to the implementation of some required Boolean function. In orderto analyze the Perceptron training algorithm (and related ones) we need tolook again at MP neurons using techniques from analytical geometry andcalculus. In this respect it is helpful to introduce vector representations ofv(n), xi, and wij etc.

Thus we can express the modified version of eqn. 16.4 in the form

v(n + 1) = w1x1(n) + w2x2(n) − wTH · 1= (w, x)(n) (16.8)

where the weight vector w = (w1, w2,−wTH) and the input vector x =(x1, x2, 1).

The firing condition is then:

y(n+ 1) = Θ [(w, x)(n)]

=

1 (w, x) ≥ 00 (w, x) < 0

(16.9)

A geometric interpretation.Let

x ∈A if v ≥ 0B if v < 0

(16.10)

Evidently from eqns. 16.9 and 16.10

x ∈A if y = 1B if y = 0

(16.11)

This simply maps Boolean functions on 1, 0 into A,B. Thus suppose∃w such that the truth table looks like:

225

Page 228: Course notes on Financial Mathematics

x1 x2 1 y1 1 1 11 0 1 10 1 1 10 0 1 0

Then

(1, 1, 1), (1, 0, 1), (0, 1, 1) ∈ A

(0, 0, 1) ∈ B

We can represent this geometrically in the cartesian space (x1, x2, 1) asshown in fig. 16.10 as:

001 110

001 101

A A

AB

Figure 16.10: The geometry of a truth table [see text for details].

If we take the convex hulls1 of all points in A and B respectively, we find thepicture shown in fig. 16.11.

Evidently we can draw a line that separates A and B. Such a line is givenby the eqn.

(w, x) = 0 (16.12)

1convex combinations are of the form λ1x1+λ2x2+· · ·+λnxn where λ1+λ2+· · ·+λn = 1

226

Page 229: Course notes on Financial Mathematics

A

B

( w, x ) = 0_ _

Figure 16.11: The geometry of a truth table [see text for details].

The Boolean function satisfying this condition (x1 OR x2) is said to be lin-

early separable—for in such a case:

Co(A) ∩ Co(B) = ∅ (16.13)

Conversely the Boolean function x1 OR ELSE x2 which has the truth table(given appropriate weights):

x1 x2 1 y1 1 1 01 0 1 10 1 1 10 0 1 0

has

(1, 0, 1), (0, 1, 1) ∈ A

(1, 1, 1), (0, 0, 1) ∈ B

leading to the picture shown in fig. 16.12:

227

Page 230: Course notes on Financial Mathematics

A B

Figure 16.12: The geometry of a truth table [see text for details].

andCo(A) ∩ Co(B) 6= ∅ (16.14)

i.e. x1 OR ELSE X2 is linearly inseparable.How many linearly separable Boolean functions are there? One can show

that for large N there are only O(2N+1) such functions (these are all thefunctions that can be implemented using one MP neruon with N inputs.)Since there are 22N

functions of N variables, it follows that:

limN→∞

2N+1

22N= 0 (16.15)

i.e. the fraction of linearly separable Boolean functions is vanishingly small.

16.2.1 The Perceptron training algorithm

(Novikoff 1962)We consider an MP neuron with N inputs, so that x = (x1, x2, · · · , xN , 1)

and w = (w1, w2, · · · , wN ,−wN+1), with excitation v = (w, x) and firingcondition:

(w, a) ≥ 0, a ∈ A

(w, b) < 0, b ∈ B (16.16)

which we can refine to:

(w, a) ≥ θ > 0, a ∈ A

(w, b) ≤ −θ < 0, b ∈ B (16.17)

228

Page 231: Course notes on Financial Mathematics

Then the sets A and B are clearly linearly separable.

The algorithm:Let

wk = wk−1 + xk, if (wk−1, xk) < θ and xk ∈ A

= wk−1, if (wk−1, xk) ≥ θ and xk ∈ A

= wk−1 − xk, if (wk−1, xk) > −θ and xk ∈ B

= wk−1, if (wk−1, xk) ≤ −θ and xk ∈ B (16.18)

Thus wk is increased if it is too weak, and decreased if it is too strong,otherwise it is unchanged.

A simplification

Let

yk

=

xk if xk ∈ A−xk if xk ∈ B

(16.19)

and prune the training sequence x1, x2, x3, · · · , xk, · · · of all those inputsfor which y

k= 0, i.e. all those inputs for which the output is correct—i.e.

the inputs which belong to the sets A or else B.

Then provided there is no integer m such that:

wm = wm+1 = wm+2 · · · (16.20)

we are left with the weight update equation:

wk = wk−1 + yk

(16.21)

where(yk, wk−1) < θ, k = 1, 2, 3, · · · (16.22)

It follows from these last two equations that:

wk = w0 + y1+ y

2+ · · · + y

k(16.23)

so if w? is the desired weight vector satisfying the firing condition in eqn. 16.17:

(w?, yk) ≥ θ > 0 (16.24)

then

(w?, wk) = (w?, w0) + (w?, y1) + (w?, y

2) + · · ·+ (w?, y

k)

≥ (w?, w0) + kθ (16.25)

229

Page 232: Course notes on Financial Mathematics

It follows from the Cauchy–Schwarz inequality that:

|w?|2 |wk|2 ≥ (w?, wk)2 (16.26)

whence

|wk|2 ≥[(w?, w0) + kθ]2

|w?|2(16.27)

For large enough k this leads to the inequality:

|wk| ≥ Ck (16.28)

where C is some constant.

But from eqns. 16.21 and 16.22 we can also show that:

|wk|2 = (wk, wk)

= (wk−1 + yk, wk−1 + y

k)

=∣∣wk−1

∣∣2 + 2(wk−1, yk) +∣∣∣yk

∣∣∣2

<∣∣wk−1

∣∣2 +∣∣∣yk

∣∣∣2

+ 2θ (16.29)

whence:

|wk|2 − |w0|2 <∣∣∣y

1

∣∣∣2

+∣∣∣y

2

∣∣∣2

+ · · · +∣∣∣yk

∣∣∣2

+ 2kθ (16.30)

Since the y vectors relate directly to the input vectors x [eqn. 16.19] whichare of finite length, it follows that:

|wk|2 < |w0|2 + k(M + 2θ)

so that for large enough k, |wk|2 < D2k, or

|wk| < D√k (16.31)

On combining eqns. 16.28 and 16.31 we obtain:

Ck ≤ |wk| < D√k (16.32)

If the sets A and B are bounded this inequality becomes false for large enoughk. So there must be an integer m satisfying eqn. 16.20, i.e. the weight updatealgorithm reaches a steady state after m <∞ steps.

230

Page 233: Course notes on Financial Mathematics

x1 x2 1 y0 y?

1 1 1 0 11 0 1 0 00 1 1 1 00 0 1 0 0

Example

Consider the following situation in which y0 is the initial output value andy? is the desired value.

Initially this corresponds to the logical function (NOT x) AND y with aweight vector

w0 = (−2, 1,−1)

corresponding to that line which separates the firing set A from the non–firing set B. and we want to find a weight vector wm to give the logical

011 111

001 101x

y

-1_2

Figure 16.13: The initial state of a Perceptron.

function x AND y. So initially at k = 0 we have the previous table with thevoltage v added:

We now apply the algorithm, eqn. 16.29. Row 1 of the table shows thaty? > y, so we add 1 to all the activated weights so that:

w1 = (−1, 2, 0).

231

Page 234: Course notes on Financial Mathematics

x1 x2 1 v y y?

1 1 1 −2 0 11 0 1 −3 0 00 1 1 0 1 00 0 1 −1 0 0

Row 2 now changes to:1, 0, 1, −1, 0, 0.

Now y? = y so w2 = w1 and row 3 becomes

0, 1, 1, +2, 1, 0.

The outcome of this is that y? > y so we subtract 1 from all the activatedweights to give

w1 = (−1, 1,−1).

and row 4 becomes0, 0, 1, −1, 0, 0.

whence y? = y and w4 = w3. Geometrically the separating line is now

−1 · x1 + 1 · x2 − 1 · 1 = 0

as shown below: Thus the separating line shifts with training. If we continue

011 111

001 101x

y

-1

Figure 16.14: The state of a Perceptron after 1 cycle of training.

this process we find that it reaches a steady state, in our case after almost

232

Page 235: Course notes on Financial Mathematics

seven cycles, i.e. y → y? at m = 27, and the separating line is given by theequation:

x1 + 2x2 − 3 = 0

as shown in fig. 16.15:

x

y

011 111

001 101 3

3

2_

Figure 16.15: The steady state of a Perceptron after almost 7 cycles of training.

Interestingly, we can speed up the process by using a different encoding. Let

σ[v] = 2Θ[v] − 1

=

+1 if v ≥ 0−1 if v < 0

(16.33)

where +1 corresponds to TRUE and −1 to FALSE. Such an encoding iscalled a spin encoding. If we apply this encoding to the above example weobtain a steady state after no more than 2 cycles, m = 7, with a separatingline given by

x1 + 2x2 − 2 = 0.

16.2.2 The Adaline training algorithm

(Widrow & Hoff 1960)Shortly after the Perceptron algorithm appeared, Widrow and Hoff intro-

duced a somewhat different algorithm that uses gradient descent in weight

space (Gabor 1954) to reach a steady state. They called this the Adap-tive Linear or Adaline training algorithm. In the Perceptron algorithm the

233

Page 236: Course notes on Financial Mathematics

difference or errory? − y

is used to determine weight changes. The Adaline algorithm instead uses thefunction:

ζ =1

2(y? − v)2

=1

2(y? − (w, x))2 (16.34)

Evidently, for each input vector x, fixed or clamped, ζ is a function of theweights w1, w2, · · · , wN+1. So one solution to the programing problem is tominimize ζ as a function of these weights. This can be done, for example,by using gradient descent.

It follows from eqn. 16.34, if ζ(w1, w2, · · · , wN+1) is sufficiently smooth, thatthe change in ζ is:

∆ζ =∂ζ

∂w1∆w1 +

∂ζ

∂w2∆w2 + · · ·+ ∂ζ

∂wN+1∆wN+1

= (∇wζ,∆w)

= ∆w ∇wζ (16.35)

the directional derivative of ζ in the direction given by the vector ∆w.

So if we set:∆w = −η∇wζ, η > 0 (16.36)

then

∆ζ = −η∇wζ ∇wζ

= −η |∇wζ|2 ≤ 0 (16.37)

Thus we can define a recursion:

ζk+1 = (1 + ∆)ζk

= ζk − η |∇wζk|2

≤ ζk (16.38)

More precisely, from eqn. 16.34:

∂ζ

∂wi= − (y? − (w, x))xi

234

Page 237: Course notes on Financial Mathematics

whence∇wζ = − (y? − (w, x)) x (16.39)

So

|∇wζ|2 = (∇wζ,∇wζ)

= (y? − (w, x))2 (x, x)

= 2ζk|x|2

whence

ζk+1 = ζk − η2ζk|x|2=

(1 − 2η|x|2

)ζk (16.40)

Thus the condition for convergence of the algorithm is:

|x|2 < 1

2η(16.41)

Note that the weight changes are given by:

∆w = −η∇wζ

= η (y? − (w, x))x (16.42)

The parameter η is usually referred to as the learning rate of the algorithm.

16.2.3 Comparision of Perceptrons and Adalines

The Perceptron training algorithm can be rewrtten in the form:

∆w = η (y? − y[v])x

= η (y? − y[(w, x)]) x (16.43)

So if

y? =

y ∆w = 0> y ∆w = ηx< y ∆w = −ηx

if we use the encoding 1, 0, and if

y? =

y ∆w = 0> y ∆w = 2ηx< y ∆w = −2ηx

235

Page 238: Course notes on Financial Mathematics

if we use the spin encoding +1,−1. Thus we see that the Perceptrontraining algorithm differs from the Adaline algorithm only in that y[(w, x)]appears instead of just (w, x). For this reason the second algoritm was de-scribed as “adaptive linear”, hence the acronym Adaline.

We now apply the Adaline algorithm to the problem considered earlier. Sincethere are 3 inputs, x1, x2, and x3, |x|2 = 3, so eqn. 16.41 gives η < 1/6. Wetherefore choose a learning rate η = 1/8 whence:

∆w =1

8(y? − v)x (16.44)

The following table shows the first cycle of the algorithm:

x1 x2 x3 w1 w2 w3 v y? ζ1 1 1 −2 1 −1 −2 1 4.51 −1 1 −1.63 1.37 −0.63 −3.63 −1 3.46−1 1 1 −1.3 1.04 −0.3 2.04 −1 3.46−1 −1 1 −0.92 0.66 −0.68 −0.42 −1 0.17

Fig. 16.16 shows the variation of ζ over almost 10 cycles of training. It will

0.01

0.10

1.00

10.0

10 20 30 400

k

ε

Figure 16.16: The variation of ζ over almost 10 cycles of training.

be seen that the error ζ does not converge to zero but to an average value ofabout 0.25.

236

Page 239: Course notes on Financial Mathematics

16.2.4 Some problems with the Adaline algorithm

Consider the following situation, which is intrinsic to the setup:

x y? v1 1 1 1 w1 + w2 + w3

1 −1 1 −1 w1 − w2 + w3

−1 1 1 −1 −w1 + w2 + w3

−1 −1 1 −1 −w1 − w2 + w3

So the weight vector w must satisfy:

w1 + w2 + w3 = 1

w1 − w2 + w3 = −1

−w1 + w2 + w3 = −1

−w1 − w2 + w3 = −1

i.e. three unknowns (w1, w2, w3) and four equations. Ignoring the fourthequation we obtain

(w1, w2, w3) = (1, 1,−1)

Although this gives−w1 − w2 + w3 = −3

the solution does not correspond to the correct firing condition—i.e. the setof weights won’t generate ζ = 0 for all xi but it will get close enough.

16.2.5 A variation of the Algorithm algorithm

We formulate the training problem as follows: Given the sequence

(x1, y

?

1), (x2, y

?

2), · · · , (xL, y?L)

what is the best w that computes y?[x]? Suppose L > N+1 so that the systemof equations is over–determined, and let vk = (wk, xk) and εk = y?k−vk. Then

237

Page 240: Course notes on Financial Mathematics

the mean squared error ξ is defined as:

2ξ = L−1L∑

k=1

εTk εk

=⟨(y?k − wTxk)

T (y?k − wTxk)⟩

=⟨y?2k⟩− 2

⟨y?kx

Tk

⟩w + wT

⟨xkx

Tk

⟩w

= p− 2qTw + wTRw (16.45)

wherep =

⟨y?2k⟩, q = 〈xky?k〉 , R =

⟨xkx

Tk

⟩(16.46)

R is the correlation matrix of x. It follows that:

2∇wξ = −2q + 2Rw (16.47)

so at ξmin, ∇wξ = 0 givesRw = q (16.48)

Thus if detR 6= 0, R−1 exists and:

wmin = R−1q (16.49)

In many practical cases, R and q are not known, so wmin must be found usinggradient descent. So locally, at the k–th trial, we have from eqn. 16.45:

2ξk = y?2k − 2y?kxTkw + wTxkx

Tkw

and

2∇wξk = −2xk(y?k − xTkw

)

= −2(y?k − wTxk

)xk

= −2εkxk

Thus the learning rule is just the Adaline rule given in eqn. 16.42:

∆wk = ηεkxk (16.50)

i.e. the Adaline rule minimizes the squared error in solving for the weightvector in the over–determined Perceptron problem.

238

Page 241: Course notes on Financial Mathematics

16.2.6 The ‘XOR’ problem

For two variables the Boolean function x1 OR ELSE x2 has the Peirce dia-gram shown in fig. 16.17, and is therefore linearly inseparable. But suppose

Figure 16.17: Peirce diagram for x1 OR ELSE x2.

we add a third variable using the other two variables (and neglect any de-lays). We obtain the circuit shown in fig. 16.18. This solves the problem,

x1

t 1

x2

x3

11

-2-2

1

t 2

1

y

-1

Figure 16.18: MP–net which implements x1 OR ELSE x2.

which is known as the ‘XOR’ problem. It corresponds to the implementationof the following Peirce diagram and truth table:

239

Page 242: Course notes on Financial Mathematics

Figure 16.19: Three variable Peirce diagram for x1 OR ELSE x2.

x1 x2 x3 = x1 AND x2 y1 1 1 01 0 0 10 1 0 10 0 0 0

Table 16.4: Truth Table for x1 ORELSE x2

The geometry of the solution can be seen more clearly in the three dimen-sional space of x1, x2 and x3, as shown in fig. 16.20. Evidently in this case,a separating hyper–plane can be inserted between the firing and non–firingsets. Thus x1 OR ELSE x2 becomes linearly separable in three dimensions—i.e. if we use one MP neuron, as shown in fig. 16.21: and only the fourinput–output pairs:

(0, 0, 0 : 0), (1, 1, 1 : 0), (1, 0, 0 : 1), (0, 1, 0 : 1)

for training, we can obtain a least mean square error solution to the XORproblem.

240

Page 243: Course notes on Financial Mathematics

010

100

000

111

1

1

0

0

x1

x2

x3

Figure 16.20: Geometry of the three–dimensional implementation of x1 OR ELSEx2.

x1

x2

x3

t

y

3w

2w

1w

Figure 16.21: Three variable MP neuron implementing x1 OR ELSE x2.

241

Page 244: Course notes on Financial Mathematics

We can also solve the XOR problem in another fashion. Consider the netshown in fig. 16.22: We find that y = x3 AND (NOT x4), x3 = x1 OR x2,

x1

x3 x4

x2

t3

t1

y

t2

1 1

1 1

-1.2

0.6 -0.2

-0.5

-0.4

Figure 16.22: A two layer MP net implementing x1 OR ELSE x2.

x4 = x1 AND x2, so y =(x1 OR x2) AND (NOT(x1 AND x2)).

Note that the lines v3 = x1 +x2−0.4 = 0 and v4 = x1 +x2−1.2 = 0 separateor classify the region (x1, x2) as shown in fig. 16.23. Neither the Perceptronnor the Adaline algorithms can learn these hidden unit weights.

242

Page 245: Course notes on Financial Mathematics

x1

x2

(0,1) (1,1)

(1,0)(0,0) 0.4

0.4

1.2

1.2

1

1 0

0

Figure 16.23: Separating lines for the XOR problem.

16.2.7 Why are Perceptrons and Adaline’s useful?

• In what follows we use (more or less) the terminology found in theMATLAB Neural Network Toolbox. Let X be the set of input vectorsxi. Let T be the target set of desired outputs y?j . We refer to the orderedpairs xi, y?j as the training set Ω. Let Y be the set of actual outputsyj. Then εj = y?j − yj is the (incremental) error and ζ =<

∑j ε

Tj εj >

is the mean squared error. The goal of the training algorithms is tominimize this error.

• Generalization

After using a given training set comprising ordered pairs in Ω if onenow uses input vectors drawn from a set X which is similar to X, thenthe new outputs in Y will be similar to those in the target set T . Thesystem classifies input vectors in X as belonging to X.

• Linear Prediction

We can use this property to implement a variety of signal processingtask, e.g. linear prediction. Consider first a net comprising a singleAdaline and suppose we are given as a target set T , 5 seconds of a sine

243

Page 246: Course notes on Financial Mathematics

wave, sampled at the rate of 40 Hz, i.e. 40 samples per second. Thus

T (t) = sin[4πt], t = 0 : 0.025 : 5 (16.51)

and T is sampled every 25 msec. At any instant t we suppose the netgets the past 5 values of T (t) as inputs xi. So the set X consists of allvectors xk = xk−5, xk−4, · · · , xk−1. The desired output y?k = xk. Thetask is therefore to predict the next value of T (t) given the previous5 samples. We first note that given enough samples from the set Ωwe can solve the problem simply by minimizing ζ. The results areshown in fig. 16.24. It will be seen that the error converges rapidly

0 1 2 3 4

0.3

0.2

0.1

0.0 t

ζ

Figure 16.24: Convergence of the error ζ in predicting a sine wave..

to zero. i.e. the error minimizing computation can rapidly solve theprediction problem as it accumulates data points. The computation willnot provide convergence to zero error for nonlinear prediction problems,but the result may be good enough.

• Adaptive Linear Prediction

Predicting a sine wave can, of course, also be carried out using thePerceptron or Adaline algorithms. Such algorithms are more powerfulthan the simpler mean square error minimizer in that they can adapt to

244

Page 247: Course notes on Financial Mathematics

time varying inputs. Consider as an example the problem of predictinga frequency modulated sine wave, e.g. suppose

T (t) =

sin[4πt] if 0 ≤ t ≤ 4sin[8πt] if 4 < t ≤ 6

(16.52)

If one samples this signal at a rate of 20 Hz for the first segment and40 Hz for the second and trains an Adaline to predict T (tk) given theprevious 5 data points, with some learning rate η, the result obtainedis shown in fig. 16.25 in which the incremental error εk is plotted as afunction of t. The algorithm converges to zero error after 30 samples,

0 4

1

0

-1 t

ε

Figure 16.25: Convergence of the incremental error ε in predicting a modulatedsine wave..

equivalent to only 1.5 msecs.

• Linear System Identification

In similar fashion one can implement linear system identification. Sup-pose the input signal is

x(t) = sin[10t sin t] (16.53)

The target output is the finite impulse linear filtered response to suchan input signal, given (in MATLAB terminology and notation) as:

y?(t) = filter([1 0.5 − 1.5], 1, x) (16.54)

245

Page 248: Course notes on Financial Mathematics

Again the input is provided as three consecutive samples of x, and thetarget is the filtered output signal. The error in the prediction gives thefilter. This setup works remarkably well for linear filter identification,and can be extended to work adaptively using the Adaline trainingalgorithm in case the filter characteristics change from time to time.

16.3 Backpropagation

16.3.1 Feedforward nets

We now return to the problem of finding an algorithm that will implement thecomputation of the XOR function. The question to be answered is why don’tthe previous algorithms work? It turns out that the answer is surprisinglysimple—the functions Θ[v] and σ[v] are discontinuous—and therefore notdifferentiable. In the 1960s a new approach to neural nets was initiated(Cowan 1968) in which eqn. 16.5 is replaced by

xi(n+ 1) = S

[N∑

j=1

wijxj(n)

](16.55)

where S[v] is a smooth differentiable function of the excitation v, e.g.

S[v] = [1 + exp(−γv)]−1 (16.56)

with derivativeS ′[v] = γS(1 − S) (16.57)

or else the function

tanh

[1

2γv

]=

1 − exp(−γv)1 + exp(−γv) (16.58)

with derivative

S ′[v] =1

2γ cosh−2(

1

2γv) (16.59)

It will be evident that these functions are continuous versions of Θ and σrespectively.

In this new formulation with S given by eqn. 16.56 xi can be interpreted as:

246

Page 249: Course notes on Financial Mathematics

1. The mean rate of activation of the ith neuron

2. The probability of activation of the ith neuron

Unfortunately this model was introduced and applied to biological model-ing. It was not until the 1980’s that it was applied to the Perceptron problemby Rumelhart, Hinton, and Williams (1986). Consider, as an example of itsutility, gradient descent in weight space of the error function:

ζ =1

2(y? − y)2

=1

2(y? − S[(w, x)])2 (16.60)

whence

∆w = −η∇wζ

= η (y? − S[(w, x)])S ′[v]x

= η (y? − y(v))S ′x (16.61)

This is essentially the Perceptron learning rule, with x replaced by S ′x whereS ′ is given by eqn. 16.57.

Example

Suppose we approximate S by the piecewise linear function: i.e. this functionis:

S[v] =

1 v > 1/γγv −1/γ < v < 1/γ−1 v < −1/γ

(16.62)

Then

S?[v] =

0 v > 1/γγ −1/γ < v < 1/γ0 v < −1/γ

(16.63)

Let γ = 0.5, and η = 2, then

∆w =

12(y? − y)x −2 < v < 2

0 |v| ≥ 2(16.64)

If we apply this algorithm to our standardized test problem we find that itrapidly converges to a solution.

247

Page 250: Course notes on Financial Mathematics

x1

t 1

x2

x3

w1

w2

w3

w4

w5

w6

w7t 2

y

Figure 16.26: Architecture of MP–net which implements x1 OR ELSE x2.

The XOR problem revisited

Consider again the architecture shown in fig. 16.18 This architecture hasfour external variables x1, x2, t1 = 1 and t2 = 1. So there are only fourconfigurations as before:

(1,−1, 1, 1), (1,−1, 1, 1), (−1, 1, 1 : 1), (−1,−1, 1 : 1)

but now there are seven weights

w1, w2, w3, w4, w5, w6, w7

with two firing conditions involving:

v1 = w1x1 + w2x2 + w3

v2 = w4x1 + w6x2 + w5v1 + w7 (16.65)

248

Page 251: Course notes on Financial Mathematics

i.e.

v1 = S1[v1] = S1 [w1x1 + w2x2 + w3]

y = S2[v2] = S2 [w4x1 + w6x2 + w5v1 + w7]

= S2 [w4x1 + w6x2 + w5S1 [w1x1 + w2x2 + w3] + w7]

(16.66)

It follows that

ζ =1

2(y? − y)2

=1

2(y? − S2 [w4x1 + w6x2 + w5S1 [w1x1 + w2x2 + w3] + w7])

2

(16.67)

We now apply the Adaline learning rule (eqn. 16.36) in the form:

∆wi = −η ∂

∂wiζ

= η (y? − y)∂y

∂wi

= ηε∂y

∂wi(16.68)

But

∂y

∂w1= S ′

2w5S′1x1,

∂y

∂w2= S ′

2w5S′1x2

∂y

∂w3= S ′

2w5S′1,

∂y

∂w4= S ′

2x1

∂y

∂w5= S ′

2S1 [w1x1 + w2x2 + w3]

∂y

∂w6= S ′

2x2,∂y

∂w7= S ′

2

(16.69)

It follows from eqns. 16.68 and 16.69 that the XOR problem can now besolved via the LMS training rule. Rumelhart et. al. found the set of weights

−6.4,−6.4, 2.2,−4.2,−9.4,−4.2, 6.3

249

Page 252: Course notes on Financial Mathematics

x1

t1

x2

x3

-6.4 -4.2

-9.4

-4.2-6.4

6.3

2.2

t2

y=y*

Figure 16.27: Weight pattern of MP–net which implements x1 OR ELSE x2.

after 558 cycles throught the four patterns in the configuration. The resultingMP net is shown in fig. 16.27 Evidently if the function x1 OR ELSE x2 canbe implemented in a two layer net with one hidden unit (neglecting delays),so can the universal function NOT (x1 OR ELSE x2)—thus all the Booleanfunctions of two variables can be implemented in a two layer MP net using theLMS training algorithm. The term backpropagation, originally introduced byRosenblatt refers to the second stage of the weight update algorithm, i.e. instage one x produces v = (w, x) and then y; in stage two ζ = (1/2)(y? − y)2

is used to update y. This is the backpropagation stage—since the outputy is used to compute ∆w. It follows from the Hilbert–Ackerman expansiontheorem that since any Boolean function of N variables can be implementedin a two layer net of MP neurons, it can be ‘learned’ by such a net viabackpropagation.

250

Page 253: Course notes on Financial Mathematics

16.3.2 Recurrent nets

What about the quantifiers ∃ and ∀ and related recurrent nets? These canalso be learned via backpropagation as was first shown by Pineda (1987).We first extend eqn. 16.55 to distinguish inputs and outputs from internalactivations—we write it as:

xi(n+ 1) = Si

[N+1∑

j=1

wijxj(n) + Ii(n)

](16.70)

where Ii is the input to the ith MP neuron in the net. Let the subset of inputneurons be denoted by X and the subset of output neurons by Y . All otherneurons are then hidden. If the ith neuron is in X, then Ii 6= 0, so let

ΘiX =

1 i ∈ X0 i 3 X

(16.71)

and

ΘiY =

1 i ∈ Y0 i 3 Y

(16.72)

thenIi = ξiΘiX (16.73)

It follows that if the net is stimulated at time n with input ξi(n) there willinitially be a transient response. Eventually the net will settle into a steadystate, in which case eqn. 16.70 becomes:

xi(m) = Si

[N∑

j=1

wijxj(m) + Ii(m)

](16.74)

(We suppress the threshold or bias weights.) We apply the backpropagationalgorithm to the net. Let

εi = (x?i − xi(m)) ΘiY (16.75)

and

ζ =1

2

i

ε2i (16.76)

and use the weight update rule:

∆wij = −η ∂ζ

∂wij(16.77)

251

Page 254: Course notes on Financial Mathematics

In deriving the final formulas we need to be very careful about indices.

Consider as an example a recurrent net with only two neurons. So

ζ =1

2ε21 +

1

2ε22

=1

2(x?1 − x1(m)) ΘiY +

1

2(x?2 − x2(m))ΘiY

so

− ∂ζ

∂wij= ε1

∂ε1∂wij

+ ε2∂ε2∂wij

with εi given by eqn. 16.75 above, and xi(m) given by eqn. 16.74, i.e.:

x1(m) = S1 [w11x1(m) + w12x2(m) + I1(m)]

x2(m) = S1 [w21x1(m) + w22x2(m) + I2(m)] (16.78)

From eqn. 16.75 we have:

∂ε1∂wij

= −∂x1(m)

∂wijΘ1Y

∂ε2∂wij

= −∂x2(m)

∂wijΘ2Y (16.79)

How do we compute these derivatives? From eqn. 16.78 we see that:

dx1(m) = S ′1 [dw11x1(m) + dw12x2(m) + w11dx1(m) + w12dx2(m)]

dx2(m) = S1 [dw21x1(m) + dw22x2(m) + w21dx1(m) + w22dx2(m)]

(16.80)

or in matrix form:(dx1(m)dx2(m)

)=

(S ′

1 ·· S ′

2

)(dw11 dw12

dw21 dw22

)(x1(m)x2(m)

)

+

(S ′

1 ·· S ′

2

)(w11 w12

w21 w22

)(dx1(m)dx2(m)

)

whence after some manipulation:(dx1(m)dx2(m)

)=

(1 − S ′

1w11 −S ′1w12

−S ′2w21 1 − S ′

2w22

)−1(S ′

1 ·· S ′

2

)

(dw11 dw12

dw21 dw22

)(x1(m)x2(m)

)(16.81)

252

Page 255: Course notes on Financial Mathematics

or in symbolic form:dx(m) = L−1S ′dWx(m) (16.82)

whereL = I − S ′W (16.83)

From eqn. 16.81 we can formally obtain the derivatives as:

∂W x(m) = L−1S ′x(m) (16.84)

Example

To obtain∂

∂w11x1(m)

we set dx2(m), dw12, dw21, and dw22 = 0 in eqn. 16.81 which reduces it to:

(1 − S ′1w11dx1(m) = S ′

1dw11x1(m)

whence

∂w11x1(m) =

S ′1

1 − S ′1w11

x1(m) =(L−1

)11S ′

1x1(m) (16.85)

In general, eqn. 16.84 can be expressed in indices as:

∂wjkxi(m) =

(L−1

)ijS ′jxk(m) (16.86)

It follows from all this that

∆wjk = −η ∂

∂wjkζ

= η∑

i

εi∂

∂wjkxi(m)

= η∑

i

εi(L−1

)ijS ′jxk(m) (16.87)

Letyj(m) =

i

εi(L−1

)ijS ′j (16.88)

253

Page 256: Course notes on Financial Mathematics

whence ∆wjk can be expressed as:

∆wjk = ηyj(m)xk(m)

or after permuting indices, as:

∆wij = ηyi(m)xj(m) (16.89)

We further note that eqn. 16.88 gives:

yi(m) =∑

j

εj(L−1

)jiS ′i (16.90)

In matrix form this is yT (m) = εT (L−1)T S ′ whence y = S ′L−1ε = L−1S ′ε

since S ′ is diagonal. Thus Ly(m) = S ′ε. But from eqn. 16.83, L = I − S ′Wso that (I − S ′W)y(m) = S ′ε or:

y(m) = S ′ (Wy(m) + ε)

(16.91)

or in indices:

yi(m) = S ′i

(∑

j

wjiyj(m) + εi

)(16.92)

We note that this is the steady state version of the equation:

yi(n + 1) = S ′i

(∑

j

wjiyj(n) + εi

)(16.93)

In the limit as n→ ∞ we obtain eqn. 16.92.

Thus the recurrent backpropagation algorithm comprises three parts:

1. Solve eqn. 16.70 forxi(m), Ii = ξiΘiX

2. Solve eqn. 16.93 for

yi(m), εi = (x?i − xi(m)) ΘiY

3. Update the weights wij via eqn. 16.89:

∆wij = ηyi(m)xj(m)

254

Page 257: Course notes on Financial Mathematics

This generalizes the Rumelhart et.al. algorithm to recurrent nets.

We can rewrite this derivation entirely in vector notation. We start from thenet update equation in the form:

xn+1 = S(Wxn + ξΘX

)

together with the error vector ε = (x? − xm) ΘY and the error function ζ =12εT ε. The weight update equation is then calculated as

∆W = −η∇Wζ = ηεT∇Wxm = ηxmεT(L−1

)S ′ = ηxmy

T

m

so thatym

= S ′ (L−1)Tε =

(L−1

)S ′ε

orLTy

m= (I − S ′W)

Tym

= S ′ε

whenceym

= S ′(WTy

m+ ε)

leading to the need to solve

yn+1

= S ′(WTy

n+ ε)

What does the algorithm look like if there are no recurrent connections?Does it give the usual backpropagation algorithm? Consider the feedforwardnet shown in fig. 16.28 for the XOR problem (neglecting the bias voltages:)

255

Page 258: Course notes on Financial Mathematics

x1

x2

x3

w31

w32

w41

w43

w42

4x

Figure 16.28: Architecture of MP–net which implements x1 OR ELSE x2.

Then

W =

· · · ·· · · ·w31 w32 · ·w41 w42 w43 ·

(16.94)

This matrix is lower triangular, with

1, 2 ∈ X, 4 ∈ Y

and we have to solvexn+1 = S

[Wxn + ξ

]

with

ξ =

ξ1ξ2··

(16.95)

256

Page 259: Course notes on Financial Mathematics

i.e.

x1(n+ 1) = S1[ξ1]

x2(n+ 1) = S2[ξ2]

x3(n+ 1) = S3[w31x1(n) + w32x2(n)]

x4(n+ 1) = S4[w41x1(n) + w42x2(n) + w43x3(n)] (16.96)

whence

x1(m) = S1[ξ1]

x2(m) = S2[ξ2]

x3(m) = S3[w31x1(m) + w32x2(m)]

x4(m) = S4[w41x1(m) + w42x2(m) + w43x3(m)] (16.97)

with no inversion needed.

We also have to solveyn+1

= S ′(WTy

n+ ε)

with

ε =

···ε4

=

···

x?4 − x4(m)

(16.98)

i.e.

y1(n+ 1) = S ′1 (w31y3(n) + w41y4(n))

y2(n+ 1) = S ′2 (w32y3(n) + w42y4(n))

y3(n+ 1) = S ′3 (w43y4(n))

y4(n+ 1) = S ′4 (x?4 − x4(n))

(16.99)

whence

y1(m) = S ′1 (w31y3(m) + w41y4(m))

y2(m) = S ′2 (w32y3(m) + w42y4(m))

y3(m) = S ′3 (w43y4(m))

y4(m) = S ′4 (x?4 − x4(m))

(16.100)

257

Page 260: Course notes on Financial Mathematics

Finally we have the weight update equations:

∆w31 = ηy3(m)x1(m) = ηS ′3w43S

′4 (x?4 − x4(m))S1[ξ1]

∆w32 = ηy3(m)x2(m) = ηS ′3w43S

′4 (x?4 − x4(m))S2[ξ2]

∆w41 = ηy4(m)x1(m) = ηS ′4 (x?4 − x4(m))S1[ξ1]

∆w42 = ηy4(m)x2(m) = ηS ′4 (x?4 − x4(m))S2[ξ2]

∆w43 = ηy4(m)x3(m) = ηS ′4 (x?4 − x4(m))S3 [w31S1[ξ1] + w32S2[ξ2]]

(16.101)

These formula (apart from the neglected bias terms) are exactly what weobtained from ordinary backpropagation.

16.3.3 The Williams–Zipser algorithm

Pineda’s version of the recurrent backpropagation algorithm requires thatthe net settle into steady states x(m). In general, this does not occur, exceptin cases in which the weight matrix W is diagonally dominant. Williams &Zipser (1989) therefore introduced a variant of the Pineda algorithm whichworks in a more general setting.

We first rewrite eqn. 16.70 in the form:

xi(n+ 1) = Si

[∑

j∈X+N

wijzj(n)

](16.102)

where X is the set of inputs, N the set of neurons in the net, and

zi(n) =

Ii(n) i ∈ Xxi(n) i ∈ N

(16.103)

We suppose there are p inputs and N neurons. Bias weights can be includedin X. Let W be the N × (N + p) rectangular matrix of net weights. We cantherefore rewrite eqn. 16.102 in the vector form:

x(n+ 1) = S [Wz(n)] (16.104)

As before let Y (n) be the set of output neurons with:

εi(n) = (x?i − xi(n))ΘiY (16.105)

258

Page 261: Course notes on Financial Mathematics

This set can be time varying, so even

x?i ≡ x?i (n) (16.106)

i.e. the target values can be time varying.

Let

ζ(n) =1

2

i∈Yε2i (n) (16.107)

and

ζtotal(0, n1) =n1∑

n=1

ζ(n) (16.108)

We seek to minimize ζtotal from n = 1 to n = n1 using gradient descenton the weight matrix W, i.e., by computing:

∆W = −η∇Wζtotal(0, n1) (16.109)

It follows from eqn. 16.108 that one way to do this is by accumulating valuesof ∇Wζ(n) for each time instant along the trajectory from 0 to n1 so that:

∆wjk =

n1∑

n=1

∆wjk(n)

= −αn1∑

n=1

∂wjkζ(n)

= α

n1∑

n=1

i∈Yεi(n)

∂wjkxi(n) (16.110)

But from eqn. 16.102 it follows that:

∂wjkxi(n + 1) = S ′

i

(∑

l∈Ywil

∂wjkxl(n) + δjizk(n)

)(16.111)

[Compare with eqn. 16.84]

Since we assume that the initial state z(0) has no functional dependence onthe weights we can assume that:

∂wjkxi(0) = 0 (16.112)

259

Page 262: Course notes on Financial Mathematics

These equations hold for all i, j ∈ Y, k ∈ X +N .

Now let:

yijk(n) =∂

∂wjkxi(n) (16.113)

[Compare with eqn. 16.88]

So eqn. 16.111 can be rewritten as:

yijk(n+ 1) = S ′i

(∑

l∈Ywily

ljk(n) + δjizk(n)

)(16.114)

with initial conditionyijk(0) = 0 (16.115)

and eqn. 16.110, the weight update equation, becomes:

∆wjk(n) = α∑

i∈Yεi(n)yijk(n) (16.116)

A final application of eqn. 16.110 gives:

∆wjk = αn1∑

n=1

i∈Yεi(n)yijk(n)

= α

n1∑

n=1

i∈Y(x?i (n) − xi(n)) ΘiY y

ijk(n) (16.117)

• Real time recurrent learning

The above algorithm was derived given weights that remain essentiallyfixed during the epoch (1, n1). One can also update the weight changesonline via eqn. 16.116. Provided the learning rate α is sufficientlysmall, i.e. the time constant of ∆w is much less than that of ∆x, therecurrent net can be trained online.

16.3.4 Another derivation of the Williams–Zipser al-

gorithm

There is another way to derive the Williams–Zipser algorithm (Pineda 1988)which is perhaps more transparent. Before describing it, we need to recall

260

Page 263: Course notes on Financial Mathematics

some aspects of matrix calculus. In particular we note that the matrix expo-

nential is defined by the Taylor series:

eA =∞∑

k=0

Ak

k!(16.118)

It follows from this definition that:

d

dteA =

dAdt

· eA (16.119)

and that if

K(t, t0) = exp

[∫ t

t0

L(τ)dτ

](16.120)

thenK−1(t, t0) = K(t0, t) (16.121)

andd

dtK(t, t0) = LK(t, t0) (16.122)

We can use these definitions and results to show that the matrix differentialequation:

d

dtP(t) = L(t)P(t) + S(t) (16.123)

where P(t0) = P0 has a formal solution of the form:

P(t) = K(t, t0)P(t0) +

∫ t

t0

K(t, τ)S(τ)dτ (16.124)

Similarly, if the final condition P(tf) = Pf is specified, the formal solution is:

P(t) = K(t, tf)P(tf) −∫ tf

t

K(t, τ)S(τ)dτ (16.125)

These results can be easily verified by direct matrix differentiation. Thematrix K is of course the matrix Green’s function. It is also referred to asthe propagator or transition matrix.

Consider now the differential equation version of eqn. 16.70 in the form:

d

dtxi = −xi + S

[∑

j

wijxj + Ii

](16.126)

261

Page 264: Course notes on Financial Mathematics

The problem is to find a set of weights w which cause the trajectory x(t) ofthe system to satisfy xi(t) = x?i (t) for some set of components i ∈ Y . Thiscan be solved by minimizing the expression

ζ [w] =1

2

i∈X

∫ tf

t0

dt [x?i (t) − xi(t)]2 (16.127)

This expression is a function of the matrix w and a linear functional of thetrajectory x(t). To minimize it we need to calculate the derivative ∂ζ/∂wkl.Differentiating eqn. 16.127 we obtain:

∂ζ

∂wkl=

i∈X

∫ tf

t0

dt [x?i (t) − xi(t)]∂xi∂wkl

=∑

i

∫ tf

t0

dtεiPikl (16.128)

where εi is defined as in eqn. 16.105 and

P ikl =

∂xi∂wkl

(16.129)

To evaluate P ikl we differentiate eqn. 16.126 with respect to w. After some

algebra we obtain the equation:

dP ikl

dt=∑

j

LijPjkl + Sikl (16.130)

withLij = −δij + S ′(ui)wij (16.131)

whereui =

j

wijxj (16.132)

andSijk = δijxk (16.133)

Eqn. 16.130 may be directly integrated as in the original derivation of theWilliams–Zipser algorithm. However this is computationally expensive—O(N4) in space–time. Alternatively one can use a backpropagation algo-rithm (Perlmutter 1988). To do this we first note that the formal solution of

262

Page 265: Course notes on Financial Mathematics

eqn. 16.130 can be written as:

P ikl =

j

Kij(t, t0)Pjkl(t0) +

∫ t

t0

dτKij(t, τ)Sjkl(τ) (16.134)

where

K(t2, t1) = exp

[∫ t2

t1

L(τ)dτ

](16.135)

We need an initial condition to provide a unique solution. Without loss ofgenerality we can assume P i

kl(t0) = 0. It then follows that

P ikl(t) =

∫ t

t0

dτKik(t, τ)xl(τ) (16.136)

We substitute this solution into eqn. 16.128 to obtain:

∂ζ

∂wkl=

i

∫ tf

t0

dt εi

∫ t

t0

dτKik(t, τ)xl(τ)

=∑

i

∫ tf

t0

dt εi

∫ tf

t0

dτΘ(t− τ)Kik(t, τ)xl(τ) (16.137)

where

Θ(t− τ) =

1 t > τ0 t ≤ τ

(16.138)

We can now interchange the order of integration in this expression to obtain:

∂ζ

∂wkl=∑

i

∫ tf

t0

dτS ′(uk)ykxl(τ) (16.139)

where

yk(τ) =∑

i

∫ tf

t0

dtΘ(t− τ)Kik(t, τ)εi(t)

=∑

i

∫ tf

τ

dtKik(t, τ)εi(t) (16.140)

It follows that y(t) is a solution of the differential equation

d

dty(t) = LTy(t) + ε(t) (16.141)

263

Page 266: Course notes on Financial Mathematics

iff y(tf) = 0. From the form of L we see that this is just the usual backprop-agation equation, i.e.

d

dtyi = −yi +

j

wjiS′(uj)yj + εi(t) (16.142)

Eqn. 16.142 must be integrated backwards from the final state y(tf) = 0.

In summary, eqn. 16.126 must first be integrated forwards in time to generatethe trajectory x(t), then eqn. 16.142 must be integrated backwards in timeto generate the error signal y(t). Finally the gradient ∂ζ/∂wkl must becomputed from eqn. 16.139. The result is used to update the weights andthe process is repeated.

16.4 Unsupervised learning

The backpropagation algorithm is an example of learning with a trainer.By contrast reinforcement learning is an example of learning with a criticwhich provides just one bit of information about the success or failure of thealgorithm. The reader is referred to Sutton & Barto (1996) for a descriptionof such an algorithm. Here we discuss some examples of algorithms whichare unsupervised, and receive no error feedback of any kind.

16.4.1 Principal Component Analysis

As a first example we consider Principal Component Analysis or PCA. Thisis an algorithm that operates on the set X of inputs x and decorrelates them.Let

x = (x1, x2, · · · , xN) , 〈x〉 = 0, and⟨xxT

⟩= R[x] (16.143)

with R the covariance matrix of x.

We wish to find a set of vectors y in M ⊆ N –dimensional space that accountsfor as much as possible of the variance of the input set X. The vectors yare called the M first principal components. The first vector y

1is in the

direction of maximum variance, the second vector y2

is orthogonal to y1

andis in the direction of maximum variance of the remainder of the set X, andso on.

264

Page 267: Course notes on Financial Mathematics

Let M = N andy = W x (16.144)

ThenyyT = (W x) (W x)T = W xxT WT

SoR[y] =

⟨yyT

⟩= W

⟨xxT

⟩WT = WR[x]WT

If W diagonalizes R[x] then

W R[x] WT = Λ

whenceR[y] = Λ (16.145)

Thus the covariance matrix of the set Y is the diagonal matrix:

Λ =

λ1

λ2

··λM

i.e. the eigenvalues of R[x] comprise Λ and their associated eigenvectors givey.

This procedure works at the level of the covariance R[x]. It is thereforean exact solution to the problem of decorrelating the multivariate Gaussiandensity P (x1, x2, · · · , xN ) to give:

P (y1, y2, · · · , yN) =

N∏

i=1

P (yi) (16.146)

In such a case diagonalizing the covariance matrix of the vectors in the setX is sufficient to produce a set of vectors Y whose underlying probabilitydensity is completely factorized.

265

Page 268: Course notes on Financial Mathematics

W

x_

_y

. .

. .

Figure 16.29: Architecture of neural net which implements PCA.

16.4.2 A neural net implementation of PCA

PCA can be easily implemented with neural nets. Consider the simple net ar-chitecture shown in fig. 16.29 in which the neurons are linear as in eqn. 16.144.Consider a single output neuron in which case we can write eqn. 16.144 inthe form

yi =∑

j

wijxj

or (dropping the index i):

y =∑

j

wjxj = (w, x) = wTx = v (16.147)

the input voltage to the neuron.

We now introduce an unsupervised learning rule. We suppose that the weightvector w changes according to the rule (Hebb 1949)

∆w = ηxv = ηxy (16.148)

This is a local rule. The weight change depends only on the product of thelocal input x and the local voltage v or output y of the neuron. [There isnow accumulating evidence that such weight changes occur in the brain.] Itfollows from eqns. 16.148 and 16.147 that

∆w = ηx (w, x) = ηx (x, w) = ηxxTw (16.149)

266

Page 269: Course notes on Financial Mathematics

and therefore that:〈∆w〉 = η

⟨xxT

⟩w = ηR[x]w (16.150)

Thus the mean weight change depends upon the input covariance R[x].

Evidently at a fixed point or stationary equilibrium of eqn. 16.150:

〈∆w?〉 = ηR[x]w? = 0 (16.151)

i.e., the weight vector w? is an eigenvector of the covariance matrix R[x]with eigenvalue λ = 0. This implies that w? is unstable with respect toperturbations.

A more stable unsupervised learning rule was introduced by Oja (1982):

∆w = η [x− w (w, x)] (x, w) (16.152)

This rule incorporates weight decay given by the term:

−η (x, w)2 w = −ηy2w

It now follows that

〈∆w〉 = η⟨xxT

⟩w − ηwT

⟨xxT

⟩ww

= η[R[x] − wTR[x]w

]w (16.153)

This has a fixed point w? given by the eqn.

R[x]w? =(w?TR[x]w?

)w?

This equation is true iff w? is an eigenvector of R[x] with eigenvalue λ, i.e.iff w?TR[x]w? = λ. If the weight vectors are normalised, so that:

|w|2 = wTw = 1 (16.154)

then eqn. 16.152 converges to the fixed point solution w? at which:

R[x] → Λ

so that if the input set X is multivariate gaussian, then the output set Y isuncorrelated. Thus the linear net with weight decay implements PCA.

267

Page 270: Course notes on Financial Mathematics

16.4.3 Independent Component Analysis

As noted earlier PCA is exact if the probability density P (x1, x2, · · · , xN )is Gaussian. However if P (x1, x2, · · · , xN ) is non–Gaussian a method thatuses higher moments is needed. Such a method is Independent Component

Analysis or ICA. [Bell & Sejnowski 1995.] The method is based on Informa-tion Theory. The basic idea is to maximize the mutual information that theoutput vectors y provide about the input vectors x, namely the function:

I(x, y) = H(x) −H(x|y)= H(y) −H(y|x) (16.155)

whereH(x) = −

X

P (x) logP (x) (16.156)

is the self–information or entropy of the set X and

H(y) = −∑

X

P (y) logP (y) (16.157)

is the self–information or entropy of the set Y . In similar fashion, the func-tions

H(x|y) = −∑

X,Y

P (x, y) logP (x|y) (16.158)

andH(y|x) = −

X,Y

P (x, y) logP (y|x) (16.159)

are, respectively, the conditional entropy or equivocation of the input giventhe output, and the conditional or noise entropy of the output given theinput. In effect H(y) is the differential entropy of the set Y with respect tothe noise level, or with respect to the accuracy of discretization of x andy.Suppose now that y is a nonlinear function of x and noise, i.e.

y = g [x,W] + noise (16.160)

then∂

∂W I(x, y) =∂

∂WH(y) (16.161)

268

Page 271: Course notes on Financial Mathematics

since the noise entropy H(y|x) does not depend on W. It follows that insuch a case I(x, y) can be maximized by maximizing H(y).

Consider now a single nonlinear neuron with a sigmoidal gain function and asingle input x, and suppose x has the unimodal probability density functiongiven in fig. 16.30. How do we maximize I(x, y) and H(y)? If the function g

P(x)

x

Figure 16.30: A unimodal probability density function.

is monotonic then

P (y) =P (i)

|∂(y, x)| (16.162)

where

∂(y, x) ≡ ∂y

∂x(16.163)

and

H(y) = −∫P (y) logP (y)dy

= 〈log |∂(y, x)|〉 − 〈logP (x)〉= 〈log |∂(y, x)|〉 +H(x) (16.164)

so that to maximize H(y) we need only maximize 〈log |∂(y, x)|〉. Considertherefore a ‘training’ set of x’s to sample the pdf P (x) and set

∆w = η∂

∂wlog |∂(y, x)|

= η∂

∂w(∂(y, x))/∂(y, x) (16.165)

269

Page 272: Course notes on Financial Mathematics

In casey = [1 + exp(−wx− w0)]

−1 (16.166)

then∂y

∂x= wy(1− y)

and∂

∂w

(∂y

∂x

)= y(1 − y) (1 + wx(1 − 2y)))

so

∆w = η

(1

w+ x(1 − 2y)

), ∆w0 = η(1 − 2y) (16.167)

The effect of these rules is easily seen. If P (x) is Gaussian then the ∆w0

centers the inflexion point of the sigmoid function on the peak of P (x), andthe ∆w rule scales the slope of the sigmoid to match the variance of P (x), asshown in fig. 16.31. It follows that ∂(y, x) effectively scales with the sigmoid

P(x)

x

Figure 16.31: Lining up the sigmoid function with P (x).

function, so that P (y) is effectively constant on (0, 1). Such a distributionmaximizes the output entropy H(y).

• The learning rule, eqn. 16.167, is anti–Hebbian, and in contrast withthe Oja rule, has a weight growth term, rather than a weight decayterm.

270

Page 273: Course notes on Financial Mathematics

• The anti–Hebbian term∆w = −2ηxy

stops P (y) evolving to either δ(y) or δ(y − 1).

• The weight growth term ∆w = η/w keeps weights from becoming sosmall that P (y) evolves to δ(y − 1/2).

Similar considerations apply to the net shown in fig. 16.29. Let

y = g (Wx + w0) (16.168)

P (y) =P (x)

|J | =P (x)∣∣det[∂(y, x)

]∣∣ (16.169)

and∆W = η

[(WT )−1 + (1 − 2y)xT

], ∆w0 = η(1 − 2y) (16.170)

• In the case of a single neuron w? = 0 is an unstable fixed point, whereasin the net case, any singular weight matrix gives an unstable fixed point.This allows different output units yj to represent different features ofthe input set X.

• The matrix products WWT and W−1W−T are positive definite, pro-vided W is non–singular. In such a case we can post–multiply the righthand side of eqns. 16.170 to give the modified weight update equations

∆W = η(Λ +

[1 − 2y)(Wx)T

])W (16.171)

• When does maximizing the output entropy H(y) reduce redundancy?Consider a system of two outputs y1 and y2. Then

H(y1, y2) = H(y1) +H(y2) − I(y1, y2) (16.172)

Maximizing H(y1, y2) can be done by simultaneously maximizing H(y1)and H(y2) and minimizing I(y1, y2). When I(y1, y2) = 0, y1 and y2 arestatistically independent, so:

P (y1, y2) = P (y1)P (y2) (16.173)

The learning rule given in eqn. 16.170 maximizes H(y1, y2). In general,for super–Gaussian distributions (those with longer tails and sharperpeaks than Gaussians) with positive kurtosis I(y1, y2) is reduced.

271

Page 274: Course notes on Financial Mathematics

• If we can match g (Wx + w0) to P (y), i.e. pick g such that:

g (Wx + w0) = g(u) =

∫ u

P (y)dy (16.174)

then I is minimized. This suggests a two stage algorithm for imple-menting ICA:

1. First optimize g(u) to match P (y) in the sense given above.

2. Use the learning rules

This guarantees both maximizing H(yj) and minimizing I(y1, y2).

16.4.4 Blind source separation

A noteworthy application of ICA is blind source separation. Consider thesituation shown in fig. 16.32 in which x is an unknown linear transformationof a set of sources s. Thus

WA

x_s_ _y

..

..

..

Figure 16.32: Architecture of neural net for Blind Source Separation. See textfor details.

x = As (16.175)

where A is an unknown mixing matrix. The problem is to find A and there-fore A−1, so that s can be obtained as:

s = A−1x (16.176)

272

Page 275: Course notes on Financial Mathematics

To solve this problem we perform a nonlinear transformation of x to form:

y = g [Wx + w0] (16.177)

so thaty = g [WAs + w0] (16.178)

The matrix W is to be updated using the ICA algorithm described above.After this algorithm is applied we find that

WA → Λ

(up to a permutation of rows and columns) provided the matrix A is non–singular, and that no more than one source in s is Gaussian. Note that ICAwill work if and only if the number of sources is not greater than the numberof detectors.

16.5 Neural nets in finance

16.5.1 Some general remarks

The stock market is governed by random forces, but if traders are ratio-

nal, the efficient market hypothesis (EMH) precludes forecasting. But notall traders act rationally all the time, so there are mismatches between thetrue price and the market price. In addition, not all traders have completeinformation. So there should be some arbitrage opportunities. Howevermarket prices have defied traditional statistical tests in attempts to detectsignificant non–random deviations. Brock, Hsieh & Le Baron (1991), andMoody et. al. (1997) have shown that nonlinearities or trends do exist inmarket price variations that are not captured by random walk models or besimple regression schemes. It is sometimes asserted that the stock marketis chaotic—however the evidence for any low–dimensional attractor is veryweak (Gilmore 1993). There is better evidence for herding or flocking behav-ior (Vaga 1990)—financial time series are characterized by periods of truerandom walk interspersed with periods of coherence due to crowd behavior.

Neural nets can be thought of as a multivariate, nonlinear, nonparamet-ric, data driven, and model free inference technique. Such nets can classify,associate, store and retrieve, encode or compress data, model nonlinear phe-nomena, and generalize from training to test data. Noteworthy applications

273

Page 276: Course notes on Financial Mathematics

of neural nets to market price forecasting include those of Lapedes & Farber(1987), Weigend (1991) who used two targets (the return and its sign) andthree inputs (prices, returns, and indicators), and Refenes (1993), who usedas inputs twelve hours of (hourly) tick data, and as output the next tickvalue, in a net with thirty five hidden neurons.

16.5.2 Pre–processing the data

In formulating financial problems for neural nets, the form of the inputsmust be carefully chosen. This usually requires a substantial amount ofpreprocessing. Thus one needs to

• Check data for range and trend, the standard deviation, outliers (greaterthan five sigma), and missing elements

• Look for multiple data sources

• Check the way in which the data is represented

• Choose inputs and outputs that are relevant to the task

Possible inputs:

• p1, p2, · · · , pn ≡ pt

• pt − pt−1

• ∆pt/q, q = pt−1 or p0 or σ

• log pt/pt−1, pt > 0

• √pt, pt ≥ 0

• cos pt orsin pt

• log(pt + C), pt > −C

Indicators:

openingclosinghighestlowest

daily prices

274

Page 277: Course notes on Financial Mathematics

One can use such techniques as fuzzy logic (Zadeh 1960), genetic algorithms(Holland 1980) and Kalman filters to select suitable inputs.

Other possible inputs

• The random walk indicator

ρ =∆p(Nτ)

∆pt√N

(16.179)

where ∆pt is sampled at intervals of τ . Thus ∆pt = pt − pt−1, ∆pt =pt−pt−1 where the averaging is over M differing samples of size N , and

∆p(Nτ) =

N∑

t=1

∆pt = pN − p0.

If the process is a true (stationary, ergodic) random walk then

∆pt = (pN − p0) /N

whence ρ =√N . Otherwise ρ 6=

√N .

• The volatility indicatorUse the opening, closing, highest, or lowest prices in any set of samples.Then

vol = (h − l)(c − o)/p.

This indicator is positive in case prices are rising, and negative in casethey are falling.

Input normalization

If one has $M$ samples, each of length $N$, one can normalize the samples by

    p_k \mapsto \frac{p_k}{\sum_M p_k} \quad\text{or}\quad p_k \mapsto \frac{p_k}{\sum_N p_k},

or by some combination of the two procedures. One can also normalize a data set so that the sample values lie within a given range $[a, b]$, with

    p_k \mapsto \frac{(p_k - p_{\min})(b - a)}{p_{\max} - p_{\min}} + a.


Finally one can perform zero–offset normalization. Let

    \mathrm{Absmax} = \begin{cases} |p_{\max}|, & |p_{\max}| \ge |p_{\min}| \\ |p_{\min}|, & |p_{\max}| < |p_{\min}| \end{cases}

then

    p_k \mapsto \frac{p_k}{\mathrm{Absmax}}.
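A short sketch of the range and zero–offset normalizations just described (the sum-based normalization is analogous):

```python
import numpy as np

def range_normalize(p, a=-1.0, b=1.0):
    """Map sample values into [a, b]: (p - p_min)(b - a)/(p_max - p_min) + a."""
    return (p - p.min()) * (b - a) / (p.max() - p.min()) + a

def zero_offset_normalize(p):
    """Divide by Absmax = max(|p_max|, |p_min|); preserves the sign and the zero of the data."""
    absmax = max(abs(p.max()), abs(p.min()))
    return p / absmax
```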

Target selection and normalization

Just as with inputs, one also needs to choose the outputs or targets carefully. In general it is advantageous to minimize the number of targets. For forecasting on time series a single target works best. Wherever possible one should train on prices, and normalize to the range of the output unit.

[Figure 16.33: An autoassociator net, with input layer x, hidden layer h, and output layer y.]

Reducing the input dimension

Even after appropriate selection of inputs and targets it may still be the case that the number of inputs is quite large, and therefore the weight space has many dimensions. It is therefore necessary to reduce the number of inputs as much as possible. One way to do this is via an unsupervised variation of backpropagation known as the autoassociator. Fig. 16.33 shows the architecture of such a net. The principle of the autoassociator is straightforward: simply train the net to generate the identity

    y = x,

then use the hidden unit outputs h to represent x. If the dimension of h is chosen to be as small as possible, consistent with a small value of ‖y − x‖, then dim h < dim x.
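A minimal numpy sketch of an autoassociator: a single tanh hidden layer of dimension 3 is trained by plain gradient descent to reproduce 8–dimensional inputs, and the hidden activations then serve as the reduced inputs. The data, layer sizes, learning rate, and number of epochs are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 input vectors of dimension 8 that actually live on a 3-dimensional subspace.
latent = rng.standard_normal((200, 3))
X = latent @ rng.standard_normal((3, 8))

n_in, n_hid = X.shape[1], 3                      # dim h < dim x
W1 = 0.1 * rng.standard_normal((n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = 0.1 * rng.standard_normal((n_hid, n_in)); b2 = np.zeros(n_in)
eta = 0.01                                       # learning rate

for epoch in range(2000):
    H = np.tanh(X @ W1 + b1)                     # hidden representation h
    Y = H @ W2 + b2                              # reconstruction y
    E = Y - X                                    # train toward the identity y = x
    # Backpropagate the mean squared reconstruction error.
    dW2 = H.T @ E / len(X);  db2 = E.mean(axis=0)
    dH  = E @ W2.T * (1.0 - H**2)                # tanh derivative
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

codes = np.tanh(X @ W1 + b1)                     # use h in place of x as the reduced input
```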

There are many other pre–processing techniques, including both PCA and ICA. Some others include:

• Noise reduction via an associator net

• Application of spectral analysis adapted to non–stationary time series, e.g. time–frequency transforms. Linear versions include the finite or windowed Fourier transform and wavelets. A nonlinear version is the Wigner transform.

The time series generated by such operations may contain features that can be decorrelated using either PCA or ICA.

16.5.3 Training and testing

The training set Ω is then constructed from the pre–processed inputs and their chosen targets. Such a set should be representative: it should include both Bull and Bear markets, the order of presentation of input vectors x should be randomized, and differing initial weight vectors $w_0$ should be used for each trial.

Testing and validating the training algorithm

Once a neural net algorithm has been used to process a data set, the results have to be tested and checked for bias. Errors that occur in the training process are usually referred to as apparent errors, whereas those that occur on the test data are called true errors. Problems can occur if one over–trains on too few inputs: one can end up fitting noise. Checking for bias is referred to as validation. One procedure for doing this is known as cross–validation. In such a procedure one takes one input vector in the training set X and reserves it for testing. After training, this vector is then used to test the net. This procedure is then repeated with all remaining input vectors in X. The result is a (virtually) unbiased estimate of the performance of the algorithm, even with relatively small samples, e.g. N = 100. In financial forecasting, given N samples of a financial variable, we use about 2N/3 for training and the remainder for testing. Usually the training set Ω is taken from early portions of the financial time series, and the testing set Ψ from later portions.
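A sketch of the two ideas, the chronological 2N/3 split and the reserve-one-vector cross-validation. Here `fit` and `predict` are placeholders for whatever training and forecasting routines are in use, not a particular library's API.

```python
import numpy as np

def chronological_split(X, y, train_frac=2/3):
    """Training set Omega from the early part of the series, test set Psi from the later part."""
    cut = int(len(X) * train_frac)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])

def leave_one_out_error(X, y, fit, predict):
    """Cross-validation: reserve each input vector in turn, train on the rest, test on the reserved one."""
    sq_errors = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        model = fit(X[mask], y[mask])
        sq_errors.append((predict(model, X[i]) - y[i]) ** 2)
    return float(np.mean(sq_errors))
```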

Training and testing can be carried out in a number of ways (a sketch of the auto single stepping loop follows the list):

• Single stepping: after training to forecast one time unit ahead, test to forecast one time unit ahead.

• Iterated single stepping: use each test vector to produce a forecast of the next test vector.

• Auto single stepping: forecast the first vector in the test set, then add it to the training set and retrain, and so on.
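A sketch of the auto single stepping loop; as above, `fit` and `predict` are placeholders for the chosen training and forecasting routines.

```python
import numpy as np

def auto_single_step(X_train, y_train, X_test, y_test, fit, predict):
    """Forecast the first test vector, add it to the training set, retrain, and repeat."""
    forecasts = []
    for x, y in zip(X_test, y_test):
        model = fit(X_train, y_train)              # retrain on everything seen so far
        forecasts.append(predict(model, x))
        X_train = np.vstack([X_train, x])          # absorb the test vector into the training set
        y_train = np.append(y_train, y)
    return forecasts
```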

16.5.4 General guidelines

• For a given mean square error (MSE), a smaller net will be a better generalizer, since it will have fewer weights and so there is less chance of overtraining.

• One needs more training vectors than weights, i.e.,

    n(w_k) \ll n(i_T)    (16.180)

where $n(w_k)$ is the number of weights and $n(i_T)$ is the number of training vectors.

• Evidently there is a trade–off between training and net complexity

• Minimum description length (Rissanen 1989): after training one should end up with a net with as small a number of free parameters as possible to describe the data. One can achieve this with a cost function:

    d = \mathrm{MSE} + n(w_k)    (16.181)

• So one needs to be able to prune the weights. This can be done by a procedure known as optimal brain damage (LeCun et al. 1990), which uses weight decay to eliminate small weights.

• Keep the input layer as small as possible, so input pre–processing is indicated.

• Kolmogorov (1956): for N input units one needs one hidden layer with at least 2N + 1 hidden units with sigmoid transfer functions to compute any continuous function of N variables.

• There is also an upper bound on the number of hidden unit weights needed, given in terms of the number of training vectors (Vapnik–Chervonenkis 1993):

    n(w_k) \le \frac{n(i_T)}{10}    (16.182)

This gives a more precise upper bound on $n(w_k)$ than the earlier guideline.

• The number of output units should be as small as possible: one for time series forecasting.

• If one needs only $|P_t|$ then linear or piecewise linear neurons suffice.

Embellishments

• Adapt the learning rate η (Hertz et al. 1991)

• Train using noisy data (Elman & Zipser 1988)

• Modify the error $\varepsilon = y^\ast - y$ to be less sensitive to outliers, by using the error function

    \varepsilon \mapsto \frac{2\varepsilon}{1 + \varepsilon^2}    (16.183)


• Optimal brain damage: let the saliency of a weight be

    s_k = \left| \frac{\partial^2 d}{\partial w_k^2} \right| w_k^2    (16.184)

Remove weights with low saliency (a sketch of this pruning follows the list).

• Normalize all weights using

    w_k \mapsto \frac{w_k}{w_{\max}}    (16.185)
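A sketch of saliency-based pruning (eqn. 16.184) and weight normalization (eqn. 16.185). The diagonal second derivatives would in practice come from an estimate of the Hessian of the cost d; here they are simply passed in, and the fraction of weights kept is an arbitrary choice.

```python
import numpy as np

def prune_by_saliency(weights, second_derivs, keep_frac=0.8):
    """Optimal brain damage: saliency s_k = |d^2 d / d w_k^2| * w_k^2; zero out the least salient weights."""
    s = np.abs(second_derivs) * weights**2
    threshold = np.quantile(s, 1.0 - keep_frac)
    pruned = weights.copy()
    pruned[s < threshold] = 0.0
    return pruned

def normalize_weights(weights):
    """Divide every weight by the largest in magnitude (one reading of eqn. 16.185)."""
    return weights / np.abs(weights).max()
```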

16.5.5 An example of financial forecasting

(Azoff 1994) We consider S & P 500 futures. The data structure consists of the closing price of the contract τ days prior to the target day, c(−τ); the volatility index v(t), with v = 2.32; the random walk indicator r(t), with r = 16.79; and the spot closing price s(t). We form input vectors x(t) with components:

x01 = v(−2)
x02 = v(−2)
x03 = r(−1)
x04 = [v(−6) + v(−5) + v(−4)]/3
x05 = [v(−5) + v(−4) + v(−3)]/3
x06 = [v(−4) + v(−3) + v(−2)]/3
x07 = [v(−3) + v(−2) + v(−1)]/3
x08 = c(−15)
x09 = c(−10)
x10 = s(−3)
x11 = s(−2)
x12 = s(−1)
x13 = c(−5)
x14 = c(−4)
x15 = c(−3)
x16 = c(−2)
x17 = c(−1)
x18 = c(0)

The first seven components x01–x07 contain sign information and are normalized with zero offset. The remaining components x08–x18 are normalized to the range [−1, +1]. Randomized and time–reversed series are then constructed from the original series and mapped into pre–processed input vectors. Ten differing initial weight vectors $w_0$ are used in the training procedure. A single piecewise linear target neuron is used, with slope 0.5 and saturation ±2.5. Various methods such as adaptive steepest descent, conjugate gradient, and quasi–Newton are used to minimize the error. Seven hidden neurons are used: four tanh functions, two piecewise linear, and one cos function. The total number of weights is therefore 7 × 18 + 7 = 133. Average training and testing errors are then computed, with testing via single stepping. The result provides good forecasting of index changes.
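A forward-pass sketch of the 18–7–1 net just described. Random weights stand in for trained ones, and biases are omitted so that the weight count matches 7 × 18 + 7 = 133.

```python
import numpy as np

def piecewise_linear(u, slope=0.5, sat=2.5):
    """The piecewise linear unit described above: slope 0.5, saturating at +/- 2.5."""
    return np.clip(slope * u, -sat, sat)

def forward(x, W_hid, w_out):
    u = W_hid @ x                                        # 7 x 18 = 126 input-to-hidden weights
    h = np.concatenate([np.tanh(u[:4]),                  # four tanh hidden units
                        piecewise_linear(u[4:6]),        # two piecewise linear hidden units
                        np.cos(u[6:7])])                 # one cosine hidden unit
    return piecewise_linear(w_out @ h)                   # 7 hidden-to-output weights; 126 + 7 = 133

rng = np.random.default_rng(0)
x = rng.standard_normal(18)                              # a pre-processed input vector
y_hat = forward(x, rng.standard_normal((7, 18)), rng.standard_normal(7))
```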


16.5.6 Other applications of neural nets and related algorithms to finance

One can use neural nets to estimate option prices. Three kinds of algorithms have been used:

• Radial basis functions (RBF)

• Multilayer Perceptrons (MLP)

• Projection Pursuit Regression (PPR)

Radial basis functions are analogous to multiple regression: $y_t$ is to be ‘explained’ in terms of $x_t$, as

    y_t = f(x_t) + \varepsilon_t    (16.186)

where $\langle \varepsilon_t \,|\, x_t \rangle = 0$. The estimation problem may be viewed as the minimization of

    \sum_{t=1}^{T} \left( \| y_t - f(x_t) \|^2 + \lambda \| L f(x_t) \|^2 \right)    (16.187)

where $\|\cdot\|$ is some norm and $L$ is a differential operator. The second term penalizes deviations from smoothness.

A general solution to eqn. 16.187 can be found in the form:

    f(x) = \sum_{i=1}^{k} c_i h_i\left( \| x - z_i \| \right) + p(x)    (16.188)

where the $z_i$ are $d$–dimensional vector prototypes or centers, the $c_i$ are scalars, the $h_i(\cdot)$ are scalar functions, $p(\cdot)$ is a polynomial, and $k \ll T$. Such $f$ are called hyperbasis functions (Poggio & Girosi 1990). Here we use

    f(x) = \sum_{i=1}^{k} c_i h_i\left( (x - z_i)^T W^T W (x - z_i) \right) + \alpha_0 + \alpha_1^T x    (16.189)

with

    h(x) = \exp(-x/\sigma^2)    (16.190)

In effect $y$ is approximated as $f(x)$, given a sparse data set $(x_t, y_t)$, where $f(\cdot)$ is a sum of centered Gaussian basis functions.
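A least-squares sketch of eqns. 16.188–16.190, under the simplifying assumptions that W is the identity and the centers z_i are fixed in advance (e.g. chosen by clustering); in general W and the z_i are estimated as well.

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    """Columns: Gaussian basis functions exp(-||x - z_i||^2 / sigma^2), plus 1 and x for the linear term."""
    sq_dists = ((X[:, None, :] - centers[None, :, :])**2).sum(axis=2)
    Phi = np.exp(-sq_dists / sigma**2)
    return np.hstack([Phi, np.ones((len(X), 1)), X])

def fit_rbf(X, y, centers, sigma=1.0):
    """With the centers and W fixed, the c_i, alpha_0 and alpha_1 enter linearly: ordinary least squares."""
    A = design_matrix(X, centers, sigma)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_rbf(coef, X, centers, sigma=1.0):
    return design_matrix(X, centers, sigma) @ coef
```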


• In cases where one knows the range of the targets it is computationally more efficient to apply some nonlinear function to $f(x)$, e.g. the sigmoid

    g(u) = \left( 1 + e^{-u} \right)^{-1}

• Thus for given $(x_t, y_t)$, radial basis function approximation amounts to estimating the parameters of the radial basis function net.

As described earlier, multi–layer Perceptrons (MLP) may be written as:

    f(x) = h\left( \sum_{i=1}^{k} c_i h(d_{0i} + d_i^T x) + c_0 \right)    (16.191)

where $h(u) = g(u)$.

• Again one has to do parameter estimation on the $c$'s and $d$'s, i.e. the weights.

Projection pursuit regression (PPR) was introduced by Friedman and Tukey (1974) to find ‘interesting’ directions in an input space $(x_1, x_2, \ldots, x_n)$ by projecting input vectors $x$ onto some direction vector $w$, forming $w^T x$, and looking at the properties of the projection. PPR is a regression on $w^T x$, i.e.

    f(x) = \sum_{i=1}^{k} c_i h_i(d_i^T x) + c_0    (16.192)

where the $h_i$ are also to be estimated.

• All these algorithms can be used to approximate arbitrary functions $f(x)$ to some level of accuracy, provided the functions are smooth enough.

16.5.7 Learning the Black–Scholes formula

(Hutchinson, Lo, and Poggio 1994) Assume the interest rate r and the volatility σ are fixed throughout the training sample. Also assume that the statistics of price returns are independent of the price S. This means that one need only look at (S/E, 1) rather than (S, E). Thus we have to estimate a function f(S/E, 1, τ), where τ is the time to expiry. First generate sample paths of S and the associated option prices C. Use the standard equation

    dS = \mu S\,dt + \sigma S\,dW

for S, and Chicago Board Options Exchange (CBOE) rules for C. About 6000 (S, C) data points suffice, using a net of the form shown in fig. 16.34.
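A sketch of the training-data generation, with Black–Scholes prices standing in for the option prices C (an assumption made here for concreteness); the helpers `gbm_path` and `bs_call` and all parameter values are illustrative, not taken from the source.

```python
import numpy as np
from math import erf, exp, log, sqrt

def gbm_path(S0, mu, sigma, T, n_steps, rng):
    """Euler simulation of dS = mu S dt + sigma S dW."""
    dt = T / n_steps
    dW = rng.standard_normal(n_steps) * sqrt(dt)
    return S0 * np.cumprod(1.0 + mu * dt + sigma * dW)

def bs_call(S, E, r, sigma, tau):
    """Black-Scholes price of a European call with strike E and time to expiry tau."""
    if tau <= 0:
        return max(S - E, 0.0)
    N = lambda d: 0.5 * (1.0 + erf(d / sqrt(2.0)))
    d1 = (log(S / E) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * N(d1) - E * exp(-r * tau) * N(d2)

rng = np.random.default_rng(0)
E, r, sigma, T, n_steps = 100.0, 0.05, 0.2, 0.5, 126
S = gbm_path(100.0, 0.1, sigma, T, n_steps, rng)
taus = T - np.arange(1, n_steps + 1) * (T / n_steps)       # time to expiry at each step
samples = [(s / E, tau, bs_call(s, E, r, sigma, tau) / E)  # (S/E, tau, C/E) training triples
           for s, tau in zip(S, taus)]
```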

[Figure 16.34: Architecture of the net that learns the Black–Scholes formula: inputs S/E and τ feed a hidden layer h, whose outputs produce y = f(x), an estimate of C/E.]

• If $V(S, t)$ is a replicating portfolio, and if

    \eta = e^{-r\tau} \left\langle V(S, t)^2 \right\rangle^{1/2}

is the prediction error, then one can compare estimators from the neural net and from the Black–Scholes formula.

• In the situation described above the Black–Scholes formula is actually a better predictor than any of the learning algorithms. However, the algorithms do perform nearly as well.


• If the learning algorithms are now trained on real data (S & P 500 futures and futures options from January 1987 through December 1991, using estimates of r and σ from yields and observations), they outperform the Black–Scholes formula. This is not surprising, since r and σ are no longer constant in the real data.

• One can also do this using level dependent volatility (White 1995).
