
Step by step implementation and optimization of simulations in quantitative finance

Lokman Abbas-Turki

UPMC, LPMA

10-12 October 2017


Plan

Introduction

Local volatility challenges GPUs

Monte Carlo and OpenACC parallelization

Monte Carlo and CUDA parallelization

PDE formulation & Crank-Nicolson scheme

PDE simulation and CUDA parallelization



Introduction

What makes bankers change their mind?
- The CVA (Credit Valuation Adjustment) and XVA (X = C, D, F, K, M Valuation Adjustment) are tipping-point applications.
- FRTB (Fundamental Review of the Trading Book) is the other important application, with a deadline in 2019.
- Electronic trading and deep learning.
- Lloyd Blankfein declared about Goldman Sachs: "We are a technology firm".

HPC in banks
- From distribution to parallelization.
- From small to big nodes.
- The efficiency of GPUs becomes undeniable.
- Use of .NET, C, C++ and C#.

Remaining challenges and fears
- Code management.
- Possible conflicts within quant teams.
- Can we extend the results for toy models to more general models?



Local volatility challenges GPUs

Local volatility from implied volatility

First array: $r_g$
$$dS_t = S_t\, r_g(t)\,dt + S_t\,\sigma_{\mathrm{loc}}(S_t,t)\,dW_t, \qquad S_0 = x_0.$$

- $S$ is the stock price process, where $x_0$ is the spot price
- $W$ is a Brownian motion with $W_0 = 0$
- $r_g$ is the risk-free rate, assumed piecewise constant
- $\sigma_{\mathrm{loc}}(x,t)$ is a local volatility function $\mathbb{R}^*_+ \times \mathbb{R}_+ \to \mathbb{R}^*_+$

Dupire equation: given a family $(C(K,T))_{K,T}$ of call prices with strike $K$ and maturity $T$,
$$\sigma^2_{\mathrm{loc}}(K,T) = 2\,\frac{\partial C/\partial T + K\, r_g(T)\,\partial C/\partial K}{K^2\,\partial^2 C/\partial K^2}$$

From implied to local

Using the Black & Scholes implied volatility $\sigma_{\mathrm{imp}}(x,t)$, Andersen and Brotherton-Ratcliffe (1997) showed that
$$\sigma^2_{\mathrm{loc}}(x,t) = \frac{2\,\frac{\partial \sigma_{\mathrm{imp}}}{\partial t} + \frac{\sigma_{\mathrm{imp}}}{t} + 2 x r_g(t)\,\frac{\partial \sigma_{\mathrm{imp}}}{\partial x}}{x^2\left[\frac{\partial^2 \sigma_{\mathrm{imp}}}{\partial x^2} - d_+\sqrt{t}\left(\frac{\partial \sigma_{\mathrm{imp}}}{\partial x}\right)^{2} + \frac{1}{\sigma_{\mathrm{imp}}}\left(\frac{1}{x\sqrt{t}} + d_+\frac{\partial \sigma_{\mathrm{imp}}}{\partial x}\right)^{2}\right]}$$
with $d_+ = \frac{1}{2}\sigma_{\mathrm{imp}}\sqrt{t} + \left[\log(x_0/x) + \int_0^t r_g(u)\,du\right]\big/\left[\sigma_{\mathrm{imp}}\sqrt{t}\right]$.


Local volatility challenges GPUs

SVI to simulate implied volatility

+2 arrays: $T_g$, $K_g$. For $(x,t) \in K_g \times T_g$, we assume that the observed implied volatility is
$$\sigma_{\mathrm{imp}}(x,t) = \sqrt{\frac{w(x,t)}{t}}$$
where the cumulative variance is parameterized by
$$w(x,t) = \frac{\theta_t}{2}\left[1 + \rho\,\varphi(\theta_t)\left(\log\!\left(\tfrac{x}{x_0}\right) - \int_0^t r_g(u)\,du\right) + \sqrt{\left[\varphi(\theta_t)\left(\log\!\left(\tfrac{x}{x_0}\right) - \int_0^t r_g(u)\,du\right) + \rho\right]^2 + (1-\rho^2)}\right] \tag{1}$$
with $\varphi(\theta_t) = \dfrac{\eta}{\theta_t^{\gamma}(1+\theta_t)^{1-\gamma}}$ and $\theta_t = a^2\left(t + b(1 - e^{-\lambda t})\right)$.

All parameters of (1) are discussed in Gatheral & Jacquier (2013)
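For illustration, here is a minimal device-function sketch of the SVI evaluation (1). The helper int_rg (the integral of $r_g$ over $[0,t]$) and the parameter names are hypothetical placeholders, not the lab's actual code.

__device__ float int_rg(float t);   // placeholder: returns the integral of r_g over [0,t]

__device__ float svi_implied_vol(float x, float t, float x0,
                                 float a, float b, float lambda,
                                 float rho, float eta, float gam)
{
    // theta_t = a^2 (t + b (1 - e^{-lambda t}))
    float theta = a * a * (t + b * (1.0f - expf(-lambda * t)));
    // phi(theta_t) = eta / (theta^gam (1 + theta)^{1-gam})
    float phi = eta / (powf(theta, gam) * powf(1.0f + theta, 1.0f - gam));
    // forward log-moneyness: log(x/x0) - int_0^t r_g(u) du
    float k = logf(x / x0) - int_rg(t);
    // cumulative variance w(x,t), equation (1)
    float w = 0.5f * theta * (1.0f + rho * phi * k
            + sqrtf((phi * k + rho) * (phi * k + rho) + (1.0f - rho * rho)));
    // implied volatility sigma_imp = sqrt(w/t)
    return sqrtf(w / t);
}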

Interpolation on $]0,\max(K_g)] \times ]0,\max(T_g)]$

We compute some values of $\sigma_{\mathrm{imp}}(x,t)$ on the grid $K_g \times T_g$. Then, we interpolate these values to obtain $\bar\sigma(x,t)$ defined on $]0,\max(K_g)] \times ]0,\max(T_g)]$, and we assume
$$\sigma_{\mathrm{imp}}(x,t) \approx \bar\sigma(x,t) \quad \text{when } (x,t) \in\; ]0,\max(K_g)] \times ]0,\max(T_g)].$$


Local volatility challenges GPUs

Bicubic interpolation for implied volatility

Let $k, q$ with $(x,t) \in\; ]K_g[k], K_g[k+1]] \times ]T_g[q], T_g[q+1]]$:
$$\bar\sigma(x,t) = \sum_{i=0}^{3}\sum_{j=0}^{3} C_g(k,q,i,j)\, l^i u^j$$
where $l = \dfrac{t - T_g[q]}{T_g[q+1] - T_g[q]}$ and $u = \dfrac{x - K_g[k]}{K_g[k+1] - K_g[k]}$.

Fourth array: $C_g$, stored flat as
$$C_g(k,q,i,j) = C_g[k(n_t-1)\cdot 16 + q\cdot 16 + i\cdot 4 + j]$$
with $(k,q,i,j) \in \{0,\dots,n_k-1\} \times \{0,\dots,n_t-1\} \times \{0,\dots,3\} \times \{0,\dots,3\}$, where $n_k$, $n_t$ are the sizes of $K_g$ and $T_g$.
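A minimal sketch of this bicubic evaluation as a device function is given below, assuming $K_g$, $T_g$ and $C_g$ live in constant memory with the flat indexing above; the array sizes and the linear index search are illustrative and may differ from the lab's code.

// Hypothetical sizes; in the lab these come from the data files.
#define NK 40
#define NT 20
__constant__ float Kg[NK], Tg[NT], Cg[(NK - 1) * (NT - 1) * 16];

__device__ float sigma_bar(float x, float t)
{
    // locate k and q such that Kg[k] < x <= Kg[k+1] and Tg[q] < t <= Tg[q+1]
    int k = 0, q = 0;
    while (k < NK - 2 && x > Kg[k + 1]) k++;
    while (q < NT - 2 && t > Tg[q + 1]) q++;

    float l = (t - Tg[q]) / (Tg[q + 1] - Tg[q]);
    float u = (x - Kg[k]) / (Kg[k + 1] - Kg[k]);

    // sigma_bar(x,t) = sum_{i,j} Cg(k,q,i,j) l^i u^j, with the flat indexing of the slide
    float res = 0.0f, li = 1.0f;
    for (int i = 0; i < 4; i++) {
        float uj = 1.0f;
        for (int j = 0; j < 4; j++) {
            res += Cg[k * (NT - 1) * 16 + q * 16 + i * 4 + j] * li * uj;
            uj *= u;
        }
        li *= l;
    }
    return res;
}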

Approximated local volatility
$$\sigma^2_{\mathrm{loc}}(x,t) \approx \min\!\left(\max\!\left(\tilde\sigma^2(x,t),\,0.0001\right),\,0.5\right)$$
where
$$\tilde\sigma^2(x,t) = \frac{2\,\frac{\partial \bar\sigma}{\partial t} + \frac{\bar\sigma}{t} + 2 x r_g(t)\,\frac{\partial \bar\sigma}{\partial x}}{x^2\left[\frac{\partial^2 \bar\sigma}{\partial x^2} - d_+\sqrt{t}\left(\frac{\partial \bar\sigma}{\partial x}\right)^{2} + \frac{1}{\bar\sigma}\left(\frac{1}{x\sqrt{t}} + d_+\frac{\partial \bar\sigma}{\partial x}\right)^{2}\right]}$$
with $d_+ = \frac{1}{2}\bar\sigma\sqrt{t} + \left[\log(x_0/x) + \int_0^t r_g(u)\,du\right]\big/\left[\bar\sigma\sqrt{t}\right]$.



Monte Carlo and OpenACC parallelization

Pricing bullet option with Monte Carlo (MC)

Price for $t \in [0,T)$:
$$F_t = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\!\left((S_T - K)^+\,\mathbf{1}_{\{I_T \in [P_1,P_2]\}}\,\middle|\,\mathcal{F}_t\right) \quad \text{with } I_t = \sum_{T_i \le t} \mathbf{1}_{\{S_{T_i} < B\}}$$

- $K$, $T$ are respectively the contract's strike and maturity
- $0 < T_1 < \dots < T_M < T$ is a predetermined schedule
- The barrier $B$ should be crossed $I_T$ times with $\{P_1,\dots,P_2\} \subset \{0,\dots,M\}$

$(S_t, I_t)$ is Markov:
$$F(t,x,j) = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\!\left(X \,\middle|\, S_t = x,\, I_t = j\right), \qquad X = (S_T - K)^+\,\mathbf{1}_{\{I_T \in [P_1,P_2]\}} \tag{2}$$

MC procedure: simulate $F(0,x_0,0) = \mathbb{E}(X)$ using a family $\{X_i\}_{i \le n}$ of i.i.d. copies of $X$.

- Strong law of large numbers:
$$\mathbb{P}\!\left(\lim_{n\to+\infty} \frac{X_1 + X_2 + \dots + X_n}{n} = \mathbb{E}(X)\right) = 1$$
- Central limit theorem: denoting $\varepsilon_n = \mathbb{E}(X) - \dfrac{X_1 + X_2 + \dots + X_n}{n}$,
$$\frac{\sqrt{n}}{\sigma}\,\varepsilon_n \to G \sim \mathcal{N}(0,1), \quad \text{where } \sigma \text{ is the standard deviation of } X$$
- There is a 95% chance of having $|\varepsilon_n| \le \dfrac{1.96\,\sigma}{\sqrt{n}}$.
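As a small host-side illustration of the bound above, the estimate and its 95% half-width can be computed from the accumulated sum and sum of squares of the simulated payoffs; the function and variable names are illustrative.

#include <cmath>

struct MCResult { double price; double halfWidth95; };

// sum and sumSq are the accumulated sum and sum of squares of the n payoffs X_i,
// discount is e^{-int_0^T r_g(u) du}.
MCResult mcConfidence(double sum, double sumSq, long n, double discount)
{
    double mean = sum / n;                     // (X_1 + ... + X_n)/n
    double var  = sumSq / n - mean * mean;     // empirical variance of X
    double ci   = 1.96 * std::sqrt(var / n);   // |eps_n| <= 1.96 sigma / sqrt(n) with 95% probability
    return { discount * mean, discount * ci };
}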


Monte Carlo and OpenACC parallelization

Discretization: set $0 = t_0 < t_1 < \dots < t_{N_t} = T$, finer than $0 < T_1 < \dots < T_M < T$, with $\delta t = \sqrt{t_{k+1} - t_k}$.

Iterating for a path-dependent contract ($x_0 = 50$). For each $t_k$, $k = 0,\dots,N_t-1$:

1. Random number generation (RNG) of independent normal variables $G_i$
2. Stock price update: $S^i_{t_{k+1}} = S^i_{t_k}\left(1 + r_g(t_k)\,\delta t\,\delta t + \sigma_{\mathrm{loc}}(S^i_{t_k}, t_k)\,\delta t\, G_i\right)$
3. If $t_{k+1} = T_l$ with $l \in \{1,\dots,M\}$: $I^i_{T_l} = I^i_{T_{l-1}} + \mathbf{1}_{\{S^i_{T_l} < B\}}$

At $t_{N_t}$: compute the payoff $X^i$, then average (a per-path sketch of this loop is given below).
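A minimal per-path sketch of the loop above, with one Monte Carlo path per thread; the device helpers (CMRG state, Gaussian draw, r_g, sigma_loc) are assumed to exist elsewhere and their names are illustrative.

struct CMRGState;                                  // RNG state, defined in rng.cu (assumed)
__device__ float nextGaussian(CMRGState* state);   // N(0,1) draw from the CMRG (assumed)
__device__ float r_g(float t);                     // piecewise-constant rate (assumed)
__device__ float sigma_loc(float S, float t);      // local volatility of slide 8 (assumed)

__device__ float payoff_one_path(float x0, float K, float B, int P1, int P2,
                                 const float* tGrid, int Nt,
                                 const int* isMonitoringDate,  // indexed over the Nt+1 points, 1 if t_{k+1} is some T_l
                                 CMRGState* state)
{
    float S = x0;
    int   I = 0;
    for (int k = 0; k < Nt; k++) {
        float dt2 = tGrid[k + 1] - tGrid[k];       // (delta t)^2 = t_{k+1} - t_k
        float dt  = sqrtf(dt2);
        float G   = nextGaussian(state);           // independent N(0,1) draw
        // Euler step: S_{t_{k+1}} = S_{t_k} (1 + r_g dt^2 + sigma_loc dt G)
        S = S * (1.0f + r_g(tGrid[k]) * dt2 + sigma_loc(S, tGrid[k]) * dt * G);
        if (isMonitoringDate[k + 1]) I += (S < B); // update I at the dates T_l
    }
    // bullet payoff X = (S_T - K)^+ 1_{I_T in [P1,P2]}
    return (I >= P1 && I <= P2) ? fmaxf(S - K, 0.0f) : 0.0f;
}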


Monte Carlo and OpenACC parallelization

P. L'Ecuyer's CMRG on GPU to generate uniformly distributed random variables

General form of linear RNGs: without loss of generality,
$$X_n = (A X_{n-1} + C) \bmod m = (A : C)\begin{pmatrix} X_{n-1} \\ \vdots \\ 1 \end{pmatrix} \bmod m \tag{3}$$

Parallel RNG from period splitting of one RNG
- Pierre L'Ecuyer proposed a very efficient RNG (1996): a CMRG on 32 bits, i.e. a combination of two Multiple Recursive Generators (MRG), each with lag 3,
$$x_n = (a_1 x_{n-1} + a_2 x_{n-2} + a_3 x_{n-3}) \bmod m$$
- Very long period, about $2^{185}$.

Pre-computations: we launch as many parallel RNGs as the number of paths.

Use: we prefer local variables to store the RNG's state vector.
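For orientation, a generic sketch of one step of an MRG with lag 3 is given below; the actual multipliers, moduli and the combination of the two MRGs into the CMRG should be taken from L'Ecuyer (1996) and the provided rng.cu, so they are left symbolic here.

// One step of x_n = (a1 x_{n-1} + a2 x_{n-2} + a3 x_{n-3}) mod m.
// The 3-word state x[] is kept in local variables/registers, as recommended on the slide.
__device__ unsigned int mrg_step(unsigned int x[3],
                                 long long a1, long long a2, long long a3, long long m)
{
    long long xn = (a1 * x[2] + a2 * x[1] + a3 * x[0]) % m;  // 64-bit products to avoid overflow
    if (xn < 0) xn += m;                                     // keep the state in [0, m)
    x[0] = x[1];                                             // shift the lag-3 state
    x[1] = x[2];
    x[2] = (unsigned int)xn;
    return x[2];
}
// The CMRG combines two such MRGs (with their own multipliers and moduli) into one uniform draw.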


Monte Carlo and OpenACC parallelization

Some OpenACC pragmas: #pragma acc clause

Global variables: clause = declare device_resident, applied to rg, Kg, Tg and Cg. These arrays are known by all GPU functions.

Copying to GPU: clause = enter data copyin copies the values to the GPU.

Embarrassingly parallel: clause = parallel loop present (+ reduction(operation, arguments)), which parallelizes the execution of the loop without data movement; reduction(operation, arguments) reduces all the private copies of arguments into one final result using operation.
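A minimal sketch combining these clauses in an averaging loop; the array names and sizes are placeholders, not the lab's code.

// Global arrays such as rg, Kg, Tg, Cg would carry
// "#pragma acc declare device_resident(rg, Kg, Tg, Cg)" next to their declarations,
// so that every GPU routine can access them.

float priceMC(const float* pathPayoff, int n)
{
    float sum = 0.0f;
    #pragma acc enter data copyin(pathPayoff[0:n])           // copy the payoffs to the GPU once
    #pragma acc parallel loop present(pathPayoff[0:n]) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += pathPayoff[i];                                // private copies of sum reduced with '+'
    #pragma acc exit data delete(pathPayoff[0:n])
    return sum / n;
}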



Monte Carlo and CUDA parallelization

GPU architecture

[Figure: GPU architecture. Each streaming multiprocessor contains processing units, registers, an L1 cache, constant and texture caches, and shared memory clocked at the processing rate; all multiprocessors share an L2 cache and the global memory (GPU RAM).]

Hardware/software equivalence
- Streaming processor: executes threads
- Streaming multiprocessor: executes blocks

Built-in variables: known within functions executed on the GPU: threadIdx.x, blockIdx.x, blockDim.x, gridDim.x

Global memory: the GPU RAM, allocated with cudaMalloc. Values are transferred between CPU and GPU with cudaMemcpy.

Constant memory: read-only, declared globally. Data transfer with cudaMemcpyToSymbol.

Shared memory: declared in the kernel.
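A compact sketch using these memory spaces together; sizes and names are placeholders, not the lab's code.

__constant__ float d_rg[40];                    // constant memory: read-only on the GPU

__global__ void scaleKernel(float* d_data, int n)
{
    __shared__ float tile[256];                 // shared memory: declared inside the kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = d_data[i];          // global memory read
        d_data[i] = tile[threadIdx.x] * d_rg[0];
    }
}

void memoryDemo(const float* h_rg, const float* h_data, float* h_out, int n)
{
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));                      // allocate global memory
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_rg, h_rg, 40 * sizeof(float));                  // fill constant memory
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_out, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}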

Monte Carlo and CUDA parallelization

Function declaration and calling

Standard C functions: the same as for C or C++ programming.

Device functions
- Called by the GPU and executed on the GPU
- Declared as
  __device__ void myDivFun (...) { ...; }
  __device__ float myDivFun (...) { ...; }
- Called simply by myDivFun(...), but only within other device functions or kernels

Kernel functions
- Called by the CPU and executed on the GPU
- Declared as __global__ void myKernel (...) { ...; }
- Called with
  myKernel<<<numBlocks, threadsPerBlock>>>(...);
  where numBlocks should take into account the number of multiprocessors and threadsPerBlock should be equal to 128, 256, 512 or 1024
- Dynamic parallelism: kernels can also be called within kernels by the GPU and executed on the GPU
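A minimal declaration-and-launch sketch following the rules above; the names are illustrative.

__device__ float squarePlus(float x, float y)       // device function: callable from kernels only
{
    return x * x + y;
}

__global__ void myKernel(float* out, int n)         // kernel: launched from the CPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables give the global index
    if (i < n)
        out[i] = squarePlus((float)i, 1.0f);
}

// Host-side call: threadsPerBlock in {128, 256, 512, 1024}, numBlocks covering the n threads.
void launchExample(float* d_out, int n)
{
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<numBlocks, threadsPerBlock>>>(d_out, n);
}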


Monte Carlo and CUDA parallelization

Adaptation steps

Extension: change MC.cpp and rng.cpp to MC.cu and rng.cu.

__constant__: declare Tg, Kg, rg and Cg in constant memory.

Copying: copy the values of Tg, Kg, rg, Cg and the CMRG states to the GPU.

__device__: except main, MC, VarMalloc, FreeVar and parameters, define all other functions in MC.cu using __device__.

Parallel: declare the MC function as a kernel using __global__, and replace the Monte Carlo loop using threadIdx.x, blockDim.x, blockIdx.x in the MC kernel.

Reduction: perform the reduction using __shared__ memory in the MC kernel and the atomicAdd function (a sketch of this pattern is given below).
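A minimal sketch of this reduction pattern (launched with 256 threads per block): each thread writes its payoff into shared memory, the block is reduced in shared memory, and one atomicAdd per block accumulates the global sum. The payoff array and result pointer are illustrative.

__global__ void mcReduceKernel(float* d_sum, const float* d_payoff, int n)
{
    __shared__ float cache[256];                      // one slot per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? d_payoff[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction within the block
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(d_sum, cache[0]);                   // one atomic per block into the global result
}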


Monte Carlo and CUDA parallelization

GPU timer

float Tim;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);

/*******************************************************
 To compute the execution time of this part of the code
*******************************************************/

cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);



PDE formulation & Crank-Nicolson scheme

PDE, u(t, x , j) with respect to x , Black & Scholes PDE

The price
$$F(t,x,j) = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\!\left(X \,\middle|\, S_t = x,\, I_t = j\right), \qquad X = (S_T - K)^+\,\mathbf{1}_{\{I_T \in [P_1,P_2]\}} \tag{4}$$

For $u(t,x,j) = e^{\int_t^T r_g(u)\,du}\,F(t, e^x, j)$, we equivalently solve the PDE for $u$ given by
$$\frac{1}{2}\sigma^2_{\mathrm{loc}}(x,t)\,\frac{\partial^2 u}{\partial x^2}(t,x,j) + \mu(x,t)\,\frac{\partial u}{\partial x}(t,x,j) = -\frac{\partial u}{\partial t}(t,x,j) \tag{5}$$
$$\mu(x,t) = r_g(t) - \frac{\sigma^2_{\mathrm{loc}}(x,t)}{2}, \qquad (x,t,j) \in\; ]0,\max(K_g)] \times ]T_k, T_{k+1}[ \times [0,P_2]$$
with $T_0 = 0$ and $T_{M+1} = T$.

Final and boundary conditions: $u(T,x,j) = \max(e^x - K, 0)$ for any $(x,j)$, and heuristically $u(t, \log[\min(K_g)], j) = p_{\min} = 0$ and $u(t, \log[\max(K_g)], j) = p_{\max} = \max(K_g) - K$.

Crank-Nicolson scheme: denoting $u_{k,i,j} = u(t_k, x_i, j)$, $\sigma = \sigma_{\mathrm{loc}}(x_i, t_k)$ and $\mu = \mu(x_i, t_k)$,
$$q_u u_{k,i+1} + q_m u_{k,i} + q_d u_{k,i-1} = p_u u_{k+1,i+1} + p_m u_{k+1,i} + p_d u_{k+1,i-1}$$
$$q_u = -\frac{\sigma^2\Delta t}{4\Delta x^2} - \frac{\mu\Delta t}{4\Delta x}, \qquad q_m = 1 + \frac{\sigma^2\Delta t}{2\Delta x^2}, \qquad q_d = -\frac{\sigma^2\Delta t}{4\Delta x^2} + \frac{\mu\Delta t}{4\Delta x}$$
$$p_u = \frac{\sigma^2\Delta t}{4\Delta x^2} + \frac{\mu\Delta t}{4\Delta x}, \qquad p_m = 1 - \frac{\sigma^2\Delta t}{2\Delta x^2}, \qquad p_d = \frac{\sigma^2\Delta t}{4\Delta x^2} - \frac{\mu\Delta t}{4\Delta x}$$
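A small device-side sketch of how these coefficients can be computed for one grid node; the function name and argument list are illustrative, not the lab's code.

__device__ void cnCoefficients(float sig, float mu, float dt, float dx,
                               float* qu, float* qm, float* qd,
                               float* pu, float* pm, float* pd)
{
    float A = sig * sig * dt / (4.0f * dx * dx);   // sigma^2 dt / (4 dx^2)
    float B = mu * dt / (4.0f * dx);               // mu dt / (4 dx)
    *qu = -A - B;  *qm = 1.0f + 2.0f * A;  *qd = -A + B;   // implicit (time t_k) side
    *pu =  A + B;  *pm = 1.0f - 2.0f * A;  *pd =  A - B;   // explicit (time t_{k+1}) side
    // Right-hand side at node i: rhs_i = (*pu) u_{k+1,i+1} + (*pm) u_{k+1,i} + (*pd) u_{k+1,i-1},
    // and the tridiagonal system q_d u_{k,i-1} + q_m u_{k,i} + q_u u_{k,i+1} = rhs_i is solved in x.
}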


PDE formulation & Crank-Nicolson scheme

PDE, u(t, x , j) with respect to j , barrier condition

$$u_t(x,j) = \mathbb{E}\!\left[(S_T - K)^+\,\mathbf{1}_{\left\{\sum_{i=1}^{M}\mathbf{1}_{\{S_{T_i} < B\}} \in [P_1,P_2]\right\}} \,\middle|\, S_t = x,\; \sum_{T_i \le t}\mathbf{1}_{\{S_{T_i} < B\}} = j\right]$$

For $t \in [T_M, T[$:
$$u_t(x,j) = \mathbb{E}\!\left[(S_T - K)^+ \,\middle|\, S_t = x\right]$$

For $t \in [T_{M-1}, T_M]$:
$$u_t(x,j) = \mathbb{E}\!\left[(S_T - K)^+\mathbf{1}_{\{S_{T_M} \ge B\}} \,\middle|\, S_t = x\right]\mathbf{1}_{\{j = P_2\}} + \mathbb{E}\!\left[(S_T - K)^+\mathbf{1}_{\{S_{T_M} < B\}} \,\middle|\, S_t = x\right]\mathbf{1}_{\{j = P_1 - 1\}} + \mathbb{E}\!\left[(S_T - K)^+ \,\middle|\, S_t = x\right]\mathbf{1}_{\{j \in [P_1, P_2-1]\}}$$

For $t \in [T_{M-k-1}, T_{M-k}]$, $k = M-1,\dots,1$ (with $T_0 = 0$):
$$u_t(x,j) = \mathbb{E}\!\left[u_{T_{M-k}}(S_{T_{M-k}}, P_2)\,\mathbf{1}_{\{S_{T_{M-k}} \ge B\}} \,\middle|\, S_t = x\right]\mathbf{1}_{\{j = P_2\}} + \mathbb{E}\!\left[u_{T_{M-k}}(S_{T_{M-k}}, p^1_k)\,\mathbf{1}_{\{S_{T_{M-k}} < B\}} \,\middle|\, S_t = x\right]\mathbf{1}_{\{j = p^1_k - 1\}}$$
$$\qquad + \mathbb{E}\!\left[u_{T_{M-k}}(S_{T_{M-k}}, j)\,\mathbf{1}_{\{S_{T_{M-k}} \ge B\}} + u_{T_{M-k}}(S_{T_{M-k}}, j+1)\,\mathbf{1}_{\{S_{T_{M-k}} < B\}} \,\middle|\, S_t = x\right]\mathbf{1}_{\{j \in [p^1_k, P_2-1]\}}$$
with $p^1_k = \max(P_1 - k, 0)$.




PDE simulation and CUDA parallelization

Solve tridiagonal system for the implicit part

$$T = \begin{pmatrix}
d_1 & c_1 & & & \\
a_2 & d_2 & c_2 & & 0 \\
 & a_3 & d_3 & \ddots & \\
 & & \ddots & \ddots & c_{n-1} \\
0 & & & a_n & d_n
\end{pmatrix}.$$

Cyclic Reduction (illustrated on an 8-unknown system $z_1,\dots,z_8$ with equations $e_1,\dots,e_8$):
- Step 1: forward reduction to a 4-unknown system involving $z_2$, $z_4$, $z_6$ and $z_8$
- Step 2: forward reduction to a 2-unknown system involving $z_4$ and $z_8$
- Step 3: solve the 2-unknown system
- Step 4: backward substitution to recover the remaining 2 unknowns
- Step 5: backward substitution to recover the remaining 4 unknowns


PDE simulation and CUDA parallelization

$$\begin{pmatrix}
d_1 & c_1 & & & & & \\
c_1 & d_2 & c_2 & & & & \\
 & c_2 & d_3 & c_3 & & & \\
 & & c_3 & d_4 & c_4 & & \\
 & & & c_4 & d_5 & c_5 & \\
 & & & & c_5 & d_6 & c_6 \\
 & & & & & c_6 & d_7
\end{pmatrix}
\xrightarrow{(R)}
\begin{pmatrix}
d'_1 & 0 & c'_2 & & & & \\
0 & d'_2 & 0 & c'_3 & & & \\
c'_2 & 0 & d'_3 & 0 & c'_4 & & \\
 & c'_3 & 0 & d'_4 & 0 & c'_5 & \\
 & & c'_4 & 0 & d'_5 & 0 & c'_6 \\
 & & & c'_5 & 0 & d'_6 & 0 \\
 & & & & c'_6 & 0 & d'_7
\end{pmatrix}$$

$$\xrightarrow{(P)}
\begin{pmatrix}
d'_1 & c'_2 & & & & & \\
c'_2 & d'_3 & c'_4 & & & & \\
 & c'_4 & d'_5 & c'_6 & & & \\
 & & c'_6 & d'_7 & 0 & & \\
 & & & 0 & d'_2 & c'_3 & \\
 & & & & c'_3 & d'_4 & c'_5 \\
 & & & & & c'_5 & d'_6
\end{pmatrix}
\xrightarrow{(R)}
\begin{pmatrix}
d''_1 & 0 & c''_2 & & & & \\
0 & d''_3 & 0 & c''_4 & & & \\
c''_2 & 0 & d''_5 & 0 & & & \\
 & c''_4 & 0 & d''_7 & 0 & & \\
 & & & 0 & d''_2 & 0 & c''_3 \\
 & & & & 0 & d''_4 & 0 \\
 & & & & c''_3 & 0 & d''_6
\end{pmatrix}$$

$$\xrightarrow{(P)}
\begin{pmatrix}
d''_1 & c''_2 & & & & & \\
c''_2 & d''_5 & 0 & & & & \\
 & 0 & d''_3 & c''_4 & & & \\
 & & c''_4 & d''_7 & 0 & & \\
 & & & 0 & d''_2 & c''_3 & \\
 & & & & c''_3 & d''_6 & 0 \\
 & & & & & 0 & d''_4
\end{pmatrix}$$
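For orientation, a minimal parallel cyclic reduction (PCR) kernel is sketched below: at each step every equation is combined with its neighbours at distance s, so after about log2(n) steps each unknown is decoupled. This is a generic textbook sketch, not the PCR version distributed with the lab; a_in[0] and c_in[n-1] are assumed to be 0.

__global__ void pcrSolve(const float* a_in, const float* d_in, const float* c_in,
                         const float* b_in, float* x_out, int n)
{
    // One block solves one tridiagonal system of size n (n <= blockDim.x);
    // system number blockIdx.x starts at offset blockIdx.x * n.
    extern __shared__ float sh[];
    float *a = sh, *d = sh + n, *c = sh + 2 * n, *b = sh + 3 * n;
    int ofs = blockIdx.x * n;
    int i = threadIdx.x;
    if (i < n) { a[i] = a_in[ofs + i]; d[i] = d_in[ofs + i]; c[i] = c_in[ofs + i]; b[i] = b_in[ofs + i]; }
    __syncthreads();

    for (int s = 1; s < n; s *= 2) {            // after ceil(log2 n) steps every unknown is decoupled
        float an = 0.f, dn = 0.f, cn = 0.f, bn = 0.f;
        if (i < n) {
            float alpha = (i - s >= 0) ? -a[i] / d[i - s] : 0.f;
            float beta  = (i + s <  n) ? -c[i] / d[i + s] : 0.f;
            dn = d[i] + ((i - s >= 0) ? alpha * c[i - s] : 0.f) + ((i + s < n) ? beta * a[i + s] : 0.f);
            an = (i - s >= 0) ? alpha * a[i - s] : 0.f;
            cn = (i + s <  n) ? beta  * c[i + s] : 0.f;
            bn = b[i] + ((i - s >= 0) ? alpha * b[i - s] : 0.f) + ((i + s < n) ? beta * b[i + s] : 0.f);
        }
        __syncthreads();                        // all reads of step s done before overwriting
        if (i < n) { a[i] = an; d[i] = dn; c[i] = cn; b[i] = bn; }
        __syncthreads();
    }
    if (i < n) x_out[ofs + i] = b[i] / d[i];    // each remaining equation involves a single unknown
}

// Example launch: one block per tridiagonal system,
// pcrSolve<<<numSystems, n, 4 * n * sizeof(float)>>>(a, d, c, b, x, n);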


PDE simulation and CUDA parallelization

Adaptation steps

Extension: change PDE.cpp to PDE.cu.

__constant__: declare Tg, Kg, rg and Cg in constant memory.

Copying: copy the values of Tg, Kg, rg and Cg to the GPU.

__device__: except for main, PDE_Diff, Out2In, VarMalloc, FreeVar and parameters, define all other functions in PDE.cu using __device__.

Registers: compare the CPU version of PCR to the GPU version given to you; note the heavy use of shared memory and registers.

Parallel: declare the PDE_Diff and Out2In functions as kernels using __global__, and replace the loops related to St by threadIdx.x and the loops related to It by blockIdx.x in the PDE_Diff and Out2In kernels, as well as in the Exp device function.


PDE simulation and CUDA parallelization

GPU timer

float Tim;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start,0);

/*******************************************************
 To compute the execution time of this part of the code
*******************************************************/

cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);


PDE simulation and CUDA parallelization

References
- L. A. Abbas-Turki, S. Vialle, B. Lapeyre, and P. Mercier, "Pricing derivatives on graphics processing units using Monte Carlo simulation", Concurrency and Computation: Practice and Experience, vol. 26(9), pp. 1679–1697, 2014.
- L. A. Abbas-Turki and S. Graillat, "Resolving small random symmetric linear systems on graphics processing units", The Journal of Supercomputing, vol. 73(4), pp. 1360–1386, 2017.
- L. B. G. Andersen and R. Brotherton-Ratcliffe, "The equity option volatility smile: an implicit finite difference approach", Journal of Computational Finance, vol. 1(2), pp. 5–37, 1997.
- J. Gatheral and A. Jacquier, "Arbitrage-free SVI volatility surfaces", Quantitative Finance, vol. 14(1), pp. 59–71, 2013.
- P. L'Ecuyer, "Combined multiple recursive random number generators", Operations Research, vol. 44(5), pp. 816–822, 1996.
