GPU Applications for Modern Large Scale Asset Management
GTC 2014 – San Jose, California
Dr. Daniel Egloff
QuantAlea & IncubeAdvisory
March 27, 2014
Outline
Portfolio Construction
Constructing Asset Return Distributions
I) Entropy Approach: Large Scale Convex Optimization with GPUs
II) Bayesian Approach: Parallel Metropolis Hastings on GPUs
Conclusion
Portfolio Construction
I Markowitz mean variance portfolio optimization
I Optimally balance risk and performance

    w*_λ = argmax_{w, 1ᵀw = b} U_λ(w),   U_λ(w) = μᵀw − λ wᵀΣw   (1)

I Risk budgeting
I Construct the portfolio so that the risk contributions satisfy RC_i(w) = b_i for given risk budgets b_i
I For the volatility risk measure we obtain

    RC_i(w) = w_i (Σw)_i / √(wᵀΣw)   (2)

I For both approaches we need the means μ and covariances Σ of asset returns
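As an illustrative sketch in Python (our own toy code and data, not from the talk), the utility (1) and the volatility risk contributions (2) are straightforward to evaluate; note that the risk contributions sum to the portfolio volatility (Euler decomposition):

```python
import numpy as np

def utility(w, mu, Sigma, lam):
    """Mean-variance utility U_lambda(w) = mu^T w - lambda * w^T Sigma w."""
    return mu @ w - lam * w @ Sigma @ w

def risk_contributions(w, Sigma):
    """Volatility risk contributions RC_i(w) = w_i (Sigma w)_i / sqrt(w^T Sigma w)."""
    sw = Sigma @ w
    return w * sw / np.sqrt(w @ sw)

# toy data: 3 assets (numbers invented for illustration)
mu = np.array([0.05, 0.07, 0.03])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.01]])
w = np.array([0.4, 0.3, 0.3])

rc = risk_contributions(w, Sigma)
print(utility(w, mu, Sigma, lam=2.0))
print(rc, rc.sum(), np.sqrt(w @ Sigma @ w))  # RCs sum to portfolio volatility
```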
Asset Return Distributions
Two steps
1. Prior asset return distribution from historical data or an inverse optimization principle
2. Posterior asset return distribution should reflect expert views and beliefs

I Views are distortions of a prior distribution
I Views translate into constraints
I Linear constraints on mean return pioneered by Black and Litterman
We discuss two approaches
I Entropy based approach
I Bayesian methods
Entropy Approach
I The minimum discrimination principle (Kullback and Leibler 1951) infers a posterior distribution such that it
  I Satisfies a set of moment constraints
  I Is as hard as possible to differentiate from a given prior distribution
I The minimum discrimination principle is a suitable methodology to incorporate constraints on a prior distribution which are
  I Linear
  I Quadratic
  I Convex
Nonparametric Setup
I Sample the probability distributions
I Discrete sample space X = {x_1, …, x_m}
I Probabilities are elements of the standard simplex

    p = (p_1, …, p_m)ᵀ ∈ Δ_m ⊂ R^m

I Implement views as distortions of the prior probabilities p_i
I Express views as feature functions f_j : X → R, j = 1, …, n
I Identify the feature vector f = (f_1, …, f_n)ᵀ : X → R^n with the matrix F = (f_{ji}) ∈ R^{n×m}, where f_{ji} = f_j(x_i)
Linear Constraints on Features
I The expectation of a feature f : X → R w.r.t. a measure q ∈ Δ_m is

    E_q[f] = Σ_i q_i f(x_i) = qᵀf

I The minimum discrimination principle defines the probability p* as

    p* = argmin_{q ∈ Δ_m} qᵀ(log(q) − log(p))   (9)

subject to the constraints

    E_q[f_j] ≤ c_j, j = 1, …, n,   E_q[f̄_j] = c̄_j, j = 1, …, n̄   (10)
Optimization Problem
I The expectation views (10) lead to linear constraints

    p* = argmin_{q ∈ R^m₊, 1ᵀq = 1, Fq ≤ c, F̄q = c̄} qᵀ(log(q) − log(p))   (11)

I A large number of samples is needed for an accurate representation of the distributions, with m ≈ 10^6
I With linear constraints: Lagrange duality yields the optimal solution
  I Strong duality, i.e. no duality gap
  I Converts the problem of dimension m into much smaller problems of dimension n + n̄ to find the optimal Lagrange multipliers
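As a hedged illustration of this duality reduction (our own Python/NumPy sketch assuming SciPy; all names are invented), for equality views the optimal posterior has the exponential-family form q*(λ) ∝ p · exp(−Fᵀλ), and the dual is a smooth unconstrained problem in the n multipliers λ:

```python
import numpy as np
from scipy.optimize import minimize

def posterior(lam, p, F):
    """q*(lam) proportional to p * exp(-F^T lam), normalized on the simplex."""
    w = p * np.exp(-F.T @ lam)
    return w / w.sum()

def dual(lam, p, F, c):
    """Dual objective to minimize: log sum_i p_i exp(-(F^T lam)_i) + lam^T c."""
    z = np.log(p) - F.T @ lam
    zmax = z.max()                     # shift for numerical stability
    return zmax + np.log(np.exp(z - zmax).sum()) + lam @ c

# toy example: m = 1000 samples, one equality view E_q[f] = 0.2 with f(x) = x
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
p = np.full(1000, 1e-3)                # uniform prior
F = x[None, :]                         # single feature row
c = np.array([0.2])

res = minimize(dual, np.zeros(1), args=(p, F, c), method="BFGS")
q = posterior(res.x, p, F)
print(q @ x)   # close to 0.2: the view is matched at the dual optimum
```

The first-order condition of the dual is exactly E_q[f_j] = c_j, so a 1,000,000-dimensional primal problem collapses to an n-dimensional smooth minimization.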
Convex Constraints
I How to extend to variance and covariance constraints?
I The lower variance constraint qᵀf² − (qᵀf)² ≥ c is convex, but the dual problem is ill-conditioned
I The upper variance constraint qᵀf² − (qᵀf)² ≤ c is not even convex
I Approximation by linearization qᵀf² ≤ c + (pᵀf)² gives a second moment constraint
I But … the linearized version is often not accurate enough
Large Scale Convex Optimization
I For convex constraints we solve the primal problem directly
I Several accelerated first order methods (FOMs) for convex optimization have been developed
I Modern algorithms adjust to the problem's geometry
I Examples
  I Mirror descent (MD) algorithm
  I Level bundle methods
  I Bundle mirror descent
  I Many different variations
Exploiting Geometry
I Gradient methods are regularized with a quadratic proximity term

    y = argmin_{ξ∈X} { γ⟨∇f(x), ξ⟩ + c‖ξ − x‖² }

I Replace ‖·‖² with a distance-like function D(·,·) that better exploits the geometry of X and is simple to calculate
I Candidates: Bregman type distances based on a kernel ω : X → R

    D(x, y) = D_ω(x, y) ≡ ω(x) − ω(y) − ⟨ω′(y), x − y⟩
Bregman Distances
I Using ω(u) = ½‖u‖² gives the Euclidean norm distance

    D_ω(x, y) = ½‖x − y‖²

I The entropy ω(x) = xᵀ log(x) on X = Δ_m leads to

    D_ω(x, y) = xᵀ log(x/y)

which is 1-strongly convex with respect to the L1 norm:

    D_ω(x, y) ≥ ½‖x − y‖₁²   ∀ x, y ∈ Δ_m
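This strong convexity bound is Pinsker's inequality. A quick numerical sanity check in Python (our own snippet, function name invented):

```python
import numpy as np

def kl(x, y):
    """Entropy Bregman distance on the simplex: D(x, y) = sum_i x_i log(x_i / y_i)."""
    return float(np.sum(x * np.log(x / y)))

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.dirichlet(np.ones(5))
    y = rng.dirichlet(np.ones(5))
    # Pinsker: D(x, y) >= (1/2) ||x - y||_1^2
    assert kl(x, y) >= 0.5 * np.linalg.norm(x - y, 1) ** 2 - 1e-12
print("Pinsker bound holds on all random draws")
```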
Constrained Mirror Descent Setup
I Convex problem min_{x∈X} f_0(x) subject to f_1(x) ≤ 0
I X ⊂ E, a closed convex subset of a finite-dimensional vector space E
I ‖·‖ norm on E, ‖·‖_* dual norm on E*
I Duality pairing ⟨·,·⟩ : E* × E → R
I Bregman distance D_ω(x, y) from kernel ω : X → R
I Prox mapping Prox_x : E* → X°

    Prox_x(ξ) = argmin_{u∈X} { ⟨ξ, u⟩ + D_ω(u, x) }
Constrained Mirror Descent
I Mirror descent with constraints is a simple FOM
I Initial search point x_1 = x_ω = argmin_{x∈X} ω(x)
I For t = 1, …, N do
  I If f_1(x_t) ≤ tol then i(t) = 0 else i(t) = 1
  I Select step size γ_t
  I Update search point x_{t+1} = Prox_{x_t}(γ_t f′_{i(t)}(x_t))
I Approximate solution after N steps

    x̂_N = argmin_{x ∈ {x_t | i(t)=0}} f_0(x)

I Possible choices: tol = γ‖f′_1(x_t)‖_* and γ_t = γ‖f′_{i(t)}(x_t)‖_*^{−1}
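The loop above can be sketched in Python/NumPy on the simplex, using the analytic entropy prox Prox_x(ξ) = x • e^{−ξ} / (xᵀe^{−ξ}); this is a minimal CPU illustration with an invented toy objective and constraint, not the production GPU code:

```python
import numpy as np

def prox(x, xi):
    """Entropy prox on the simplex: Prox_x(xi) = x * exp(-xi) / (x^T exp(-xi))."""
    w = x * np.exp(-xi)
    return w / w.sum()

def mirror_descent(f0, g0, f1, g1, m, steps=5000, gamma=0.01):
    """Constrained MD on the simplex: minimize f0 subject to f1 <= 0.
    g0, g1 return the gradients of f0, f1."""
    x = np.full(m, 1.0 / m)                        # x_1 = argmin of entropy: uniform
    best, best_val = None, np.inf
    for _ in range(steps):
        if f1(x) <= gamma * np.abs(g1(x)).max():   # tol = gamma * ||f1'||_inf
            if f0(x) < best_val:                   # record best feasible iterate
                best, best_val = x.copy(), f0(x)
            g = g0(x)                              # objective step, i(t) = 0
        else:
            g = g1(x)                              # constraint step, i(t) = 1
        x = prox(x, gamma / np.abs(g).max() * g)   # gamma_t = gamma / ||g||_inf
    return best

# toy problem (invented for illustration): minimize c^T q s.t. a^T q >= 0.1
m = 50
c = np.linspace(0.0, 1.0, m) ** 2
a = np.linspace(-1.0, 1.0, m)
q = mirror_descent(lambda q: c @ q, lambda q: c,
                   lambda q: 0.1 - a @ q, lambda q: -a, m)
print(c @ q, a @ q)
```

On the GPU the prox update, the gradients, and the reductions for the norms and objective values are exactly the coordinate-wise kernels and parallel reductions discussed on the following slides.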
Geometry on Simplex
I Optimization problem (11) formulated on simplex X = ∆m
I Norm ‖·‖₁ with dual norm ‖·‖_∞
I The prox mapping has an analytical expression

    Prox_x(ξ) = (xᵀe^{−ξ})^{−1} x • e^{−ξ}

I Initial point x_ω = m^{−1}1_m, the uniform distribution
Constrained Mirror Descent on GPU
I GPU implementation of constrained mirror descent algorithms for very large problem dimensions
I Parallel coordinate-wise calculation of gradients for the objective function and constraints
I Vector reductions to calculate objective values, norms and scalar products
I Parallel coordinate-wise calculation of the prox vector and search points, and execution of the gradient step
I Gradient steps can be further parallelized by coordinate aggregation or randomization methods
Schematic Gradient Step
Gradient step (flowchart):
1. Calculate constraints and objective values and gradients
2. Calculate the dual norm of the gradients and update the approximate solution
3. Calculate the unnormalized prox vector
4. Calculate the prox normalization factor
5. Normalize the prox vector and update the search point
6. Calculate the objective value and test convergence; if not converged ("No"), repeat

The reductions run as a parallel reduction algorithm (2 kernels); the coordinate-wise steps run as a parallel vector transform kernel.
Tuning with Kernel Fusion
I Usually the kernel launch time (≈ 50 µs) is not an issue
I For algorithms with many short iterations, launch overhead becomes crucial
I Use the kernel fusion technique to reduce the number of kernel calls per gradient step
I Minimize the number of arguments to pass to each kernel
Bayesian Approach
I Bayesian methods treat the model parameters as random variables
I Views go into the prior distributions of the model parameters
I Historical data is used to update the prior distribution
I Not as flexible as the entropy approach
I Does not rely on optimization, hence less fragile
I The work horse is a parallel Metropolis Hastings Markov chain Monte Carlo sampler
Parallel Metropolis Hastings
I Parallelization of Metropolis Hastings through multiple parallel chains
I Further parallelization through pre-fetching
I Requires efficient parallel random number generation and branch free methods to sample from the proposal distribution
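The multiple-chains idea can be sketched in Python/NumPy (an illustrative CPU analogue of our own, not the Alea.cuBase GPU code): all chains take their proposal and accept/reject steps in lock-step as vectorized, branch-free operations, which maps directly onto one GPU thread per chain.

```python
import numpy as np

def target(x):
    """Unnormalized target density from the example slide."""
    return np.exp(-x * x) * (2.0 + np.sin(5.0 * x) + np.sin(2.0 * x))

def mh_chains(nchains=4096, steps=2000, sigma=1.0, seed=0):
    """Random-walk Metropolis: nchains independent chains advanced in lock-step."""
    rng = np.random.default_rng(seed)
    x = np.zeros(nchains)                            # all chains start at 0
    px = target(x)
    for _ in range(steps):
        y = x + sigma * rng.normal(size=nchains)     # symmetric Gaussian proposal
        py = target(y)
        accept = rng.random(nchains) < py / px       # MH ratio (symmetric proposal)
        x = np.where(accept, y, x)                   # branch-free update
        px = np.where(accept, py, px)
    return x

samples = mh_chains()
print(samples.mean(), samples.std())
```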
Parallel Metropolis Hastings
GPU Metropolis Hastings sampler generation from generic targetand proposal distribution expressions with F# and Alea.cuBase
let inline mhsamples (targetpdf : Expr<'T -> 'T>)
                     (proppdf : Expr<'T -> 'T -> 'T -> 'T>)
                     (proprnd : Expr<'T -> 'T -> 'T -> 'T>) = cuda {
    let! proppdf = proppdf |> Compiler.DefineInlineFunction
    let! proprnd = proprnd |> Compiler.DefineInlineFunction
    let! targetpdf = targetpdf |> Compiler.DefineInlineFunction
    let! kernel =
        <@ fun (steps:int) (scale:'T) (initial:deviceptr<'T>)
               (numbers:deviceptr<'T>) (result:deviceptr<'T>) ->
            let idx = blockIdx.x * blockDim.x + threadIdx.x
            let nchains = gridDim.x * blockDim.x
            ...
Parallel Metropolis Hastings
Expressions for the target distribution and proposal distribution are compiled directly into the kernel

/// Dynamically generate expressions for MH at runtime
let targetpdf =
    <@ fun x -> exp(-x*x) * (2G + sin(5G*x) + sin(2G*x)) @>

let proppdf =
    <@ fun sigma x y ->
        exp(-(y - x)*(y - x)/(2G*sigma*sigma)) / (sigma * __sqrt2pi()) @>

let proprng =
    <@ fun sigma x u -> x + sigma*u @>

/// Compiler integrated into F#, callable through API
let mh = mhsamples targetpdf proppdf proprng |> Util.load Worker.Default
Example
Figure: Sample histogram for the target density e^{−x²}(2 + sin(5x) + sin(2x))

Figure: Kernel density smoothed sample pdf for e^{−x²}(2 + sin(5x) + sin(2x))
Conclusion
I We have shown two applications of GPUs for quantitative asset management
I More applications in strategy calibration, back-testing and MIP
I Large scale convex optimization occurs in other fields, such as
  I Machine learning, e.g. deep belief network training in a big data context
  I Network analysis
  I Truss topology design
  I Compressed sensing
I Markov chain Monte Carlo methods require flexible ways to specify the target pdf and the proposal distribution, and a dynamic way to derive tuning parameters, for which our Alea.cuBase F# GPU compiler is ideal