Unveiling Cellular Mechanisms Using GPU-based Sparse Linear Algebra
Marco Maggioni (CS, PhD Research Assistant)
Tanya Berger-Wolf (CS, Associate Professor)
Jie Liang (BIOE, Professor)
Outline
- Motivations
- Background
  - Chemical Master Equation
  - Working example
- Linear algebra methods
  - Steady-state
  - Transient dynamics
- GPU-based implementation
  - Domain-specific optimization strategies
  - Warp-grained ELL format
- Experimental results and conclusions
Motivations
- The Chemical Master Equation provides a fundamental stochastic framework for studying biochemical reaction networks [1]
- The steady state and dynamics of the system can be derived using linear algebra methods
- GPUs can deal with the computational demand
  - Solving the same reaction network under different conditions
  - Exponential growth in the number of microstates
[1] Jie Liang and Hong Qian, "Computational Cellular Dynamics Based on the Chemical Master Equation: A Challenge for Understanding Complexity", Journal of Computer Science and Technology, 2010
Background
- In the theory of the Chemical Master Equation, the dynamics of a biochemical reaction system in a small volume is represented by a discrete-state, continuous-time Markov process
- Microstates are defined by the molecular species counts:

  x = \{x_1, x_2, \ldots, x_m\} \in \mathbb{N}^m

- The transition from microstate x_j to x_i triggered by reaction k has rate

  A_k(x_i, x_j) = r_k \prod_{i=1}^{m} \binom{x_i}{c_i}

  where r_k is the rate constant of reaction k and c_i is the number of molecules of species i involved in the reaction
- A probability landscape is defined over the microstates
Background
- The discrete Chemical Master Equation describes the change in probability of each microstate due to reactions (ingoing minus outgoing flux):

  \frac{dP(x,t)}{dt} = \sum_{x' \neq x} \left[ A(x,x') P(x',t) - A(x',x) P(x,t) \right]

- The above can be written in matrix form by defining the diagonal entries

  A(x,x) = -\sum_{x' \neq x} A(x',x)

  which gives

  \frac{dP(t)}{dt} = A P(t)

  where P(t) is the probability landscape at time t and A is the reaction rate matrix
Background
- Matrix A is an infinitesimal generator of a Markov model
- The steady state is the solution of a system of linear equations:

  \frac{dP(t)}{dt} = 0 \;\Rightarrow\; A P = 0

- The dynamics is given by a matrix exponential:

  \frac{dP(t)}{dt} = A P \;\Rightarrow\; P(t) = e^{At} P(0)
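As a sanity check, the two relations above can be exercised on a toy two-state generator using SciPy's dense routines. The matrix below is illustrative only, not the toggle-network matrix from the talk:

```python
import numpy as np
from scipy.linalg import expm, null_space

# Toy 2-state generator (columns sum to zero, as required for dP/dt = A P).
# Illustrative only -- not the toggle-network matrix from the talk.
A = np.array([[-1.0,  2.0],
              [ 1.0, -2.0]])

# Steady state: solve A P = 0, then normalize to a probability vector.
P_ss = null_space(A)[:, 0]
P_ss = P_ss / P_ss.sum()          # -> [2/3, 1/3]

# Transient dynamics: P(t) = e^{At} P(0).
P0 = np.array([1.0, 0.0])
P_t = expm(A * 10.0) @ P0         # for large t this approaches P_ss
```

For this generator the slow eigenvalue is -3, so by t = 10 the transient solution has already relaxed to the steady state.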
Working example
- Steady-state probability landscape of the toggle network
  [Figure: the toggle network and its steady-state landscape, giving insight into the macroscopic states of the system]

Working example
- Dynamics of the probability landscape of the toggle network:

  \lim_{t \to \infty} \left[ e^{At} P(0) \right] = \bar{P}

  [Figure: stochastic simulation to study the dynamics of the biological mechanisms; the initial landscape is uniformly distributed, then converges to the steady state]
Steady-state
- Computing the steady state is equivalent to solving a system of linear equations
- Jacobi iteration
  - Easy to parallelize (very similar to SpMV)
  - Numerically well-suited for ill-conditioned reaction rate matrices
  - Iteration derived from a simple matrix decomposition:

    A x = 0 \;\Rightarrow\; x = \left[ -D^{-1}(L+U) \right] x \;\Rightarrow\; x^{k+1} = M x^k

  - Iterative convergence when the spectral radius satisfies \rho(M) < 1
Steady-state
- Each row is an independent computation:

  x_i^{k+1} = -a_{ii}^{-1} \sum_{j \neq i} a_{ij} x_j^k

- Trade-off between convergence and parallelism
- Normalized stopping criterion based on the residual vector r = Ax:

  \frac{\|r\|}{\|A\| \, \|x\|} \leq \varepsilon

- Also check for stagnation and a maximum number of iterations
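A minimal dense sketch of this Jacobi scheme, including the normalized stopping criterion, on a hypothetical 3-state generator. The talk's solver is a sparse GPU kernel; this only shows the serial logic:

```python
import numpy as np

def jacobi_steady_state(A, tol=1e-8, max_iter=10000):
    """Jacobi iteration for A x = 0, renormalized each sweep so that x
    stays a probability vector. Serial sketch of the talk's solver."""
    n = A.shape[0]
    d = np.diag(A)                        # dense diagonal of the generator
    x = np.full(n, 1.0 / n)               # uniform initial landscape
    normA = np.linalg.norm(A)
    for k in range(max_iter):
        # x_i <- -(1/a_ii) * sum_{j != i} a_ij x_j  (each row independent)
        x = -(A @ x - d * x) / d
        x /= x.sum()                      # keep probabilities normalized
        # normalized stopping criterion: ||A x|| / (||A|| ||x||) <= tol
        if np.linalg.norm(A @ x) / (normA * np.linalg.norm(x)) <= tol:
            break
    return x, k

# Hypothetical 3-state generator (columns sum to zero), for illustration.
A = np.array([[-1.0,  1.0,  1.0],
              [ 1.0, -2.0,  1.0],
              [ 0.0,  1.0, -2.0]])
x, iters = jacobi_steady_state(A)         # x -> [1/2, 1/3, 1/6]
```

The renormalization makes the sweep behave like a power iteration on the eigenvalue-1 eigenvector of M, which is what makes Jacobi usable on the singular system A x = 0.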
Transient dynamics
- The matrix exponential is a series with infinitely many terms:

  P(t) = e^{At} P_0 = \sum_{k=0}^{\infty} \frac{1}{k!} (At)^k P_0

- Evaluating it directly creates a dense matrix (not feasible due to "fill-in")
- Krylov-based approximation technique introduced in [2]
  - Reduced-size projection H_m of matrix A onto a Krylov subspace
  - Matrix exponential of H_m computed explicitly with a Padé approximant
[2] Roger B. Sidje and William J. Stewart, "A numerical study of large sparse matrix exponentials arising in Markov chains", Computational Statistics & Data Analysis, 1999
Transient dynamics
- Iterative Arnoldi algorithm to build the projection:
  - Basis normalization (unit vector)
  - Orthogonalization against the existing basis
  - Span the next vector of the Krylov subspace
- The dynamics is then approximated as

  e^{At} P_0 \approx \beta V_{m+1} e^{\bar{H}_{m+1} t} e_1
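The Arnoldi loop above, with the small exponential evaluated by SciPy's `expm` (which uses a Padé approximant internally), can be sketched as follows. This is a simplified m-dimensional version of the (m+1)-augmented scheme in [2], not the talk's GPU code:

```python
import numpy as np
from scipy.linalg import expm

def krylov_expm(A, v, t, m=10):
    """Approximate e^{At} v by Arnoldi projection onto an m-dimensional
    Krylov subspace. Simplified sketch of the method in [2]."""
    n = len(v)
    beta = np.linalg.norm(v)
    V = np.zeros((n, m + 1))              # orthogonal Krylov basis
    H = np.zeros((m + 1, m))              # Hessenberg projection of A
    V[:, 0] = v / beta                    # basis normalization (unit vector)
    for j in range(m):
        w = A @ V[:, j]                   # span the next Krylov vector
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:           # happy breakdown: subspace is exact
            m = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m); e1[0] = 1.0
    # small m x m exponential computed explicitly (Pade inside expm)
    return beta * V[:, :m] @ (expm(H[:m, :m] * t) @ e1)
```

When m equals the full dimension the projection is exact, which makes the sketch easy to validate against a dense `expm`.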
Domain-specific optimization strategies
- Well-defined structure
  - Each row has a bound on the number of nonzeros (limited by the number of reactions in the network)
- ELL sparse format [3] well-suited for vector architectures
  - Two dense n x k arrays (values and column indices)
  - Regular memory access
  - Coalesced (column-major)
  - Aligned (padding)

  \mathrm{efficiency}_{ELL} = \frac{nnz}{nk}

[3] Nathan Bell and Michael Garland, "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors", Conference on High Performance Computing Networking, Storage and Analysis, 2009
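The two-array layout can be illustrated with a small NumPy sketch. This is a CPU model of the format: on the GPU the arrays would be stored column-major so that each warp reads consecutive rows coalesced:

```python
import numpy as np

def to_ell(A_dense):
    """Pack a dense matrix into ELL: two n x k arrays (values, column
    indices) with k = max nonzeros per row; short rows are zero-padded."""
    n = A_dense.shape[0]
    rows = [np.flatnonzero(A_dense[i]) for i in range(n)]
    k = max(len(r) for r in rows)
    vals = np.zeros((n, k))
    cols = np.zeros((n, k), dtype=np.int64)   # padded slots point at column 0
    for i, r in enumerate(rows):
        vals[i, :len(r)] = A_dense[i, r]
        cols[i, :len(r)] = r
    return vals, cols

def ell_spmv(vals, cols, x):
    """y = A x over the ELL arrays, one column slot at a time (the
    slot-major order mirrors the coalesced GPU traversal)."""
    y = np.zeros(vals.shape[0])
    for j in range(vals.shape[1]):
        y += vals[:, j] * x[cols[:, j]]       # padded zeros contribute nothing
    return y

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [3.0, 0.0, 5.0]])
vals, cols = to_ell(A)
efficiency = np.count_nonzero(A) / vals.size  # nnz / (n*k) = 5/6 here
```

The `efficiency` value is exactly the nnz/(nk) ratio above: it drops whenever one long row forces padding on all the others, which is the problem the warp-grained variant later addresses.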
Domain-specific optimization strategies
- Well-defined structure
  - The diagonal is dense since

    A(x,x) = -\sum_{x' \neq x} A(x',x)

- Diagonal stored as a separate dense vector
  - Reduced column indexing
  - Readily available for the Jacobi iteration:

    x_i^{k+1} = -a_{ii}^{-1} \sum_{j \neq i} a_{ij} x_j^k
Domain-specific optimization strategies
- Well-defined structure
  - Biochemical networks often include reversible reactions
  - A DFS visit of the state space exposes a densely populated band due to long chains of reversible microstates
  - Readily available from [4]; no additional reordering needed
- ELL+DIA combined format
  - DIA provides coalesced access to x, but is not always aligned
[4] Youfang Cao and Jie Liang, "Optimal enumeration of state space of finitely buffered stochastic molecular networks and exact computation of steady state landscape probability", BMC Systems Biology, 2008
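The DIA half of the combined format can be sketched with a hypothetical NumPy model, where `data[d][i]` is assumed to hold the entry A[i, i + offsets[d]]; entries falling outside the band would stay in the ELL part:

```python
import numpy as np

def dia_spmv(offsets, data, x):
    """y = A x for diagonals stored DIA-style: data[d][i] holds the
    entry A[i, i + offsets[d]]. Sketch of the DIA half of ELL+DIA;
    the access to x is a contiguous shifted read (coalesced on a GPU)."""
    n = len(x)
    y = np.zeros(n)
    for off, diag in zip(offsets, data):
        i = np.arange(max(0, -off), min(n, n - off))  # rows this band touches
        y[i] += diag[i] * x[i + off]
    return y

# Tridiagonal band of a hypothetical reversible chain.
offsets = [-1, 0, 1]
data = [np.array([0.0, 3.0, 6.0]),    # sub-diagonal (entry 0 unused)
        np.array([1.0, 4.0, 7.0]),    # main diagonal
        np.array([2.0, 5.0, 0.0])]    # super-diagonal (entry 2 unused)
```

Because each diagonal reads `x[i + off]` for a contiguous range of i, the gather is a shifted streaming read, which is where the coalescing benefit over ELL's indirect indexing comes from.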
Warp-grained ELL format
- Based on the Sliced ELL sparse format [5]
  - More efficiency with local ELL data structures
  - Warp-grained ELL uses warp-sized slices
- Decouple the slice size s (data structure) from the block size b (CUDA block)
  - Efficiency + SM occupancy
- Local row rearrangement
  - Decreases nonzero variability within warps
  - Done locally within blocks to preserve cache locality
[5] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures", HiPEAC, 2010
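A CPU sketch of the sliced layout, with a hypothetical slice size; on the GPU, s would be the warp size (32) and each slice's arrays would be column-major. The talk's local row rearrangement (sorting rows by nonzero count within a block before slicing) is omitted for brevity:

```python
import numpy as np

def to_sliced_ell(A_dense, s=2):
    """Slice rows into groups of s; each slice gets its own width k, so
    a slice of short rows no longer pays for the longest row globally."""
    n = A_dense.shape[0]
    slices = []
    for start in range(0, n, s):
        block = A_dense[start:start + s]
        rows = [np.flatnonzero(r) for r in block]
        k = max(len(r) for r in rows)             # per-slice width
        vals = np.zeros((len(rows), k))
        cols = np.zeros((len(rows), k), dtype=np.int64)
        for i, r in enumerate(rows):
            vals[i, :len(r)] = block[i, r]
            cols[i, :len(r)] = r
        slices.append((start, vals, cols))
    return slices

def sliced_ell_spmv(slices, x):
    y = np.zeros(len(x))
    for start, vals, cols in slices:              # one slice per warp on a GPU
        for j in range(vals.shape[1]):
            y[start:start + vals.shape[0]] += vals[:, j] * x[cols[:, j]]
    return y
```

Because padding is now paid per slice rather than against the global maximum row length, the ELL efficiency ratio improves wherever row lengths vary between slices.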
Experimental results
- Experimental setup
  - GPU: Fermi architecture, [email protected], 512 CUDA cores, 3GB GDDR5 @ 192GB/s
  - CPU: quad-socket Opteron [email protected] GHz, 16 cores, 128GB [email protected]/s
  - Double-precision floating point, 48 KB L1 cache
- Benchmarks
Experimental results
- Sparse MVP with the ELL+DIA format
  - Block size b=256 (chosen by exhaustive testing)
  - The 48KB L1 cache configuration gives a 6% average improvement over 16KB
  - Best speedup on benchmarks with a dense diagonal band
Experimental results
- Sparse MVP with the warp-grained ELL format
  - Warp-grained ELL gives an 8% improvement over ELL and 6% over Sliced ELL
  - Average memory footprint reduced from 440.98 MB (ELL) to 322.45 MB, slightly better than CSR
  - Global reordering decreases average performance by 6%
[6] Bor-Yiing Su and Kurt Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs", ICS, 2012
Experimental results
- Steady-state probability (Jacobi solver)
  - Compared to a multicore implementation derived from the Intel MKL library
  - Error tolerance set to \varepsilon = 10^{-8}
Experimental results
- Transient-dynamics probability (Arnoldi method)
  - Speedup for the Arnoldi method is slightly lower due to additional vector routines
Conclusions
- We have presented a GPU-based implementation of steady-state (15.67x speedup) and transient-dynamics (12.60x) solvers for studying biochemical reaction networks
- The Kepler architecture should bring further benefits in terms of memory hierarchy (bandwidth and read-only cache)
- GPU clusters can virtually extend the approach to any realistic network
  - A new direction for biological computing, comparable to molecular dynamics simulation
- The work is generalizable to any large Markov model
- Analysis of the dynamic landscape to identify rare events