Using multi-core algorithms to speed up optimization
Gary K. Chen
Biostat Noon Seminar
March 23, 2011
An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks
CPUs are not getting any faster
- Heat and power are the sole obstacles
- According to Intel: underclock a single core by 20 percent and you save half the power while sacrificing only 13 percent of the performance
- Implication? Two cores at the same power deliver roughly 74% more performance: (100 - 13) x 2 / 100 = 1.74
1. High-performance computing clusters
- Coarse-grained, aka "embarrassingly parallel", problems
  1. Launch multiple instances of the program
  2. Compute summary statistics across log files
- Examples
  - Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
- Remarks
  - Pros: maximizes throughput (CPUs are kept busy); gentle learning curve
  - Cons: doesn't address some interesting computational problems
Cluster Resource Example
- HPCC at USC
  - A 94-teraflop cluster
  - 1,980 simultaneous processes running on the main queue
  - Jobs are asynchronous; they can start and end in any order
- Portable Batch System
  - Simply prepend some headers to your shell script describing how much memory you want, how long your job will run, etc.
2. High-performance computing clusters
- Tightly-coupled parallel programs
- Message Passing Interface (MPI)
  1. Programs are distributed across multiple physical hosts
  2. Each program executes the exact same code
  3. All processes can be synchronized at strategic points
- Remarks
  - Pro: can run interesting algorithms such as parallel-tempered MCMC
  - Con: the developer is responsible for establishing a communication protocol
Exploiting multiple-core processors
- Fine-grained parallelism
  - Suggests a much higher degree of inter-dependence between processes
  - A "master" process executes the majority of the code base; "slave" processes are invoked to ease bottlenecks
  - We hope to minimize the time spent in the master process
- Some Bayesian algorithms stand to benefit
Amdahl's Law

$\text{Speedup} = \dfrac{1}{(1 - P) + \frac{P}{N}}$
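Here P is the fraction of the program that can run in parallel and N is the number of processors; even with unlimited processors, the serial fraction (1 - P) caps the achievable speedup. A worked example under an assumed parallel fraction P = 0.9:

```latex
% Amdahl's Law with an assumed parallel fraction P = 0.9
S(N)      = \frac{1}{(1-P) + P/N}
S(8)      = \frac{1}{0.1 + 0.9/8} \approx 4.7
S(\infty) = \frac{1}{0.1} = 10
```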
Heterogeneous Computing
Multi-core programming
- aka data-parallel programming
- Built in to common compilers (e.g. gcc)
  - Very easy to get started!
  - SSE, or Streaming SIMD Extensions: each core can do vector operations
  - OpenMP: parallel processing across multiple cores
    - e.g. simply insert a "#pragma omp for" directive and compile with gcc (see the sketch below)
- CUDA/OpenCL
  - CUDA is a proprietary C-based language endorsed by nVidia
  - OpenCL: a standards-based implementation backed by the Khronos Group
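As a concrete illustration of the OpenMP directive mentioned above, here is a minimal sketch (not from the original slides; the loop and array names are hypothetical):

```c
/* Sketch: one directive parallelizes the loop across cores.
   Compile with: gcc -fopenmp example.c */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* Iterations are distributed across the available cores */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += 3.0 * x[i];

    printf("y[42] = %f\n", y[42]);
    return 0;
}
```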
OpenCL and CUDA
- CUDA
  - Powerful libraries available to enrich productivity
  - Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
  - Supported only on nVidia GPU devices
- OpenCL
  - Compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs
  - Lags behind CUDA in libraries and tools
  - Good to work with, given that ATI hardware currently leads in value
A $60 HPC under your desk
Concepts
Threads and threadblocks
- Threads:
  - Perform a very limited function, but do all the heavy lifting
  - Are extremely lightweight, so you'll want to launch thousands
- Threadblocks:
  - The developer assigns threads that can cooperate on a common task into threadblocks
  - Threadblocks cannot communicate with one another and run in any order (asynchronously)
Thread organization
Memory hierarchy
Kernels
- Warps/Wavefronts:
  - An atomic set of threads (32 for nVidia, 64 for ATI)
  - Instructions are executed in lock step across the set, each thread processing a distinct data element
  - The developer is responsible for synchronizing across warps
- Kernels:
  - Code that the developer writes, which executes on a SIMD device
  - Essentially C functions (see the sketch below)
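A minimal kernel sketch to make these concepts concrete (illustrative only; the function and array names are hypothetical). Each thread processes one element, and the grid/block shape is chosen at launch:

```cuda
// Sketch: each thread squares one element of a vector.
__global__ void square(const float *in, float *out, int n) {
    // Global index = threadblock offset + position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the grid may overshoot n
        out[i] = in[i] * in[i];
}

// Host-side launch sketch: 256 threads per threadblock, enough
// blocks to cover all n elements.
// square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```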
Example 1: Hidden Markov Model Training
- Hidden Markov Models
  - A staple in machine learning
  - Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring)
Application to cancer tumor data
- Extending PennCNV
  - Tissues are assumed to be a mixture of tumor and normal cells
  - Tumors are assumed to be heterogeneous in copy number across cells, implying fractional copy-number states
  - PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state
  - We can make more precise estimates of both copy number and allelic state in tumors with little sacrifice in performance
- Copy number: $z = (1 - \alpha)\, z_{normal} + \alpha\, z_{tumor}$
  - $z$ is fractional, whereas $z_{tumor} = I(z \le 2)\,\lfloor z \rfloor + I(z > 2)\,\lceil z \rceil$
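A quick worked example of the mixture formula (illustrative numbers only): with a normal copy number z_normal = 2, an integer tumor state z_tumor = 3, and an assumed tumor fraction alpha = 0.3, the observed copy number is fractional:

```latex
z = (1 - \alpha)\, z_{normal} + \alpha\, z_{tumor}
  = 0.7 \times 2 + 0.3 \times 3
  = 2.3
```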
State Space

state  CNfrac  BACnormal  CNtumor  BACtumor
0      2       0          2        0
1      2       1          2        1
2      2       2          2        2
3      0       0          0        0
4      0       1          0        0
5      0       2          0        0
6      0.5     0          0        0
7      0.5     1          0        0
8      0.5     2          0        0
9      1       0          1        0
10     1       1          1        0
11     1       1          1        1
12     1       2          1        1
13     1.5     0          1        0
14     1.5     1          1        0
15     1.5     1          1        1
16     1.5     2          1        1
17     2.5     0          3        0
18     2.5     1          3        1
19     2.5     1          3        2
20     2.5     2          3        3
21     3       0          4        0
22     3       1          4        1
23     3       1          4        2
24     3       1          4        3
25     3       2          4        4
26     3.5     0          4        0
27     3.5     1          4        1
28     3.5     1          4        2
29     3.5     1          4        3
30     3.5     2          4        4
Training a Hidden Markov Model
- Objective: infer the probabilities of transitioning between any pair of states
- Apply the forward-backward and Baum-Welch algorithms
  - A special case of the Expectation-Maximization (or, more generally, MM) family of algorithms
  - Expectation step: forward-backward computes posterior probabilities based on the estimated parameters
  - Maximization step: Baum-Welch empirically estimates parameters by averaging across observations (the standard update is sketched below)
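For concreteness, the standard Baum-Welch re-estimation step in common notation (general background, not taken from the slides): with forward probabilities f_t, backward probabilities b_t, transition matrix T, and emission probabilities O,

```latex
% E step: posterior state and transition probabilities
\gamma_t(j) \propto f_t(j)\, b_t(j)
\xi_t(j,k)  \propto f_t(j)\, T_{jk}\, O_{t+1}(k)\, b_{t+1}(k)
% M step: transition probabilities averaged across observations
\hat{T}_{jk} = \frac{\sum_t \xi_t(j,k)}{\sum_t \gamma_t(j)}
```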
- Forward algorithm
  - We compute the probability vector at observation t: $f_{0:t} = f_{0:t-1} T O_t$
  - Each state (element of the m-state vector) can independently compute a sum-product
  - Threadblocks map to states
  - Threads calculate the products in parallel, followed by a log2(m) addition reduction (see the kernel sketch below)
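A sketch of one forward step as a kernel, following the mapping above (threadblock = destination state, thread = source state; names are hypothetical, and m is assumed to be a power of two so the reduction halves cleanly):

```cuda
// Sketch: one step of the forward recursion, f_t = (f_{t-1} T) * O_t.
// Launch with m threadblocks (one per destination state), m threads
// each, and m * sizeof(float) bytes of shared memory.
__global__ void forward_step(const float *f_prev,  // f_{t-1}, length m
                             const float *T,       // m x m transitions
                             const float *O_t,     // emissions at t
                             float *f_curr, int m) {
    extern __shared__ float prod[];
    int j = blockIdx.x;    // destination state
    int i = threadIdx.x;   // source state
    prod[i] = f_prev[i] * T[i * m + j];   // products in parallel
    __syncthreads();

    // log2(m) addition reduction: sum over source states
    for (int s = m / 2; s > 0; s >>= 1) {
        if (i < s) prod[i] += prod[i + s];
        __syncthreads();
    }
    if (i == 0) f_curr[j] = prod[0] * O_t[j];
}
```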
Gridblock of threadblocks
Speedups
- We implement 8 kernels. Examples:
  - Re-scaling the transition matrix (for SNP spacing)
    - Serial: O(2nm^2); Parallel: O(n)
  - Forward-backward
    - Serial: O(2nm^2); Parallel: O(n log2(m))
  - Normalizing constant (Baum-Welch)
    - Serial: O(nm); Parallel: O(log2(n))
  - MLE of transition matrix (Baum-Welch)
    - Serial: O(nm^2); Parallel: O(n)
Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

states   CPU      GPU      fold-speedup
128      9.5m     37s      15x
512      2h 35m   1m 44s   108x
Example 2: Regularized Logistic Regression
Regularized Regression
- Variable selection
  - For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE)
  - However, it is preferable to model all variables simultaneously in order to tease apart correlated variables
  - This is problematic when p > n: parameters are unestimable, and matrix inversion becomes computationally intractable
Regularized Regression
- The LASSO method (Tibshirani, 1996)
  - Seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
  - Fundamentally solves the variable-selection problem by introducing an L1 penalty to induce sparsity
  - Limitation: provides no mechanism for hypothesis testing (e.g. p-values)
Regularized Regression
- Bayesian methods
  - Posterior inferences on β
  - e.g. Bayesian LASSO, Bayesian Elastic Net
  - Highly computational; scaling up to the genome-wide level is not obvious
  - MCMC is inherently serial, so the best option is to speed up the sampling chain
  - Proposal: implement the key bottleneck, fitting β_LASSO to the data, on the GPU
Optimization
- For binomial logistic regression:
  - $L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
  - $p_i = \dfrac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}}$
  - $\nabla L(\beta) = \sum_{i=1}^{n} [y_i - p_i(\beta)]\, x_i$
  - $-d^2 L(\beta) = \sum_{i=1}^{n} p_i(\beta) [1 - p_i(\beta)]\, x_i x_i^t$
- For *penalized* regression:
  - $f(\beta) = L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$
- Find the global maximum by applying Newton-Raphson one variable at a time:
  - $\beta_j^{m+1} = \beta_j^m + \dfrac{\sum_{i=1}^{n} [y_i - p_i(\beta^m)]\, x_{ij} - \lambda\, \mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^{n} p_i(\beta^m) [1 - p_i(\beta^m)]\, x_{ij}^2}$
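A serial sketch of this one-variable update (plain C; all names are hypothetical). This is the inner computation that the GPU version parallelizes over subjects:

```c
/* Sketch: coordinate-wise Newton-Raphson update for variable j.
   X is n x p, row-major; p_hat[i] holds the current fitted p_i(beta). */
double update_beta_j(const double *X, const double *y,
                     const double *p_hat, double beta_j,
                     double lambda, int n, int p, int j) {
    double grad = 0.0, hess = 0.0;
    for (int i = 0; i < n; i++) {
        double xij = X[i * p + j];
        grad += (y[i] - p_hat[i]) * xij;                  /* score */
        hess += p_hat[i] * (1.0 - p_hat[i]) * xij * xij;  /* -d2L  */
    }
    double sgn = (beta_j > 0) - (beta_j < 0);
    return beta_j + (grad - lambda * sgn) / hess;         /* Newton step */
}
```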
Overview of algorithm
- Newton-Raphson kernel
  - Each threadblock maps to a block of 512 subjects (threads) for 1 variable
  - Each thread calculates its subject's contribution to the gradient and Hessian
  - Sum (reduction) across the 512 subjects
  - Sum (reduction) across subject blocks in a new kernel
- Compute the log-likelihood change for each variable (as above)
- Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood (see the sketch below)
- Iterate until the likelihood increase is less than epsilon
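The max-operator step could look like the following argmax tree reduction (a sketch; the per-variable likelihood-change buffer delta_ll is a hypothetical name):

```cuda
#include <float.h>

// Sketch: log2 reduction that selects the variable with the greatest
// likelihood gain. One threadblock of BLOCK threads scans
// delta_ll[0..p-1] in strides, then reduces in shared memory.
#define BLOCK 512

__global__ void argmax_delta(const float *delta_ll, int p,
                             int *best_j, float *best_val) {
    __shared__ float val[BLOCK];
    __shared__ int   idx[BLOCK];
    int t = threadIdx.x;

    float v = -FLT_MAX; int jbest = -1;
    for (int j = t; j < p; j += BLOCK)        // strided scan
        if (delta_ll[j] > v) { v = delta_ll[j]; jbest = j; }
    val[t] = v; idx[t] = jbest;
    __syncthreads();

    for (int s = BLOCK / 2; s > 0; s >>= 1) { // log2(BLOCK) steps
        if (t < s && val[t + s] > val[t]) {
            val[t] = val[t + s]; idx[t] = idx[t + s];
        }
        __syncthreads();
    }
    if (t == 0) { *best_j = idx[0]; *best_val = val[0]; }
}
```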
Gridblock of threadblocks
Consideration of datatypes
- Need to compress genotypes
  - Why? Global memory is scarce and bandwidth is expensive
  - A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory (see the sketch below)
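The arithmetic implies 2 bits per genotype (32 words x 32 bits = 1,024 bits = 512 two-bit genotypes). A sketch of the unpacking (the packing order is an assumption):

```c
#include <stdint.h>

/* Sketch: extract genotype k (0..15) from a 32-bit word holding
   16 two-bit genotypes (0, 1, 2, with 3 = missing), packed from
   the low-order bits upward. */
static inline int genotype_at(uint32_t word, int k) {
    return (int)((word >> (2 * k)) & 0x3u);
}
```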
Distributed GPU implementation
- For really large dimensions, we can link up an arbitrary number of GPUs
  - MPI allows us to spread the work across a cluster
  - Developed on Epigraph: 2 Tesla C2050s
- Approach
  - The MPI master node delegates the heavy lifting to slaves across the network
  - The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ)
  - To minimize network traffic, slaves maintain up-to-date copies of the data structures (see the sketch below)
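A sketch of that division of labor (hypothetical structure; the author's actual protocol is not shown in the slides). Because every rank keeps a full replica, only a tiny summary has to cross the network each sweep:

```c
#include <mpi.h>

/* value + index pair for MPI_MAXLOC */
typedef struct { double delta_ll; int j; } Candidate;

void lasso_sweep_step(void) {
    Candidate local, best;

    /* 1. Each slave's GPU proposes its best update (variable index j,
          likelihood gain delta_ll) over its slice of the variables. */
    local.delta_ll = 0.0;   /* placeholder: filled in by GPU kernels */
    local.j = -1;

    /* 2. One small collective finds the globally best variable:
          MPI_MAXLOC compares the double and carries the index along. */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT,
                  MPI_MAXLOC, MPI_COMM_WORLD);

    /* 3. Every rank applies the winning update to its local replica
          of beta and the fitted values, keeping all copies in sync. */
}
```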
Evaluation on a large dataset
- GWAS data
  - 6,806 African American subjects in a case-control study of prostate cancer
  - 1,047,986 SNPs typed
- Elapsed walltime for 1 LASSO iteration (a sweep across all variables)
  - 15 minutes for an optimized serial implementation across 2 slave CPUs
  - 5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
  - A 155x speedup
Closing remarks
Conclusion
- Multicore programming is not a panacea
  - Insufficient parallelism leads to an inferior implementation
  - Graph algorithms *generally* do not map well to SIMD architectures
- Programming effort
  - Expect to spend at least 90% of your time debugging a black box
  - Is it worth it? Is human time worth more than computer time?
  - For generic problems (matrix multiplication, sorting), absolutely
  - OpenCL is a bit more verbose than CUDA, but is more portable
Potential Future Work
- Reconstructing Bayesian networks
  - Compute the joint probability of each possible topology
  - Code the graph as a sparse adjacency matrix
- Approximate Bayesian Computation
  - Sample θ from some assumed prior distribution
  - Generate a dataset conditional on θ
  - Examine how close the simulated data are to the real data
Tomorrow's clusters will require heterogeneous programming
Tianhe-1A
- World's fastest supercomputer
  - 4.7 petaflops (quadrillion floating-point operations per second)
  - 14,336 Xeon CPUs, 7,168 Tesla M2050s
- According to nVidia
  - A CPU-only equivalent would need 50,000 CPUs and twice the floor space
  - CPU-only: 12 megawatts, compared to 4.04 megawatts
  - $88 million to build, $20 million in annual energy costs
Thanks to
- Kai: ideas for CNV analysis
- Duncan, Wei: discussions on LASSO
- Tim, Zack: access to Epigraph
- Alex, James: lively HPC discussions/debates