Using multi-core algorithms to speed up optimization
Gary K. Chen
Biostat Noon Seminar
March 23, 2011
An outline
Introduction to high-performance computing
Concepts
Example 1: Hidden Markov Model Training
Example 2: Regularized Logistic Regression
Closing remarks
CPUs are not getting any faster
- Heat and power are the sole obstacles
- According to Intel: underclock a single core by 20 percent and you save half the power while sacrificing only 13 percent of the performance
- Implication? Two cores at the same power deliver roughly 74% more performance: (100 - 13) x 2 / 100 = 1.74
1. High-performance computing clusters
- Coarse-grained, aka "embarrassingly parallel", problems
  1. Launch multiple instances of the program
  2. Compute summary statistics across log files
- Examples
  - Monte Carlo simulations (power/specificity), GWAS scans, imputation, etc.
- Remarks
  - Pros: maximizes throughput (CPUs are kept busy); gentle learning curve
  - Cons: doesn't address some interesting computational problems
Cluster Resource Example
- HPCC at USC
  - A 94-teraflop cluster
  - 1,980 simultaneous processes running on the main queue
  - Jobs are asynchronous; they can start and end in any order
- Portable Batch System
  - Simply prepend some headers to your shell script describing how much memory you want, how long your job will run, etc.
2. High-performance computing clusters
- Tightly-coupled parallel programs
- Message Passing Interface (MPI)
  1. Programs are distributed across multiple physical hosts
  2. Each program executes the exact same code
  3. All processes can be synchronized at strategic points
- Remarks
  - Pro: can run interesting algorithms such as parallel-tempered MCMC
  - Con: the developer is responsible for establishing a communication protocol
Exploiting multiple-core processors
- Fine-grained parallelism
  - Suggests a much higher degree of inter-dependence between processes
  - A "master" process executes the majority of the code base; "slave" processes are invoked to ease bottlenecks
  - We hope to minimize the time spent in the master process
- Some Bayesian algorithms stand to benefit
Amdahl's Law

$\text{Speedup} = \dfrac{1}{(1 - P) + \frac{P}{N}}$
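Here P is the fraction of the program that can run in parallel and N is the number of processors; even with unlimited processors, the serial fraction (1 - P) caps the achievable speedup. A worked example under an assumed parallel fraction P = 0.9:

```latex
% Amdahl's Law with an assumed parallel fraction P = 0.9
S(N)      = \frac{1}{(1-P) + P/N}
S(8)      = \frac{1}{0.1 + 0.9/8} \approx 4.7
S(\infty) = \frac{1}{0.1} = 10
```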
Heterogeneous Computing
Multi-core programming
- aka data-parallel programming
- Built in to common compilers (e.g. gcc)
  - Very easy to get started!
  - SSE, or Streaming SIMD Extensions: each core can do vector operations
  - OpenMP: parallel processing across multiple cores
    - e.g. simply insert a "#pragma omp for" directive and compile with gcc (see the sketch below)
- CUDA/OpenCL
  - CUDA is a proprietary C-based language endorsed by nVidia
  - OpenCL: a standards-based implementation backed by the Khronos Group
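As a concrete illustration of the OpenMP directive mentioned above, here is a minimal sketch (not from the original slides; the loop and array names are hypothetical):

```c
/* Sketch: one directive parallelizes the loop across cores.
   Compile with: gcc -fopenmp example.c */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* Iterations are distributed across the available cores */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += 3.0 * x[i];

    printf("y[42] = %f\n", y[42]);
    return 0;
}
```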
OpenCL and CUDA
- CUDA
  - Powerful libraries available to enrich productivity
  - Thrust: C++ generics; cuBLAS: Level 1 and 2 parallel BLAS
  - Supported only on nVidia GPU devices
- OpenCL
  - Compatible with nVidia and ATI GPU devices, as well as AMD/Intel CPUs
  - Lags behind CUDA in libraries and tools
  - Good to work with, given that ATI hardware currently leads in value
A $60 HPC under your desk
Concepts
Threads and threadblocks
- Threads:
  - Perform a very limited function, but do all the heavy lifting
  - Are extremely lightweight, so you'll want to launch thousands
- Threadblocks:
  - The developer assigns threads that can cooperate on a common task into threadblocks
  - Threadblocks cannot communicate with one another and run in any order (asynchronously)
Thread organization
Memory hierarchy
Kernels
- Warps/Wavefronts:
  - An atomic set of threads (32 for nVidia, 64 for ATI)
  - Instructions are executed in lock step across the set, each thread processing a distinct data element
  - The developer is responsible for synchronizing across warps
- Kernels:
  - Code that the developer writes, which executes on a SIMD device
  - Essentially C functions (see the sketch below)
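A minimal kernel sketch to make these concepts concrete (illustrative only; the function and array names are hypothetical). Each thread processes one element, and the grid/block shape is chosen at launch:

```cuda
// Sketch: each thread squares one element of a vector.
__global__ void square(const float *in, float *out, int n) {
    // Global index = threadblock offset + position within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the grid may overshoot n
        out[i] = in[i] * in[i];
}

// Host-side launch sketch: 256 threads per threadblock, enough
// blocks to cover all n elements.
// square<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```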
Example 1: Hidden Markov Model Training
- Hidden Markov Models
  - A staple in machine learning
  - Many applications in statistical genetics, including imputation of untyped genotypes, local ancestry, and sequence alignment (e.g. protein family scoring)
Application to cancer tumor data
- Extending PennCNV
  - Tissues are assumed to be a mixture of tumor and normal cells
  - Tumors are assumed to be heterogeneous in copy number across cells, implying fractional copy-number states
  - PennCNV defines 6 hidden integer states for normal cells and does not infer allelic state
  - We can make more precise estimates of both copy number and allelic state in tumors with little sacrifice in performance
- Copy number: $z = (1 - \alpha)\, z_{normal} + \alpha\, z_{tumor}$
  - $z$ is fractional, whereas $z_{tumor} = I(z \le 2)\,\lfloor z \rfloor + I(z > 2)\,\lceil z \rceil$
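A quick worked example of the mixture formula (illustrative numbers only): with a normal copy number z_normal = 2, an integer tumor state z_tumor = 3, and an assumed tumor fraction alpha = 0.3, the observed copy number is fractional:

```latex
z = (1 - \alpha)\, z_{normal} + \alpha\, z_{tumor}
  = 0.7 \times 2 + 0.3 \times 3
  = 2.3
```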
State Space

state  CNfrac  BACnormal  CNtumor  BACtumor
0      2       0          2        0
1      2       1          2        1
2      2       2          2        2
3      0       0          0        0
4      0       1          0        0
5      0       2          0        0
6      0.5     0          0        0
7      0.5     1          0        0
8      0.5     2          0        0
9      1       0          1        0
10     1       1          1        0
11     1       1          1        1
12     1       2          1        1
13     1.5     0          1        0
14     1.5     1          1        0
15     1.5     1          1        1
16     1.5     2          1        1
17     2.5     0          3        0
18     2.5     1          3        1
19     2.5     1          3        2
20     2.5     2          3        3
21     3       0          4        0
22     3       1          4        1
23     3       1          4        2
24     3       1          4        3
25     3       2          4        4
26     3.5     0          4        0
27     3.5     1          4        1
28     3.5     1          4        2
29     3.5     1          4        3
30     3.5     2          4        4
Training a Hidden Markov Model
- Objective: infer the probabilities of transitioning between any pair of states
- Apply the forward-backward and Baum-Welch algorithms
  - A special case of the Expectation-Maximization (or, more generally, MM) family of algorithms
  - Expectation step: forward-backward computes posterior probabilities based on the estimated parameters
  - Maximization step: Baum-Welch empirically estimates parameters by averaging across observations (the standard update is sketched below)
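For concreteness, the standard Baum-Welch re-estimation step in common notation (general background, not taken from the slides): with forward probabilities f_t, backward probabilities b_t, transition matrix T, and emission probabilities O,

```latex
% E step: posterior state and transition probabilities
\gamma_t(j) \propto f_t(j)\, b_t(j)
\xi_t(j,k)  \propto f_t(j)\, T_{jk}\, O_{t+1}(k)\, b_{t+1}(k)
% M step: transition probabilities averaged across observations
\hat{T}_{jk} = \frac{\sum_t \xi_t(j,k)}{\sum_t \gamma_t(j)}
```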
- Forward algorithm
  - We compute the probability vector at observation t: $f_{0:t} = f_{0:t-1} T O_t$
  - Each state (element of the m-state vector) can independently compute a sum-product
  - Threadblocks map to states
  - Threads calculate the products in parallel, followed by a log2(m) addition reduction (see the kernel sketch below)
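A sketch of one forward step as a kernel, following the mapping above (threadblock = destination state, thread = source state; names are hypothetical, and m is assumed to be a power of two so the reduction halves cleanly):

```cuda
// Sketch: one step of the forward recursion, f_t = (f_{t-1} T) * O_t.
// Launch with m threadblocks (one per destination state), m threads
// each, and m * sizeof(float) bytes of shared memory.
__global__ void forward_step(const float *f_prev,  // f_{t-1}, length m
                             const float *T,       // m x m transitions
                             const float *O_t,     // emissions at t
                             float *f_curr, int m) {
    extern __shared__ float prod[];
    int j = blockIdx.x;    // destination state
    int i = threadIdx.x;   // source state
    prod[i] = f_prev[i] * T[i * m + j];   // products in parallel
    __syncthreads();

    // log2(m) addition reduction: sum over source states
    for (int s = m / 2; s > 0; s >>= 1) {
        if (i < s) prod[i] += prod[i + s];
        __syncthreads();
    }
    if (i == 0) f_curr[j] = prod[0] * O_t[j];
}
```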
Gridblock of threadblocks
Speedups
- We implement 8 kernels. Examples:
  - Re-scaling the transition matrix (for SNP spacing)
    - Serial: O(2nm^2); Parallel: O(n)
  - Forward-backward
    - Serial: O(2nm^2); Parallel: O(n log2(m))
  - Normalizing constant (Baum-Welch)
    - Serial: O(nm); Parallel: O(log2(n))
  - MLE of transition matrix (Baum-Welch)
    - Serial: O(nm^2); Parallel: O(n)
Run time comparison

Table: 1 iteration of HMM training on Chr 1 (41,263 SNPs)

states   CPU      GPU      fold-speedup
128      9.5m     37s      15x
512      2h 35m   1m 44s   108x
Example 2: Regularized Logistic Regression
Regularized Regression
- Variable selection
  - For tractability, most GWAS analyses entail separate univariate tests of each variable (e.g. SNP, GxG, GxE)
  - However, it is preferable to model all variables simultaneously in order to tease apart correlated variables
  - This is problematic when p > n: parameters are unestimable, and matrix inversion becomes computationally intractable
Regularized Regression
- The LASSO method (Tibshirani, 1996)
  - Seeded a cottage industry of related methods, e.g. Group LASSO, Elastic Net, MCP, NEG, Overlap LASSO, Graph LASSO
  - Fundamentally solves the variable-selection problem by introducing an L1 penalty to induce sparsity
  - Limitation: provides no mechanism for hypothesis testing (e.g. p-values)
Regularized Regression
- Bayesian methods
  - Posterior inferences on β
  - e.g. Bayesian LASSO, Bayesian Elastic Net
  - Highly computational; scaling up to the genome-wide level is not obvious
  - MCMC is inherently serial, so the best option is to speed up the sampling chain
  - Proposal: implement the key bottleneck, fitting β_LASSO to the data, on the GPU
Optimization
- For binomial logistic regression:
  - $L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$
  - $p_i = \dfrac{e^{\mu + x_i^t \beta}}{1 + e^{\mu + x_i^t \beta}}$
  - $\nabla L(\beta) = \sum_{i=1}^{n} [y_i - p_i(\beta)]\, x_i$
  - $-d^2 L(\beta) = \sum_{i=1}^{n} p_i(\beta) [1 - p_i(\beta)]\, x_i x_i^t$
- For *penalized* regression:
  - $f(\beta) = L(\beta) - \lambda \sum_{j=1}^{p} |\beta_j|$
- Find the global maximum by applying Newton-Raphson one variable at a time:
  - $\beta_j^{m+1} = \beta_j^m + \dfrac{\sum_{i=1}^{n} [y_i - p_i(\beta^m)]\, x_{ij} - \lambda\, \mathrm{sgn}(\beta_j^m)}{\sum_{i=1}^{n} p_i(\beta^m) [1 - p_i(\beta^m)]\, x_{ij}^2}$
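A serial sketch of this one-variable update (plain C; all names are hypothetical). This is the inner computation that the GPU version parallelizes over subjects:

```c
/* Sketch: coordinate-wise Newton-Raphson update for variable j.
   X is n x p, row-major; p_hat[i] holds the current fitted p_i(beta). */
double update_beta_j(const double *X, const double *y,
                     const double *p_hat, double beta_j,
                     double lambda, int n, int p, int j) {
    double grad = 0.0, hess = 0.0;
    for (int i = 0; i < n; i++) {
        double xij = X[i * p + j];
        grad += (y[i] - p_hat[i]) * xij;                  /* score */
        hess += p_hat[i] * (1.0 - p_hat[i]) * xij * xij;  /* -d2L  */
    }
    double sgn = (beta_j > 0) - (beta_j < 0);
    return beta_j + (grad - lambda * sgn) / hess;         /* Newton step */
}
```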
Overview of algorithm
- Newton-Raphson kernel
  - Each threadblock maps to a block of 512 subjects (threads) for 1 variable
  - Each thread calculates its subject's contribution to the gradient and Hessian
  - Sum (reduction) across the 512 subjects
  - Sum (reduction) across subject blocks in a new kernel
- Compute the log-likelihood change for each variable (as above)
- Apply a max operator (log2 reduction) to select the variable with the greatest contribution to the likelihood (see the sketch below)
- Iterate until the likelihood increase is less than epsilon
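The max-operator step could look like the following argmax tree reduction (a sketch; the per-variable likelihood-change buffer delta_ll is a hypothetical name):

```cuda
#include <float.h>

// Sketch: log2 reduction that selects the variable with the greatest
// likelihood gain. One threadblock of BLOCK threads scans
// delta_ll[0..p-1] in strides, then reduces in shared memory.
#define BLOCK 512

__global__ void argmax_delta(const float *delta_ll, int p,
                             int *best_j, float *best_val) {
    __shared__ float val[BLOCK];
    __shared__ int   idx[BLOCK];
    int t = threadIdx.x;

    float v = -FLT_MAX; int jbest = -1;
    for (int j = t; j < p; j += BLOCK)        // strided scan
        if (delta_ll[j] > v) { v = delta_ll[j]; jbest = j; }
    val[t] = v; idx[t] = jbest;
    __syncthreads();

    for (int s = BLOCK / 2; s > 0; s >>= 1) { // log2(BLOCK) steps
        if (t < s && val[t + s] > val[t]) {
            val[t] = val[t + s]; idx[t] = idx[t + s];
        }
        __syncthreads();
    }
    if (t == 0) { *best_j = idx[0]; *best_val = val[0]; }
}
```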
Gridblock of threadblocks
Consideration of datatypes
- Need to compress genotypes
  - Why? Global memory is scarce and bandwidth is expensive
  - A warp of 32 threads loads 32 words (containing 512 genotypes) into local memory (see the sketch below)
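The arithmetic implies 2 bits per genotype (32 words x 32 bits = 1,024 bits = 512 two-bit genotypes). A sketch of the unpacking (the packing order is an assumption):

```c
#include <stdint.h>

/* Sketch: extract genotype k (0..15) from a 32-bit word holding
   16 two-bit genotypes (0, 1, 2, with 3 = missing), packed from
   the low-order bits upward. */
static inline int genotype_at(uint32_t word, int k) {
    return (int)((word >> (2 * k)) & 0x3u);
}
```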
Distributed GPU implementation
- For really large dimensions, we can link up an arbitrary number of GPUs
  - MPI allows us to spread the work across a cluster
  - Developed on Epigraph: 2 Tesla C2050s
- Approach
  - The MPI master node delegates the heavy lifting to slaves across the network
  - The master node performs fast serial code, such as sampling from the full conditional likelihood of any penalty parameter (e.g. λ)
  - To minimize network traffic, slaves maintain up-to-date copies of the data structures (see the sketch below)
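A sketch of that division of labor (hypothetical structure; the author's actual protocol is not shown in the slides). Because every rank keeps a full replica, only a tiny summary has to cross the network each sweep:

```c
#include <mpi.h>

/* value + index pair for MPI_MAXLOC */
typedef struct { double delta_ll; int j; } Candidate;

void lasso_sweep_step(void) {
    Candidate local, best;

    /* 1. Each slave's GPU proposes its best update (variable index j,
          likelihood gain delta_ll) over its slice of the variables. */
    local.delta_ll = 0.0;   /* placeholder: filled in by GPU kernels */
    local.j = -1;

    /* 2. One small collective finds the globally best variable:
          MPI_MAXLOC compares the double and carries the index along. */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT,
                  MPI_MAXLOC, MPI_COMM_WORLD);

    /* 3. Every rank applies the winning update to its local replica
          of beta and the fitted values, keeping all copies in sync. */
}
```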
Evaluation on a large dataset
- GWAS data
  - 6,806 African American subjects in a case-control study of prostate cancer
  - 1,047,986 SNPs typed
- Elapsed walltime for 1 LASSO iteration (a sweep across all variables)
  - 15 minutes for an optimized serial implementation across 2 slave CPUs
  - 5.8 seconds for the parallel implementation across 2 nVidia Tesla C2050 GPU devices
  - A 155x speedup
Closing remarks
Conclusion
- Multicore programming is not a panacea
  - Insufficient parallelism leads to an inferior implementation
  - Graph algorithms *generally* do not map well to SIMD architectures
- Programming effort
  - Expect to spend at least 90% of your time debugging a black box
  - Is it worth it? Is human time worth more than computer time?
  - For generic problems (matrix multiplication, sorting), absolutely
  - OpenCL is a bit more verbose than CUDA, but is more portable
Potential Future Work
- Reconstructing Bayesian networks
  - Compute the joint probability of each possible topology
  - Code the graph as a sparse adjacency matrix
- Approximate Bayesian Computation
  - Sample θ from some assumed prior distribution
  - Generate a dataset conditional on θ
  - Examine how close the simulated data are to the real data
Tomorrow's clusters will require heterogeneous programming
Tianhe-1A
- World's fastest supercomputer
  - 4.7 petaflops (quadrillion floating-point operations per second)
  - 14,336 Xeon CPUs, 7,168 Tesla M2050s
- According to nVidia
  - A CPU-only equivalent would need 50,000 CPUs and twice the floor space
  - CPU-only: 12 megawatts, compared to 4.04 megawatts
  - $88 million to build, $20 million in annual energy costs
Thanks to
- Kai: ideas for CNV analysis
- Duncan, Wei: discussions on LASSO
- Tim, Zack: access to Epigraph
- Alex, James: lively HPC discussions/debates