
Parallel Computing Approaches & Applications Arthur Asuncion April 15, 2008


Page 1:

Parallel Computing Approaches & Applications

Arthur Asuncion

April 15, 2008

Page 2:

Roadmap

Brief Overview of Parallel Computing

U. Maryland work: PRAM prototype, XMT programming model

Current Standards: MPI, OpenMP

Parallel Algorithms for Bayesian Networks, Gibbs Sampling

Page 3:

Why Parallel Computing?

Moore’s law will eventually end.

Processors are becoming cheaper.

Parallel computing provides significant time and memory savings!

Page 4:

Parallel Computing

Goal is to maximize efficiency / speedup:
Efficiency = T_seq / (P * T_par) < 1
Speedup = T_seq / T_par < P

In practice, time savings are substantial, assuming communication costs are low and processor idle time is minimized.
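A quick worked example with hypothetical numbers: if T_seq = 100 s and P = 64 processors bring the parallel time down to T_par = 2 s, then Speedup = 100 / 2 = 50 (below the ideal of 64) and Efficiency = 100 / (64 * 2) ≈ 0.78.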

Orthogonal to: advancements in processor speeds; code optimization and data-structure techniques

Page 5:

Some issues to consider

Implicit vs. Explicit Parallelization
Distributed vs. Shared Memory
Homogeneous vs. Heterogeneous Machines
Static vs. Dynamic Load Balancing

Other Issues: Communication Costs, Fault-Tolerance, Scalability

Page 6:

Main Questions

How can we design parallel algorithms?
Need to think of places in the algorithm that can be made concurrent
Need to understand data dependencies (“critical path” = longest chain of dependent calculations)

How do we implement these algorithms? An engineering issue with many different options

Page 7:

U. Maryland Work (Vishkin)
FPGA-Based Prototype of a PRAM-On-Chip Processor
Xingzhi Wen, Uzi Vishkin, ACM Computing Frontiers, 2008

Video: http://videos.webpronews.com/2007/06/28/supercomputer-arrives/

Page 8:

Goals
Find a parallel computing framework that:

is easy to program

gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability including backwards compatibility on serial code

supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming

fits current chip technology and scales with it

They claim that PRAM/XMT can meet these goals

Page 9:

What is PRAM?
“Parallel Random Access Machine”

Virtual model of computation with some simplifying assumptions:
No limit to number of processors.
No limit on amount of shared memory.
Any number of concurrent accesses to shared memory take the same time as a single access.

Simple model that can be analyzed theoretically

Eliminates focus on details like synchronization and communication

Different types:
EREW: Exclusive read, exclusive write.
CREW: Concurrent read, exclusive write.
CRCW: Concurrent read, concurrent write.

Page 10:

XMT Programming Model
XMT = “Explicit Multi-Threading”

Assumes CRCW PRAM

Multithreaded extension of C with 3 commands:
Spawn: starts parallel execution mode
Join: resumes serial mode
Prefix-sum: atomic command for incrementing a variable

Page 11:

RAM vs. PRAM

Page 12:

Simple Example
Task: Copy nonzero elements from A to B

$ is the thread ID; PS is the prefix-sum command
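The code itself appeared as an image on the slide and did not survive the scrape. Below is a rough XMTC-style sketch of this array-compaction task, modeled on the standard example from the published XMT tutorials; treat the exact spawn/ps syntax as an approximation rather than a verbatim copy of the slide.

```c
/* Copy the nonzero elements of A[0..n-1] into B (XMTC-style sketch).
 * spawn(low, high) launches one virtual thread per index in [low, high];
 * $ is the current thread's ID. ps(e, x) is the atomic prefix-sum:
 * it adds e to the shared counter x and returns x's old value in e,
 * so each thread obtains a unique slot in B. */
int x = 0;                      /* shared counter: next free slot in B */
spawn(0, n - 1) {
    int e = 1;                  /* local increment for the prefix-sum */
    if (A[$] != 0) {
        ps(e, x);               /* e now holds a unique index into B */
        B[e] = A[$];
    }
}
/* implicit join: serial execution resumes; x = number of elements copied */
```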

Page 13:

Architecture of PRAM prototype

MTCU (“Master Thread Control Unit”): handles sequential portions

TCU clusters: handle parallel portions

64 separate processors, each at 75 MHz
1 GB RAM, 32 KB per cache (8 shared cache modules)

Shared cache

Shared PS unit: only way to communicate!

Page 14:

Envisioned Processor

Page 15:

Performance Results

Using 64 procs

Projected results: 75 MHz -> 800 MHz

Page 16:

Human Results
“As PRAM algorithms are based on first principles that require relatively little background, a full day (300-minute) PRAM/XMT tutorial was offered to a dozen high-school students in September 2007. Followed up with only a weekly office-hour by an undergraduate assistant, some strong students have been able to complete 5 of 6 assignments given in a graduate course on parallel algorithms.”

In other words: XMT is an easy way to program in parallel

Page 17:

Main Claims

“First commitment to silicon for XMT”
An actual attempt to implement a PRAM

“Timely case for the education enterprise”
XMT can be learned easily, even by high schoolers.

“XMT is a candidate for the Processor of the Future”

Page 18:

My Thoughts

Making parallel programming as pain-free as possible is desirable, and XMT makes a good attempt to do this.

Performance is a secondary goal.

Their technology does not seem to be ready for prime time yet:
75 MHz processors
No floating-point operations, no OS

Page 19:

MPI Overview

MPI (“Message Passing Interface”) is the standard for distributed computing

Basically it is an extension of C/Fortran that allows processors to send messages to each other.

A tutorial: http://www.cs.gsu.edu/~cscyip/csc4310/MPI1.ppt
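To make the message-passing style concrete, here is a minimal C/MPI sketch of my own (not taken from the slide or the linked tutorial) in which process 0 sends one integer to process 1:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);                    /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* which process is this? */

    if (rank == 0) {
        value = 42;
        /* send one int to rank 1 with message tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive of one int from rank 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Each process runs the same program and branches on its rank; build with mpicc and launch with, e.g., mpirun -np 2.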

Page 20:

OpenMP Overview

OpenMP is the standard for shared-memory computing

Extends C with compiler directives to denote parallel sections

Normally used for the parallelization of “for” loops.

Tutorial: http://vergil.chemistry.gatech.edu/resources/programming/OpenMP.pdf
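As an illustration of the directive style (my own minimal sketch, not from the linked tutorial), the pragma below splits a loop's iterations across threads and combines per-thread partial sums:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];   /* zero-initialized arrays */
    double sum = 0.0;

    /* The directive parallelizes the loop: iterations are divided among
     * threads, and reduction(+:sum) gives each thread a private partial
     * sum that is added into the shared sum at the end of the region. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i] + 1.0;
        sum += a[i];
    }

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```

Build with an OpenMP-enabled compiler, e.g. gcc -fopenmp.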

Page 21:

Parallel Computing in AI/ML

Parallel Inference in Bayesian networks
Parallel Gibbs Sampling
Parallel Constraint Satisfaction
Parallel Search
Parallel Neural Networks
Parallel Expectation Maximization, etc.

Page 22:

Finding Marginals in Parallel through “Pointer Jumping”

(Pennock, UAI 1998)

Each variable assigned to a separate processor
Processors rewrite conditional probabilities in terms of grandparent:
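The rewriting rule was shown as an equation on the slide and did not survive extraction. As a rough substitute, the C sketch below shows only the generic pointer-jumping skeleton (my own, not Pennock's full procedure, which would also recompute each conditional probability table at every jump): after O(log n) rounds, every node points directly at the root.

```c
#include <stdlib.h>
#include <string.h>

/* Generic pointer jumping: parent[i] is the parent of node i in a rooted
 * tree, with the root pointing to itself. Each round replaces every
 * node's parent by its grandparent, so the distance to the root roughly
 * halves and all nodes reach the root in O(log n) rounds. In the
 * Bayesian-network setting, the same schedule would also rewrite
 * P(x_i | parent) as a conditional on the grandparent at each jump. */
void pointer_jump(int *parent, int n) {
    int *next = malloc(n * sizeof *next);
    int changed = 1;
    while (changed) {
        changed = 0;
        /* On a PRAM (or with an OpenMP parallel for) this loop is one
         * concurrent round over all nodes. */
        for (int i = 0; i < n; i++) {
            next[i] = parent[parent[i]];
            if (next[i] != parent[i]) changed = 1;
        }
        memcpy(parent, next, n * sizeof *parent);
    }
    free(next);
}
```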

Page 23:

Algorithm

Page 24:

Evidence Propagation

“Arc Reversal” + “Evidence Absorption”

Step 1: Make evidence variable the root node and create a preorder walk (can be done in parallel)
Step 2: Reverse arcs not consistent with that preorder walk (can be done in parallel), and absorb evidence
Step 3: Run the “Parallel Marginals” algorithm

Page 25:

Generalizing to Polytrees

Note: Converting Bayesian Networks to Junction Trees can also be done in parallel

Namasivayam et al. Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference. 18th Int. Symp. on Comp. Arch. and High Perf. Comp., 2006.

Page 26:

Complexity

Time complexity:
O(log n) for polytree networks! (assuming 1 processor per variable; n = # of processors/variables)
O(r^(3w) log n) for arbitrary networks (r = domain size, w = largest cluster size)

Page 27:

Parallel Gibbs Sampling
Running multiple parallel chains is trivial.
Parallelizing a single chain can be difficult:

Can use a Metropolis-Hastings step to sample from the joint distribution correctly.

Related ideas: Metropolis-coupled MCMC, Parallel Tempering, Population MCMC
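For the trivially parallel case, here is a hedged OpenMP sketch of my own (the gibbs_sweep body is only a placeholder, not a real sampler) that runs independent chains concurrently, one per thread, each with its own random seed:

```c
#include <omp.h>
#include <stdlib.h>

#define CHAINS 8
#define SWEEPS 10000
#define DIM    16

/* Placeholder sweep: perturbs each coordinate with rand_r so the file
 * compiles. A real Gibbs sweep would resample each variable from its
 * full conditional given the current values of the others. */
static void gibbs_sweep(double *state, int dim, unsigned int *seed) {
    for (int i = 0; i < dim; i++)
        state[i] += (double)rand_r(seed) / RAND_MAX - 0.5;
}

int main(void) {
    static double chains[CHAINS][DIM];   /* zero-initialized start states */

    /* The chains never communicate, so the outer loop parallelizes with
     * no synchronization; each thread keeps a private RNG seed. */
    #pragma omp parallel for
    for (int c = 0; c < CHAINS; c++) {
        unsigned int seed = 1234u + (unsigned int)c;   /* per-chain seed */
        for (int s = 0; s < SWEEPS; s++)
            gibbs_sweep(chains[c], DIM, &seed);
    }
    return 0;
}
```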

Page 28:

Recap

Many different ways to implement parallel algorithms (XMT, MPI, OpenMP)

In my opinion, designing efficient parallel algorithms is the harder part.

Parallel computing in the context of AI/ML is still not fully explored!