A Case for Teaching Parallel Programming to Freshmen

1

A Case for Teaching Parallel Programming to Freshmen

ArvindComputer Science & Artificial Intelligence Lab.Massachusetts Institute of Technology

Workshop on Directions in Multicore Programming Education, Washington D.C. March 8, 2009

2

One view of parallel programming

Multicores are coming (have come) Performance gains no longer automatic and

transparent Most programmers have never written a parallel

program Different models for exploiting parallelism,

depending upon the application Data parallel, Threads, TM, Map-Reduce, …

How to migrate my softwareHow to get performanceHow to educate my programmers

It is all about performance

3

Another view of parallel programming

Every gadget is concurrent and reactive Many weakly interrelated tasks happening concurrently

cell phones- playing music, receiving calls, web browsing Hither to independent programs are required to interact

What should the music player do when you are browsing the web Ambiguous specs: Not clear a priori what a user wants

Infrastructure is a parallel database for processing queries and commands

Scalable infrastructure to deal with ever increasing queries The database is more than just records -- Many streams of

data constantly being fed in Each interaction requires many queries and transactions

Parallelism is obvious but interactions between modules can be complex even when infrequent

Even though the substrate is multicore,performance is a secondary issue

4

My takeModeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficientlyFreshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed

Within a few years multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different than the way we used to view computers with faster clocks)

5

The remainder of the talk

Parallel programming can be simpler than sequential programming for inherently parallel computationsSome untested ideas on what we should teach Freshman

6

Parallel programming can be easier than sequential programming

7

H.264 Video Decoder

May be implemented in hardware or software depending upon ...

NALunwrap

Parse+

CAVLC

Inverse Quant

Transformation

DeblockFilter

IntraPrediction

InterPrediction

RefFrames

Com

pre

ssed

B

its

Fram

es

Different requirements for different environments - QVGA 320x240p (30 fps) - DVD 720x480p - HD DVD 1280x720p (60-75 fps)

8

Sequential code from ffmpeg

void h264decode(){int stage = S_NAL;

while (!eof()){ createdOutput = 0; stallFromInterPred = 0; case (stage){ S_NAL: try_NAL();

stage=(createdOutput) ? S_Parse:S_NAL; break; S_Parse: try_Parse(); stage=(createdOutput) ? S_IQIT:S_NAL; break; S_IQIT: try_IQIT(); stage=(createdOutput) ? S_Parse:S_Inter; break; S_Inter: try_Inter(); stage=(createdOutput) ? S_IQIT:S_Intra; stage=(stallFromInterPred)?S_Deblock:S_Intra; break; S_Intra: try_Intra(); stage=(createdOutput) ? S_Inter:S_Deblock; break; S_Deblock: try_deblock(); stage= S_Intra; break } } }

Parse

NAL

IQ/IT

Inter-Predict

Intra-Predict

20K Lines of Cout of 200K

Deblocking

The programmer is forced to choose a sequential order of evaluation and write the code accordingly (non trivial)

9

Price of obscuring the parallelism

Program structure is difficult to understand

Packets are kept and modified in a global heap (nothing to do with the logical structure)

Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques

Thread-level data parallelism?

10

P ThreadsA (p)thread of each block

But there is no control over mapping

int main(){pthread_create(NAL);phtread_create(Parse);pthread_create(IQIT);pthread_create(Interpred);pthread_create(Intrapred);pthread_create(Deblock);}

Processors

NALthread

Parsethread

DeBlkthread

Intraprthread

IQ/IT threadInterpredict thread

Sleeping threads

This is an implementation

model

11

StreamIT (Amarasinghe & Thies)a more natural expression using filters

bit -> frame pipeline H264Decode {add; NAL();add; Parse();add; IQIT();add; feedbackloop{

join roundrobin;body pipeline{

add; InterPredict();add; IntraPredict();add; Deblock();}

split roundrobin;}}

Parse

NAL

IQ/IT

Inter-Predict

Intra-Predict

Deblocking

Given the required rates StreamIt compiler can do a great job of generating efficient code

Feedback is Problematic!

12

Functional languages (pH)Natural expression of parallelism but too general

do_H264 :: Stream Chunk -> Stream Framedo_H264 = let

fMem :: IStructFrameMem MacroBlockfMem = makeIStructureMemorynalStream = nal inputStreamparseStream = parse nalStreamiqitStream = iqit parseStreaminterStream = inter iqitStream fMemintraStream = intra interStreamdeblockStream = deblock intraStream fMem

in deblockStream

The language does not provide any hints about which level of granularity the parallelism should be considered by either the programmer or the compiler

FLs provide a solid base for building domain-specific

parallel languages

13

An Idea we are testing: Hardware-design inspired parallel programming

14

Hardware-design inspirationHardware is all about parallelism but there is no virtualization of resources

If one asks for two adder then one gets two adders – if one needs to do more than two additions at a time, the adders are time multiplexed explicitly

Two-level compilation model One can do a design with n adders but at some stage

of compilation n must be specified (instantiated) to generate hardware. Each instantiation of n results in different design

Analogy - In software one may want to instantiate a different code for different problem size or different machine configuration.

15

H.264 in Bluespecmodule mkH264( IH264 )// Instantiate the modules

Nal nal <- mkNalUnwrap();...DeblockFilter deblock <- mkDeblockFilter();FrameMemory frameB <- mkFrameMemoryBuffer();

//Connect the modulesmkConnection(nal.out, parse.in);mkConnection(parse.out, iqit.in);…mkConnection(deblock.mem_client, frameB.mem_writer);mkConnection(inter_pred.mem_client, frameB.mem_reader);

interface in = nal.in; //Input goes straight to NALinterface out = deblock.out; // Output from deblockendmodule

Modularity and dataflow is obvious

No sharing of

resources

No time multiplexing issue if each module is mapped on a separate core

16

H.264 Decoder in Bluespec Elliott Fleming, Chun Chieh Lin

NALunwrap

Parse+

CAVLC

Inverse Quant

Transformation

DeblockFilter

IntraPrediction

InterPrediction

RefFrames

Com

pre

ssed

B

its

Fram

es

Behaviors of modules are composableEach module can be refined separatelyAny module can be compiled in SW

Are there ideas worth carrying over to Parallel SW?

8K lines of BluespecDecodes 1080p@70fpsArea 4.4 mm sq (180nm)

17

What should we teach freshman

18

General guidelinesMake it easy to express the parallelism present in the application no unnecessary sequentialization no forced grouping of logically separate

memories

Separate and deemphasize the issue of restructuring code for better sequential performance

19

TopicsFinite state machines

choose problems that have a natural solution as an FSM

show composition and interaction of parallel FSMs

Dataflow networks with unbounded and bounded edges

show programming of nodes in a sequential language with blocking sends and receives

Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time

20

Some challenges

No appropriate language or tools

Need to think up new illustrative problems from the ground up Fibbonacci, “Hello world”, matrix

multiply won’t do

21

TakeawayParallel programming is not a special topic in programming

Parallel programming is programming Sequential and parallel programming can be

introduced together

Parallel thinking is as natural as sequential thinking

Thanks

22

Zero cost parameterizationExample: OFDM based protocols

MAC

MAC

standard specific

potential reuse

ScramblerFEC

EncoderInterleaver Mapper

Pilot &Guard

InsertionIFFT

CPInsertion

De-Scrambler

FECDecoder

De-Interleaver

De-Mapper

ChannelEstimater

FFT Synchronizer

TXController

RXController

S/P

D/A

A/D

Different algorithms

Different throughput requirements

Reusable algorithm with different parameter settings

WiFi: 64pt @ 0.25MHz

WiMAX: 256pt @ 0.03MHz

WUSB: 128pt 8MHz

85% reusable code between WiFi and WiMAXFrom WiFi to WiMAX in 4 weeks

(Alfred) Man Chuek Ng, …

WiFi:x7+x4+1

WiMAX:x15+x14+1

WUSB:x15+x14+1

Convolutional

Reed-Solomon

Turbo

Documents

A Case for Teaching Parallel Programming to Freshmen