A Case for Teaching Parallel Programming to Freshmen. Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology. Workshop on Directions in Multicore Programming Education, Washington D.C., March 8, 2009.
1
A Case for Teaching Parallel Programming to Freshmen
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
Workshop on Directions in Multicore Programming Education, Washington D.C. March 8, 2009
2
One view of parallel programming
Multicores are coming (have come)
Performance gains no longer automatic and transparent
Most programmers have never written a parallel program
Different models for exploiting parallelism, depending upon the application
  Data parallel, Threads, TM, Map-Reduce, …
How to migrate my software
How to get performance
How to educate my programmers
It is all about performance
3
Another view of parallel programming
Every gadget is concurrent and reactive
  Many weakly interrelated tasks happening concurrently
    cell phones: playing music, receiving calls, web browsing
  Hitherto independent programs are required to interact
    What should the music player do when you are browsing the web?
  Ambiguous specs: not clear a priori what a user wants
Infrastructure is a parallel database for processing queries and commands
  Scalable infrastructure to deal with ever-increasing queries
  The database is more than just records: many streams of data are constantly being fed in
  Each interaction requires many queries and transactions
Parallelism is obvious, but interactions between modules can be complex even when infrequent
Even though the substrate is multicore, performance is a secondary issue
4
My take
Modeling, simulating, and programming parallel and concurrent systems is a more fundamental problem than how to make use of multicores efficiently
Freshman teaching should focus on composing parallel programs; sequential programming should be taught (perhaps) as a way of writing the modules to be composed
Within a few years multicores will be viewed as a transparent way of simplifying and speeding up parallel programs (not very different from the way we used to view computers with faster clocks)
5
The remainder of the talk
Parallel programming can be simpler than sequential programming for inherently parallel computations
Some untested ideas on what we should teach freshmen
6
Parallel programming can be easier than sequential programming
7
H.264 Video Decoder
May be implemented in hardware or software depending upon ...
[Figure: H.264 decoder block diagram. Input: Compressed Bits. Blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames. Output: Frames.]
Different requirements for different environments
  QVGA 320x240p (30 fps)
  DVD 720x480p
  HD DVD 1280x720p (60-75 fps)
8
Sequential code from ffmpeg
void h264decode(){
  int stage = S_NAL;
  while (!eof()){
    createdOutput = 0;
    stallFromInterPred = 0;
    switch (stage){
      case S_NAL:
        try_NAL();
        stage = (createdOutput) ? S_Parse : S_NAL;
        break;
      case S_Parse:
        try_Parse();
        stage = (createdOutput) ? S_IQIT : S_NAL;
        break;
      case S_IQIT:
        try_IQIT();
        stage = (createdOutput) ? S_Parse : S_Inter;
        break;
      case S_Inter:
        try_Inter();
        stage = (createdOutput) ? S_IQIT : S_Intra;
        /* as written, this assignment overrides the one above */
        stage = (stallFromInterPred) ? S_Deblock : S_Intra;
        break;
      case S_Intra:
        try_Intra();
        stage = (createdOutput) ? S_Inter : S_Deblock;
        break;
      case S_Deblock:
        try_deblock();
        stage = S_Intra;
        break;
    }
  }
}
[Figure: decoder stages NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict, Deblocking]
20K lines of C out of 200K
The programmer is forced to choose a sequential order of evaluation and write the code accordingly (non trivial)
9
Price of obscuring the parallelism
Program structure is difficult to understand
Packets are kept and modified in a global heap (nothing to do with the logical structure)
Unscrambling the over-specified control structure for parallelization is beyond the capability of current compiler techniques
Thread-level data parallelism?
10
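Pthreads
A (p)thread for each block
But there is no control over mapping
int main(){
  /* One thread per decoder block. Call arguments are abbreviated: the real
     pthread_create also takes a thread handle, attributes, and an argument pointer. */
  pthread_create(NAL);
  pthread_create(Parse);
  pthread_create(IQIT);
  pthread_create(Interpred);
  pthread_create(Intrapred);
  pthread_create(Deblock);
}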
[Figure: processors and threads (NAL, Parse, IQ/IT, Inter-predict, Intra-predict, DeBlk), including sleeping threads]
This is an implementation model
11
StreamIt (Amarasinghe & Thies)
a more natural expression using filters
bit -> frame pipeline H264Decode {
  add NAL();
  add Parse();
  add IQIT();
  add feedbackloop {
    join roundrobin;
    body pipeline {
      add InterPredict();
      add IntraPredict();
      add Deblock();
    }
    split roundrobin;
  }
}
[Figure: decoder stages NAL, Parse, IQ/IT, Inter-Predict, Intra-Predict, Deblocking]
Given the required rates, the StreamIt compiler can do a great job of generating efficient code
Feedback is problematic!
12
Functional languages (pH)
Natural expression of parallelism but too general
do_H264 :: Stream Chunk -> Stream Frame
do_H264 inputStream = let
    fMem :: IStructFrameMem MacroBlock
    fMem = makeIStructureMemory
    nalStream = nal inputStream
    parseStream = parse nalStream
    iqitStream = iqit parseStream
    interStream = inter iqitStream fMem
    intraStream = intra interStream
    deblockStream = deblock intraStream fMem
  in deblockStream
The language does not provide any hints about the level of granularity at which parallelism should be considered, by either the programmer or the compiler
FLs provide a solid base for building domain-specific parallel languages
13
An Idea we are testing: Hardware-design inspired parallel programming
14
Hardware-design inspiration
Hardware is all about parallelism, but there is no virtualization of resources
  If one asks for two adders then one gets two adders; if one needs to do more than two additions at a time, the adders are time-multiplexed explicitly
Two-level compilation model
  One can do a design with n adders, but at some stage of compilation n must be specified (instantiated) to generate hardware. Each instantiation of n results in a different design
  Analogy: in software one may want to instantiate different code for a different problem size or a different machine configuration.
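The remark about two adders can be made concrete with a small sketch. It assumes a hypothetical helper, `schedule_additions` (not from the talk), that sums a list of values with a fixed number of adders and records which additions happen in each cycle. Choosing `n_adders` plays the role of instantiating n: each choice yields a different schedule for the same computation, and the time multiplexing is explicit rather than hidden.

```python
def schedule_additions(values, n_adders):
    """Sum `values` using at most `n_adders` additions per cycle.

    Returns (total, cycles), where `cycles` lists the operand pairs added in
    each cycle. Hypothetical helper, for illustration only.
    """
    if n_adders < 1:
        raise ValueError("need at least one adder")
    pending = list(values)
    cycles = []
    while len(pending) > 1:
        this_cycle = []
        produced = []
        # Each available adder consumes two pending operands in this cycle.
        while len(pending) >= 2 and len(this_cycle) < n_adders:
            a, b = pending.pop(0), pending.pop(0)
            this_cycle.append((a, b))
            produced.append(a + b)
        # Operands that found no free adder wait, followed by the new partial sums.
        pending = pending + produced
        cycles.append(this_cycle)
    total = pending[0] if pending else 0
    return total, cycles


for n in (1, 2, 4):
    total, cycles = schedule_additions(range(1, 9), n)
    print(n, total, len(cycles))
# prints:
# 1 36 7
# 2 36 4
# 4 36 3
```

The result is the same for every n; only the number of cycles changes, which is the sense in which each instantiation is a different design.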
15
H.264 in Bluespec
module mkH264( IH264 );
  // Instantiate the modules
  Nal nal <- mkNalUnwrap();
  ...
  DeblockFilter deblock <- mkDeblockFilter();
  FrameMemory frameB <- mkFrameMemoryBuffer();

  // Connect the modules
  mkConnection(nal.out, parse.in);
  mkConnection(parse.out, iqit.in);
  …
  mkConnection(deblock.mem_client, frameB.mem_writer);
  mkConnection(inter_pred.mem_client, frameB.mem_reader);

  interface in = nal.in; // Input goes straight to NAL
  interface out = deblock.out; // Output from deblock
endmodule
Modularity and dataflow are obvious
No sharing of resources
No time multiplexing issue if each module is mapped on a separate core
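A rough software analogue of this wiring, offered only as a sketch: each module is an ordinary sequential loop on its own thread, and each connection is a FIFO edge with blocking send and receive. The names `source`, `node`, `run_modules`, and `DONE` are assumptions made for illustration, and the stage functions in the usage line are placeholders rather than decoder logic.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of a stream


def source(items, outbox):
    """Source module: send every item, then signal end of stream."""
    for item in items:
        outbox.put(item)  # blocking send: waits while a bounded edge is full
    outbox.put(DONE)


def node(fn, inbox, outbox):
    """Interior module written as ordinary sequential code."""
    while True:
        item = inbox.get()  # blocking receive: waits while the edge is empty
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))


def run_modules(stages, items, capacity=2):
    """Chain the stages with FIFO edges; capacity=0 means unbounded edges."""
    edges = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    threads = [threading.Thread(target=source, args=(items, edges[0]))]
    threads += [
        threading.Thread(target=node, args=(fn, edges[i], edges[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    results = []
    while True:  # the caller acts as the sink module
        item = edges[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


print(run_modules([lambda x: x + 1, lambda x: x * 2], range(5), capacity=1))
# prints [2, 4, 6, 8, 10]
```

With a small capacity a fast producer is throttled by a slow consumer, whereas unbounded edges let it run ahead. The output is the same in both cases because every edge is FIFO and each module has one thread.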
16
H.264 Decoder in Bluespec
Elliott Fleming, Chun Chieh Lin
[Figure: H.264 decoder block diagram. Input: Compressed Bits. Blocks: NAL unwrap, Parse + CAVLC, Inverse Quant Transformation, Intra Prediction, Inter Prediction, Deblock Filter, Ref Frames. Output: Frames.]
Behaviors of modules are composable
Each module can be refined separately
Any module can be compiled in SW
Are there ideas worth carrying over to Parallel SW?
8K lines of Bluespec
Decodes 1080p@70fps
Area 4.4 mm sq (180nm)
17
What should we teach freshmen
18
General guidelines
Make it easy to express the parallelism present in the application
  no unnecessary sequentialization
  no forced grouping of logically separate memories
Separate and deemphasize the issue of restructuring code for better sequential performance
19
Topics
Finite state machines
choose problems that have a natural solution as an FSM
show composition and interaction of parallel FSMs
Dataflow networks with unbounded and bounded edges
show programming of nodes in a sequential language with blocking sends and receives
Types, modularity, data structures, etc. are important topics but orthogonal to parallelism; these topics should be taught all the time
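The FSM topic above (composition and interaction of parallel FSMs) can be illustrated with the cell-phone scenario from earlier in the talk. This is a minimal sketch under stated assumptions: the state names, the event names, and the policy that music is interrupted while the phone is busy are all invented for illustration, since the talk itself notes that such specs are ambiguous.

```python
# Transition tables: (state, event) -> next state.
# Pairs that are not listed leave the state unchanged.
PHONE = {
    ("idle", "incoming"): "ringing",
    ("ringing", "answer"): "in_call",
    ("ringing", "reject"): "idle",
    ("in_call", "hangup"): "idle",
}
PLAYER = {
    ("stopped", "play"): "playing",
    ("playing", "stop"): "stopped",
    # Interaction with the phone FSM (an assumed policy, not from the talk).
    ("playing", "phone_busy"): "interrupted",
    ("interrupted", "phone_free"): "playing",
    ("interrupted", "stop"): "stopped",
}


def step(table, state, event):
    """Advance one FSM by one event."""
    return table.get((state, event), state)


def compose(phone, player, event):
    """One synchronous step of the two machines running side by side."""
    new_phone = step(PHONE, phone, event)
    new_player = step(PLAYER, player, event)
    # The player also reacts to changes in the phone's state.
    was_busy = phone != "idle"
    is_busy = new_phone != "idle"
    if is_busy and not was_busy:
        new_player = step(PLAYER, new_player, "phone_busy")
    elif was_busy and not is_busy:
        new_player = step(PLAYER, new_player, "phone_free")
    return new_phone, new_player


def run(events, phone="idle", player="stopped"):
    """Return the sequence of (phone, player) states visited."""
    trace = [(phone, player)]
    for event in events:
        phone, player = compose(phone, player, event)
        trace.append((phone, player))
    return trace


for states in run(["play", "incoming", "answer", "hangup", "stop"]):
    print(states)
```

Each machine stays small and is described on its own; the interaction lives entirely in `compose`, which is where a different policy could be substituted.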
20
Some challenges
No appropriate language or tools
Need to think up new illustrative problems from the ground up
  Fibonacci, “Hello world”, matrix multiply won’t do
21
Takeaway
Parallel programming is not a special topic in programming
  Parallel programming is programming
  Sequential and parallel programming can be introduced together
Parallel thinking is as natural as sequential thinking
Thanks
22
Zero cost parameterization
Example: OFDM based protocols
[Figure: OFDM transceiver block diagram with blocks marked "standard specific" or "potential reuse". Blocks: MAC, TX Controller, Scrambler, FEC Encoder, Interleaver, Mapper, Pilot & Guard Insertion, IFFT, CP Insertion, D/A, A/D, Synchronizer, S/P, FFT, Channel Estimator, De-Mapper, De-Interleaver, FEC Decoder, De-Scrambler, RX Controller, MAC]
Different algorithms
  Convolutional
  Reed-Solomon
  Turbo
Different throughput requirements
  WiFi: 64pt @ 0.25MHz
  WiMAX: 256pt @ 0.03MHz
  WUSB: 128pt @ 8MHz
Reusable algorithm with different parameter settings
  WiFi: x^7+x^4+1
  WiMAX: x^15+x^14+1
  WUSB: x^15+x^14+1
85% reusable code between WiFi and WiMAX
From WiFi to WiMAX in 4 weeks
(Alfred) Man Chuek Ng, …
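The scrambler polynomials on this slide suggest how one algorithm can serve several protocols through parameters alone. The sketch below assumes a hypothetical `make_scrambler` factory (not from the talk) that builds an additive LFSR scrambler from the exponents of a polynomial. The all-ones seeds are arbitrary, and the bit-ordering and seeding rules of the actual standards are not modeled.

```python
def make_scrambler(taps, seed):
    """Build an additive scrambler for the polynomial with the given exponents.

    taps: exponents of the non-constant terms, e.g. (7, 4) for x^7 + x^4 + 1.
    seed: initial register contents as a list of bits; seed[i] is stage i + 1.
    Hypothetical helper, for illustration only.
    """
    degree = max(taps)
    if len(seed) != degree:
        raise ValueError("seed must have exactly %d bits" % degree)

    def scramble(bits):
        state = list(seed)  # restart from the seed on every call
        out = []
        for bit in bits:
            feedback = 0
            for t in taps:
                feedback ^= state[t - 1]
            out.append(bit ^ feedback)       # XOR the data with the keystream
            state = [feedback] + state[:-1]  # shift, feeding back into stage 1
        return out

    return scramble


# Same code, different parameter settings.
wifi = make_scrambler((7, 4), [1] * 7)
wimax = make_scrambler((15, 14), [1] * 15)

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(wifi(wifi(data)) == data)    # prints True
print(wimax(wimax(data)) == data)  # prints True
```

Because the keystream does not depend on the data, applying the same scrambler twice restores the input, so one function covers both the scrambler and de-scrambler blocks.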