-A191 TMG FUNDAMENTAL ISSUES IN NULTIPROCESSIMG(U

-A191 $29 TMG FUNDAMENTAL ISSUES IN NULTIPROCESSIMG(U) 1/1MASSACHIUSETTS INST OF TECH CAMBRIDGE LAO FOR COMPUTERSCIENCE R A IANNUCCI ET AL OCT 87 MIT/LCS/TM-338

UNCLASSIFIED NSf14-83-K-125 F/G 12/6

mnEEEEEEEEmmmEEEEEEEEEmhmhmmfllflfflm

lllL 11111 4 .0-6I

* III'.

OTIC f ILECM

,AO, FMASSACHUSETTS"-LABORATORY FOREtNSTUEO"." INSTITUTE OFCOMPUTER SCIENCE TECHNOLOGY

MIT/LCS/TM-330

*,I' J

N0

TWO FUNDAMENTAL ISSUESIN MULTIPROCESSING

Arvind

Robert A. Iannucci

M AP 3

October 1987

V, 545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139

is -- ,

.-.. ;=,.,:,. .. -- % N

- alassified

SECURITY CLASSIFICATION OF THIS PAGE

REPORT DOCUMENTATION PAGE

- la REPORT SECURITY CLASSIFICATION 1b RESTRICTIVE MARKINGS

*- 1'n.lassi' -d 40 1 /e2A 72a SECURITY CLASSIFiCATION AUTHORITY 3 DISTRIBUTIONIAVAILABILITY OF REPORT

2b DECLASSFCATION'UDOWNGRADiNG SCHEDULE Approved for pui].c r lease; distribution

is unlimited.

4 PERFORMING ORGANIZATION REPORT NUMBER(S) 5 MONITORING ORGANIZATION REPORT NUMBER(S)

IT! [AiI'S.-} 130 NOuO4-83-K-0125 and N00014-84-K-0099

6a NAME OF PERFORMING ORGANIZATION " 6o OFFICE SYMBOL la NAME OF MONITORING ORGANIZATION

.I[ Laiir itt Yv 1. ,r CLo:.puter (It applicable)0. i, A,-L2

6( 6 ADDRESS (City, State, and ZIP Code) 7b ADDRESS (City. State, and ZIP Code)

-'. 5-a To:,o,-v > quare in forrilti vterpLi i'ern

iA 021 -39 Arlinl.t)n, VA 222i17

- 8 NAME 0: UtDNG SPONSOR:NG 8Ib OFFICE SYMBOL 9 PROCUREMENT INSTRUMENT IDENT F!CATiON NUMBER .

ORGAN'ZATION (It applicable).\ \,POD

Sc ADDRESS(City, State. and ZIP Code) 10 SOuRCE O PoNDNG NUMBFS

1 401) .' I 5) 1 ! 'ti PROGRAM PROJECT TASK WORK UNIT.\i ~nton, VA 22217 ELEMENT NO NO NO 4CCESSOiN [.O

11 TITLE (include Security Classification)

X TWO__LNDA>ENTAL ISSUES IN .tULL IPROCESSIN(;

12 PERSONAL AUTHOR(S)

Arvind, and [nnucci , Robert A.

13a TYPE O; REPORT 13b TIME COVERED 14 DATE OF REPORT (Year. Month, Day) 1S PAGE COUNTc c 7n i4ca] FROM__ TO 1987 Octoberj 24

16 SU"'LEVENTAPY NOTATION

17 COWAT' CODES 18 SUBJECT TERMS (Continue on reverse it necessary and identify by block number)

FELD GROJP SuB-GROUP cache-s, cacne coherence, datailow architectures, hazardresolution, instruction pipci ining, LT,\I)/STORE architec-tures, r'emorv, iatenc, multiprocessors, multi-thread

1 9 ABSRACT 'Continue on reverse if necessary and identify by block number)

.', Icira purpose ;,mi]t1)1rocessor should ho sCalable, i.e. , show higher performance:!M :>, 1 i hardware resources are added ,to the machine. Architoect., of s'c> multiprocessors

i :, I.dres. the loss in processor ef Iic iency due t( twO fundamental issues: long memo rvannc s d waits due to syn'iironizato-n events. It i s argued that a well designed

n . m ,vercome these losses providCd there [s sufficient parall elism in the program:*'[' i, I :- ,cuted. 'the detrimental effect of Long Iatenc:v can be reduIced bv instruction

-i.pe ininc , }lowever, the restriction of n single tread of computation in von Neumannr"o r. sey'relv I 1 i ts thei r ah i I i ty to have mo-re than a few ins truc tions in the pipe-A?.

SI1'. Iirt itrii,,re, tchniquIes L,, redt Ire( tht memo rv Latency ten d to increase the cost of- Pw .tili T.he. 110 (os t ol snihron iz *t on eventL, in von ieumann machines makes decom-

i.,in , i :rogram int, very small tasks counter-productive. DIatal low machines, on the other.,. i, truat each ins t-ruet ion as a task, and by payin. a sImall synchronizati rn cost for

20 [),STR18,T1ONAVAiLABILITY OF ABSTRACT 21 ABSTRACT SECURITY CLASSIFICATION

7 NI :,'SSOED'IJNLM[TED f SAME AS RPT [] DTIC USERS Unelassifieda rAL 0 f') ONSBLE ;INDIJIDUAL 22b TELEPHONE (Include Area Code) 22c OFFICE SYMBOL

ti, Li, tt. 1hibl icat ions Coordinator (617) 253-589i

DD FORM 1473, 94 VAR 83 AOR edton may be used until exhausted SECURITY CLASSIFICATION OF THIS PAGEAll other editons arc? obsolete

* m. 1 sif Jet

| =.. . . . . . . .. M. . O-

: " " -. . . ' " % .". ".3"% ," _% ", ". ". . . -'- ''. '. ". .-.. . ., _ " " " ..- % -. -" " " .. . .. % .. " '.S - .

-r --% F - F . ' ..- Y . -- P F. P

r18. architectures, semaphores, synchronization, von Neumann airchitecture

19. each instruction executed, offer the ultimate flexibility in schedulingIinstructions to reduce processor idle time.

%-N

K

LABOATOR FORMASSACHUSETTSLABOATOR FORINSTITUTE OFCOMPUTER SCIENCE TECHNOLOGY

MITILCS/TM-330Replaces MITILCS/TrM-241

Two Fundamental Issues in Multiprocessing

A rvin d

Robert A. lannucci

4 To amar in the Proceedins ofon

"Parallel Processing in Science and Engineering"Junc 25-26, 1987

-' Bonn-Bad Godesbcrg

This rceor describes research done at the Laboratory for Computer Science of theMassachusett Institute of Technology. Funding for the Laboratory is provided in p art bythe Advanced Research Projects A enc yf the Department of Defense under Office ofNaval Research contracts N00014-83-K-Ol25 and N00014-84-K-0099. The secondauthor is employed by the International Business Machines Corporation.

545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139

I%

-- y - - 7 - - -- - -. ' -W w _WJ.- _1('--.-_

Abstract

A general purpose multiprocessor should be scalable, i.e., show higher performance when morehardware resources are added to the machine. Architects of such multiprocessors must address the loss inprocessor efficiency due to two fundamental issues: long memory latencics and waits due tosynchronization events. It is argued that a well designed processor can overcome these losses providedthere is sufficient parallelism in the program being executed. The detrimental effect of long latency canbe reduced by instruction pipelining, however, the restriction of a single thread of computation in vonNeumann processors severely limits their ability to have more than a few instructions in the pipeline.Furthermore, techniques to reduce the memory latency tend to increase the cost of task switching. Thecost of synchronization events in von Neumann machines makes decomposing a program into very smialltasks counter-productive. Dataflow machines, on the other hand, treat each instruction as a task, and bypaying a small synchronization cost for each instruction executed, offer the ultimate flexibility i

scheduling instructions to reduce processor idle time.i Key words and phrases: caches, cache coherence, datallow architectures, hazard resolution, instruction

Spipelining, LOAD/STORE architectures, memory latency, multiprocessors, multi-thread architectures,semaphores, synchronization, von Neumann architecture.

This paper is a revision of our previous work on the subject:

. Version TitleCSG Memo 226-1 A Critique of Multiprocessing von Neumann Style

Presented at the 101h International Symposium on Computer Architecture,Stockholm, Sweden, June 14-17, 1983

w CSG Memo 226-2 Two Fundamental Issues in Multiprocessing. the Dataflow SolutionReprinted as MIT/LCS/FM 241. September, 1983.

CSG Memo 226-3 Two Fundamental Issues in Multiprocessing: the Dataflow SolutionAugust, 1985

CSG Memo 226-4 Two Fundamental Issues in MultiprocessingFebruary, 1986 0."

CSG Memo 226-5 Two Fundamental Issues in MultiprocessingJuly, 1986

>_ ..... .......F

". ...~~.. .... ..:: .". .". ... ....-..-. ." -.-. . .' -*" .. ..*'"

-'-

Table of Contents

1. Importance of Processor Architecture I1.1. Scalable Multiprocessors I1.2. Quantifying Parallelism in Programs 2

2. Latencv and Synchronization 52.1. Latency: The First Fundamental Issue 52.2. Synchronization: The Second Fundamental Issue 6

3. Processor Architectures to Tolerate Latency 83.1. Increasing the Processor State 93.2. Instruction Prefetching 93.3. Instruction Buffers, Operand Caches and Pipelined Execution 93.4. Load/Store Architectures 10

4. Synchronization Methods for Multiprocessing 134.1. Global Scheduling on Synchronous machines 134.2. Interrupts and Low-level Context Switching 144.3. Semaphores and the Ultracomputer 144.4. Cache Coherence Mechanisms 15

5. Multi-Threaded Architectures 175.1. The Denelcor HEP: A Step Beyond von Neumann Architectures 175.2. Dataflow Architectures 18

6. Conclusions 20

%

I

€.

.1

I

p.

List of Figures

Figure 1: The Effect of Scaling on Software 2Figure 2: Parallelism Profile of SIMPLE on a 20 X 20 Array 3Figure 3: Speed Up and Utilization for 20 x 20 SIMPLE 4Figurc 4: Structural Model of a Multiprocessor 6Figure 5: Operational Model of a Multiprocessor 7Figure 6: The von Neumann Processor (from Gajski and Peir [201) 10Figure 7: Variable Operand Fetch Time IFigure 8: Hazard Avoidance at the Instruction Decode Stage 12Figure 9: Latency Toleration and Synchronization in the HEP !8Figure 10: The MIT Tagged-Token Dataflow Machine 19

I

-a

I

I

I

. . . ... .'N a

' . . . . . . . . . . . . . . . ..°a . . . .-. - - - . . . . . . . . . . . . . . . . . at-

T oFundamental Issues in Multiprocessing

1. Imnportance of Processor ArchitectureParlle mchies avno up to several do/-i processors arc commcrcilff available nov:. Most ofI th:

dc ~igns are based onl von Neumann processor,, operating out of' a shared memory. Thie dii fereniwes in thearchitectures of- these machines in terms of' processor speed. inerior-\ orciani/aiw-n and con) muric at ~1.,;

sstns re sig~nificantl but they all use relativel v conventional \iont Ncum arm procvs, ors Thcsomachines represent the 1crieral belief' that processor architcture is of little in portanor ;n cincparallel machines. WVe will shoA the 1'llaCy Of this assumption On th: bNJ Is of ',\o j ISInC':: oi~lmor%

lenyand xvnchronization. Our areument is based onl the follknwing obseRva-6tls:

IMost von Neumann processor., atec likely to "idle dluring lorig mem ory refereaces, !rid SukIhreferences are unavoidable in parallel machines.

2Waits for synchronization events often require task switching, whitch is e ;ncnsivc_ <-Neumann machines. Therefore. only certain types oif pmrfleh .~r :-ai r xpo:t:efficiently.

0 We believe the effect of these issues onl performance to be fUbidamntal, ~nA:o a laroe degree.orthogonal to the effect of circuit tcechnolog .\V We% ill argue that by '2 " 'ling thle procek-scrpI-rl thejtrimnentai effeCt eol inemor 'v latenic ' iflo perfi rmalic c can he retlucc., pr ided there i Y l) :ral!cIstr in theproi'ramn. However, techniques for reducino the effect Or' latecy1 tenld to UncrCaw thC s-. ich oJ/Al 011

cost.

h i the rest of this, section, weQ articulate our assumptions vcgardiltg generaml pu PoTe par ICtomputers..We [hen discuss the often neglected issue of quantifyin the amount of prle~ nveasSection 2 develops a framewkork for defining the issues of latency and ss nchroiit oI,. Section 3eamines the met-hods to reduce the effect of memory latency in von Neumann compute:rs and discusses

thcIr limitations. Section 4 similarly examines synchronization methods and theilr cost. 'it Section 5. we-discuss multi-threaded computers like FIEP and the MIT Tagged Token Datatlowk machiine, and showhow, these machines can tolerate latency and synchroni/jtion Costs roic(1there is sullvnt paralllism

inporams Th atscin~man/es our conclusions.

1.1. Scalable NMutlt iprocessors

We are primarily interested in gern'ral purpose parallel compiaerr, e ,m C,)lPutfrs thlat Canl e'XploitS narallelism. when present, in anyv program. Further, we want multiprocessors to be wcal/a/c in such a

manner that adding hardware resources results in higher pcr1'oni ance 7 i cit co ring cagsinapp caionprograms. The focus of the paper is not onl aritrrl large machines, but mn chines which

~anc n i/e ro tn o touand processor,.. We expect the processrs, to be, at least as powerful as the

current microprocessors and possi bI ' as powecrfil as the C11'' sOf the current sape~rciputers,. In*particular, the context of the discu.ssion1 is1 P() machines with millions of one bit AlA *s, de(,/cns of Which

may fit On one chip. T'he design of such MachinfeS Will LCriainl v involve fundamnrtal issuc' iti addition totSe peetdhr.Msparallel machines that are available today or likely to be availablc in the next

few years fall Within the scope of this paper (e.g.-, the BBN Butterfly [361, ALICE 1.11 and nowILAGSI IP, the Cosmic Cube 135 and lntel',, iPSC, IMsRil 1 311, All iant and CEIDAR 1261. and

J, ~ ;RIPII 1)

If the programnming model ofI a parallel machine reflects, the mnachine cu nign Iration, t,~ nunmber of

0L-A "

r.

Redesign Rewrite Rewrite Recompil , Reiniiialize

the the the the the

ALGORITHM PRGJRAM COMPILER f; R C A M 1ESO!'R('!.V %IANVJ;E1?S

r S 7V a, K, r i t , v

P'rese-ves t, .;rcu .. tPrswr',e c tv ' r

Figure 1: The Effect of Scalin on Sotlt,.vir

processors and interconnection topology, the machine is not ,cdahlc in a pracucal ,cnse ('innrng hemachine configuration should not require changes in applcation prolrams or s\ stc'm solix are: up,3latingtables in the resource management system to reflect the ncw c ,ni.'urationl should be sulficient, v~cvcr.lew multiprocessor designs have taken this stance %ith regard to -,caling In fact, it is not uncommon: to1 ind that source code (and in some cases, algorithms) must be modified in order it run on an alheredmachine configuration. Figure 1 depicts the range of cflccts of scaling on ihe soft arc. Oh,iouLI,\. weconsider rchitectures that support the scenario at the right hand end of the scale to be far more dcsirablethan those at the left. It should be noted that if a parallel machine is not scalahle, then it will probably not

. be fault-tolerant; one failed prxessor would make the whole machine unusable. It is eas\ to dcsignhardware in which failed components, e.g., processors., Ma' be ma0.ke,1 oat. lio ,eVer. if the applicationcode must be rewritten, our guess is that most users would wait or the original machine configuration to

L be restored.

1.2. Quantifying Parallelism in ProgramsI ,ea!y, a parallel machine should speed up the Cxecution 0I a wrgram in proportion lo the naie:r ol

" processor;s in the machine. Suppose tn) is the time to execute a pto ramn on an n-processor inachine,*- lhe specd-up as a function of n may be defined as lolloks:'-.

atn)J .ou- up Is clearly dependent upon the program or pro'_ram,, oe, two the mcasurcment Iur,

., ,,,.ftm ×lios not have "sufficient parallelism, no paralcl :iichinc c an be expected to denl(mnN.trat,.". , ,r p, ,pdup Thus, it order to evaluate a parallel machine proper1%. "c need to characten,,c e1

4 nhci1 (r Ixltential parallelism of a program. This prc,,it,, a difficult problem because the anioui!t oI_r im in the source program that is exposed to the arc)tectu r, may depend upon th. qua! ty of th.

o,r or programmer annotations. Furthermore. there is no iW lson to a'x.rrme thai the source procrnan,, ih ,haoged, lndouhtedly. d:fferen ,t gorilhni, 1-i Iti roen t ,i\e dilte;cnr ,mrl 01-1 .,. aM the parallelim of an algorithm can be 1-k LUi,. (I:: The probiemi 11 &,r UUh

C ' 1 ihi mnot programmirw linguages do not Ih.1\" ,-I, i :h , 'pressl'\c p"k ci to ',ho" .11 !he

.i.

'St.

,f I , T ..Z"..',..% .-. '-"- ' "2Z. '";- .: " &, Z''" '-. ' ". • %'"%:..' '

.

-3-

Concurren t ly

Eitabled

Activities

SUbound,'d Processors

/ Finite Processors

,1000)

I T - Time

200 400 600 800 1000 1200 1400 1600

Figure 2: Parallelism Profile of SIMPLE on a 20 x 20 Array

possible parallelism of an algorithm in a program. In spite of all these difficulties, we think it is possibleto make some useful estimates of the potential parallelism of an algorithm.

It is possible for us to code algorithms in Id [301, a high-level dataflow language, and compile Id* programs into dataflow graphs, where the nodes of the graph represent simple operations such as fixed

and floating point arithmetic, logicals, equality tests, and memory loads and stores, and where the edgesrepresent only the essential data dependencies between the operations. A graph thus generated can beexecuted on an interpreter (known as GITA) to produce results and the parallelism profile, pp(t), i.e., thenumber of concurrently executable operators as a function of time on an idealiled machine. The ideali7edmachine has unbounded processors and memories, and instantaneous communication. It is furtherassumed that aP operators (instructions) take unit time, and operators are executed as soon as possible.The parallelism profile of a program gives a good estimate of its "inherent parallelism" because it isdrawn assuming the execution of two operators is sequentialized if and only if there is a data dependencybetween them. Figure 2 shows the parallelism profile of the SIMPLE code for a representative set of

Sinput data. SIMPLE 1121, a hydrodynamics and heat flow code kernel, has been extensively studied bothanalytically 1l1 and by experimentation.

The solid curve in Figure 2 represents a single outer-loop iteration of SIMPLE on a 20 x 20 mesh, whilea typical simulation run performs 1W0,000 iterations on 100 x 100 mesh. Since there is no significantparallelism between the outer-loop iterations of SIMPLE, the parallelism profile for N iterations can be

. obtained by repeating the profile in the tigure N times. Approximately 75% of the instructions executed..volve the usual arithmetic, logical and memory operators the rest are miscellaneus overhead

0%

,,~~~~~~. . ... .... , . -. ,. - .-. ' '. ..,- .- v , .- .?. ., ). ," - "

, , .' , -. ' p., * 2 , *,'_ ,'r''.d ' -"

.

4-

" Utilization -

Ideal - -"t<u

Speed up / I -

I--7-

.0,4 //-ct I '!

_Speed up _ _ _ _

.0 120 180 24( , 30 " 420 4 i(

Maximum Permissible Oportitr n: p,-r Tio, Step

Figure 3: Speed Up and litili/ation fo 20 SIMPLE

,.,,,e of them peculiar to dataflow. One can sly dedace the parallelism profile of an\ set of

,,.?e rr from the raw data that was used to generate the profile in the figure: ho'.ever, classi F'.,opciator as overhead is not easy in all cases.

The reader may visualize the execution on n proce-,or, by drawing a horizontal line at n on thea.-,eli m profile and then "pushing" all the instructions which ar, above the line to the nriht and helov.,e line Tie dashed curve in Figure 2 shows this for SIMPLE on 0OO processors and was generated h\

iL; i'fi,- graph interpreter by executing the program again wilh le constraint that no more thal N-,,r , were to be performed at any step. Howev,,r. , o d e.iate for t(n) can be m \I. -,,

: !,C-,Cnsively, from the ideal parallelism profile as follows. Fo;r any . f, f <n, we perforim aV pp,(z."rtiens in time step T. However, if pp(')>n, then we assume it ,ili take the least iteor greater than

-N steps to perform pp(tr) operations. Hence.

S rrn) YPP"tI1

,1 ,,,x is the number of steps in the ideal parale ,, pr, le ()uLr of (n conscr at, e... the data dependencies in the program ma., pNmritm ;c c\c,:uti(,n of some frn.lrustromF. , ic last time step in v, hivh instructions from Pt,1 kT a;e C\Ck cd

In ur Jataflow graphs the number of instructions J c'oicd Aoeis not :haogne vhen the pro,, ,m i,S..... .on a lifterent number of processors, ence. r I i , J:i : undcr h eIlk l

nh ClI-n '10 rz(V)C,-jw1t(l )I t and u'i-ao' e. 011" 'n II\VJP

I . rt.' t ilatl,)nftw is thml a. prm :rani has n paralLt' I v :.., 10,' y! J lz tr t}[a ";; !I'' ?7 l ' d : r i: on(

. ;, -C agu.d that (his prol c n do os Il a ha c c i oy,, h ', I II.I I k& . ] 1(4 ) ,e-0,'., . t1

K•

uiid.(!)n the Other hand, if' we cannot keep 10 processors fully utilized, %Ae cannfot blame the lack ofparallelism in the program. Generally. under-utili/ation of- the machine in thle preserve of massiveparalleliSm 'Stems f'rom aspects of' thle internal architecure o; thle process;ors whiich precl ude exploitationof'certain types of parallelism. Nbachine, are seldom designed to cx ploit inner-loop0, ouIter-loop, a> wcll asIinst ruc tionl-leve l parallelism sirnuhaneousl\.

% It is noteworthy Lhat thle potentialI parallIelIismi va ries tremcndouL\ d uring cxcntion. at hA,. It% i or I iich in* ~~~our experienice is t\ pica! 0of even L1he most higly parallel program s. W e Ccta ~~lrcp~eca

that run.s for a long time must have su flicient parallelismn to keep) hundred, of piew> r w'i ied: severalapp! ications- that wev have studied support ,his belief'. I lowever. tpa)rallel mntw a h ie1 ts !(N h, ft' e:cpurpose and programminable for thle User to be able to c xprce, eVenl if II:~ a . '"iet'r

ctlatin -b~d ~mu!at onprograms represented b% SIVtF\ 1.1

2. Latency and Sx'nchronization

W\e now di CLoSs theI issues or' latenc\ and ,vnchtronizatwon. We. bed(\ n- s ad t - rond afunction of' thle phv.,ical decomposition of a niultiprceN . wi \cretaL w.-nfunctionf Of 11h',k programis are logca 1v dotip d.

2.1. Latencv: The First Fundamental Issue

Ary ulprocessor organization canl be tlou ehlt Of aIs anf in)terconP. 111112k i : oc types oimodule, (see Fioure 4,:

*Processing elements (PE)1: Mlodules wAhich perfotrni -atithmectic anmd logical operaid onT) .tdlaa. Eachi processing elemnit has a in:communication port throin~lh which a!t datava1lues are rceived Processing elements iaitcract with other processing I elements '0y

,cnl~ 11c,;ags, inlerrupts orsnigadreceivino svacltron i-bo Lvignazls through,!iamcd memorN% PI7: 'ntract with mncmiory elements by issuini- LOAtD anid SFOrRLjiii w> c n>nd ihcd As, nccess ar\ with titoiflicitv constraints. Processing clem'tits areJi,,atcri/ed h.w the rite at Ms ic:h the\ cani process tnstnictions. As mention-c. s assumethe )nstrnk tons a impe ~'fxd and floating point scalar arli hrnctkL. More complex1u1-mic:tIOon,-il 1-knh counted ain mul tiple ins tructions for measurng, insttuction rile.

2 \L-11101Nt clemirent s (M) \lodu les ss hi ch Store data1:. Fach mnemoryN element has a sineleCon:ioi. t nfxrI Ln r\ eet respond to rccLjues ts 1I sLued b' Thle prucessing

cce t h\ re rii .t *''I'1 the :0111111u1nIctIon pr.and are chaira1cteri./'ed -v thleir;!.''. 2: rai .c :1: T c 'v:nild to :lr'c r~iet-

nniihi~tia 1,~ei o i at three comnnication ports, (iornmunicatioti ecemntsi,neier itc liki 1eeC;,. m. l~n,'.''~a~ l-~~n.o ata:I ritbr, thle\

'teni los 2 ce : ii OtC Of 111C Coninlun11ia J101 p)orts to onec ora-if. '.'' icr, en i.111011 porns (Tn nIton elemients ire chairacteri/ced hv the

j!,' per ransa.,sipn. and the constraints impovsed b\ oneo in1 11 .;''AIT I;' 11 lI1,I1[he nIntt am11ount of data that nma\ be Lcwxeed

.2'..~I. ''c li ft,1 11 capse. octII ~cnna IC rejUest ant recci inp [the associated response,, The

I 1 71 ur I, e T, 17\ Irsse Ic ~I TI Io' VetV I h 'TI ihc' N I C C 1 1!I1rlr is\C

P i ' i I, T, dcii t ri ' 1 , I _jir i\:r ~ hn'~ic~r.'n a.~

-6-

M1 M Ni M M Ni NI M,

7 C I/LC CC

C C C C C C -- 7r-CPE PE PE PE PE PE iI",

Figure 4: Structural Model ot a %iu1Iyjrokcssor

.Aho\ e niodel implies that a PE in a multiprocessorste !~'U'( lzv-er larttfl( v in mc'Pifrv rc/'eren. s i tn az wz':proc('.%str system because of the transit time in !he communication ncl~kork betweecn PL :s and t11ciocnorlcv, The actual interconnection of module-s may differ .4reatlv from machine to machine. I or

* \eplc.in the BBN Butterfly machine all memory elements ate at an equal distance from all protcssors.khile in IBMI's RP3, each processor is closely coupled "itli a miemoryN element. However, Ac assumec

itet the average latency in a well designed n-PE machinc should hc O(logn). In a von Neumiannr r,,css or. memory latency deter-mines tie time to execute memnory reterence instructions. Usuallv, the4 ee r~ccMemory latency also determiries the maximunm inslninion processing speed. When latency-an! ~t tbe hiddenci via overlapped operations, a tangible pcirfoninance [xenalty is incurred. WVe call the cost

.'e ;'dkith latency as the total inducL'dprocessor idle fidn' attributable to the latency.

* 2.2. S% nchroniization: The Second Fundamental Issue

'A' " i1 call the basic units of computation into which ipro-:-anc- ire deccomposed for pc rallef execuimon* ~ iaru~i1tasks or simply tasks.. A general model ()I 1Feral ci programmins! must assume thlat task

* .rc ~rea-d dynamically during a computation arid die after having produced and consune da.s~..likms in parallel programming which require task S. nchroriidtjoni include the tfolowing basic

1. P-odu(cr-Coflsimcr: A task produtces a (Llt 'f'eti r it i, read b anothecr teik, IfPTrOd1uCer and consume tass arc executed in pare! lcl, nclirniainsneddtavdtercal-before-write race.

2, A -rk and Joinsv: T'he join operation forces a C\cmnst to -Tl et in1dicat1ing that two taskswhich had been started earlier by some forkim' oper-ation lie' c in fact completed.

1 iiril Vlclus ion: Non deterministic events " lich ni i : N, processe-d one at at time. e, v,nali,'ation in the use of' a resource.

i~ iinimal support for synchroni/.ation Can IV pro0\ ded 'L III i In nstruct ion,. such as io!oniicr.i I \N1) SF-T that operate on variables, shared by syn hrmiiiii !asks 1 lowever. to clarify the truc cost

:1,j lrCjj, necessary. ati)nik suct~r .Jh vs 1\I fl r : ',,,' ha. l, !1 AhO, .' hue

n ~ '.:~ (t~c~tifls SeCC Sctiin 41..

-7-

d pecld Tasks

and Ready- t o-Exec u e Tasks

Requests

Comm Newly Created Tvskc

plusM Nemxory

Requests and Resplou4,-

Suspende T tI

SYNC PR C

and ~H eadv-to- Exec ute Ta3k s

Newly (Created TI k-

Figure 5: Ope rational Model o I a MvIulIti processor

ot' such instructions, we will use thle OperttonzaI Mfodl presented in Figure 3 . A.k: inl the operationalmodel have resources, such as registers and tmemory, associated \A.ith them and constitwt theo smallest unitof indepeOndently schedulable work onl the machine. A task is in one of the three states; rt adv-to-executc.

cvctin osusperided. Tasks ready for execution may N- quudgclo loballN. When s-lected, a1task occupies a processor until either it completes or is suspended waiting for a ,vnchroni/ation signal. A

Sask changes from suspendlet to ready-to-execute when another task cau ses the relevant s~ nchronizationevent. Generally, a sus pended task must he set aside to avoid deadlocks"- The Cost as soeicted with suchj s\Ttehronization i,, f the fixd time to execute the svnihroniz-ation in, truetion plus the, tiie taken to S.itcht,~ (1PHhcr task. 'Uhe cost of task switching can be high because it usually involves saving the processor:,cc, that i ,. the fontes t associated withi tite task.

t~~drthe ca-se of a single proce, sor systemi which ii. ist execute n coopetitlng tasks

I%

The ,re are several subtle issues4 in accounting for sN neti );,/atic- . %it'.r event toi cii,: w~r i{.cat As ncd,: a niame, such as that of a register or a m.-rlj,r c~n and! ('s nchoni,: ilion c("I'1.11.1also neiludc the instruction- that generate, match aid reuse dc-J1'crs sAhich namie s~ nchwm/,nio'~nclvents It mnay not he easy to identify the instructions c-o-ute~ ! or this purpose. NCctes suchrsi1ri_.Itions represent overhoad because they wold not no prose iti I: t:,,e program vkcre wntlen t10 ect

On a si.'gle sequential processor. The hardware design UsUafik dt t he number of names, available lor%vncaironi.ation as well as the cost of their use.

The otheri subtle issue has to do with the accouniting !or tnr .' a.v \n, hrimnizaion. Asv'sJlsenjection 3, most high performance computers overlap trio execution i:rucn-Lons h(2lon,_ing to on,: ,ask,

The techniques used for synchronization of ins tructions in such a stuLanut c . instructio, lisr: .ind_i, Victsioni are often quite different from techniques for inter-task svcrnao.It is usuitl\ saltt wnd

Jcaper not to put aside the instruction waiting for at sviiwhroii,.ation evNent. hut rather to idle (or,;~t9 to xecute No oil instructions while waiting). 'I his usuall\ done under tioe assun!)ption

.tIa u'e dle time will be on the order of a few instruction c~ckN,. \% c delrnc tlie synchronization cost il

.ec~ ~Ionsto be the indue ed processor idle time attnhbutale t W~ %aIT_ titr tThe s~nrchroni.ation ct..em.

* 3. Processor Architectures to Tolerate LatencyI:n this section, we describe those changes in I tn Nuriu, rhtcue hthv iccl eue h

I cCt of' memnory latency on performance. Increasing the processor state and instruction pipelIning1 are(liet~vomosteffective techniques for reducing the latencycot sngCa-I(vhptebstifiid

- .Awed1-sgnm to idte), we will illustrate that it is difficuit to keep miore than 4 or 5 instructions in thec.~ ~ -.,e:t umn prcso.I ilb hw htecrN chanlge in theC processor architecturelid hs permitted overlapped execution of' instructions has necessitated introduction of' a cheap

,, ntchronlzation mechanism. Often these synchronization mechanisms are hidden from the user and not',,,% tr itr-task synchronization. This discussion will fur-ther i'llustrate that reducing latency frequently

* ne:rcases synchronization costs.

*Bkfore: decricbing these evolutionary changes to hide latenicy, wAe should point out that the miemors* systcri i' I multiprocessor setting creates niore problemns than jist Increased latencv. Let us assume that

:iall memory modules in a multiprocessor form one global addr ess space and that any processor can readnvxolin the global address space This immediately brirties ul, the I ollowing proiblemns:

I he time to fetch an operand may not be constant because somec memories may be "closer"than others in the physical organization of the machine

*No useful bound on the worst case time to fetch at, nrupe d iiia\ be possbea ahn

design time because of the scalahility assumption This i,. at odds 'Aith RISC designs whichI treat memory access time as bounded and fixed.

pi-), essor were to issue several (pipelined) i:cnot. rcqkLtkc: to different remote memoryndcthe responses could arrive out of order.

A .1; ot theseL issues are discussed and illustrated In the tf'llo~xing sctions. A general solution lorace imemory responses out of order requires a ,Nnchroni/ai ion mechanism to match responses Nkith

AIe det lnation registers (names in the task's context and hic intietcions \.kaiting on that value. T'heTmcri eneor HEP 125! is one of- the very few archoP':c1turvs 'A th ha.s provided such mechamnismis in

* ft '.: oirne ann framework. However, the architecture oi i I, HT' is, tqilficient] v different from voncurrti arcitectures as to warrant a separate discussion seec Scctlotl

3.1. Increasing the Processor State

F -iwr t) dcpici h le miodern LJ3N IC'A ff 111 \("I "I '~,I I 1pnr[' ' wc "t 110t In the eailiestcoi pte s.-uias F DSA('. thil~Se O so Ii e Is ' ace ilitor qu .ticnt register,

arnd a program c-ounter. \lcrrorics %kcre rolativek slowcnp.ALm aC Si n is h ietlct ich ani In:;truction and its opcrands corln plecc\ doniinued ilhe insirt!i Cot 1 2 ~Agu h

- Arithmetic I. 'nit %kas of little usc [iie'tie nlernoi\ tesi me ccuid tc rc x, cc,

The appc.nrance of miul tiplc a ccumuitlators' reduced the number of operacd I I'iY'a store.,, andndcx rc :itcr%; I rim .tticai rc(:uccd the nun; her of' inlstrcitors cWIU1, i'cA In n tilc

ndorsk tno~ n eceSi .e oh inciory' tratlic was draslc2\ C \ .p.cesKcudrei

1,1 cr Iha l!r I Io \\ c-, th eli ; 1 n- d processo 1 statl ! Ce c I ~ T. L! hr1jrc ItcrI- I ,c t*, I I 'I U CrtI r I i tio H oM rOW ih . 10 J11 iivCrya rc.tac 1.jer 'ic Ai.lllproked cenF ith :I,,I. "')%Q In circuit "pc.:d

3.2. Instruction Prefetching

Thie timei take-n V > int on ectchl '1md pcrhiaps part of' in,,tri-.01 Me) dvi Crrc i'

if pre fe'.ba i\ &doac , d 11,c Cxtecutio)n phase of' the pr-evious i,,tr u '* i-iniIdt rkept ill sepaate i noles, is k ossible to overlap instruction p1o'!'c':n all,: opcr i I. f1'i a!0(The IBM STRET"1 1 anld Univac: LARC 167 represent two of the c, ie't alt',n-r at im i.! ii itn

o this idea.) Prefietchline can reduce the cycle time of the machine by ,Pty to h,7 peric e ~cI-Lkki11gupo te montoftime taken by thL first two steps of' tile instrcr ''vi t ) the complete

cycle. I [owever. the eff"ctive throughput of thle machinecanrri'".ppoonl 1 bcusoverlapped execution is not possible with all instructions.

Instruction prefetching wAorks well when the execution of instruction ni does rtot 7,tdve af\ ff oneither the c-hoice of instructions to) fetch (as is the ease in a RRANCID or the coritent olf the1 fetjhedinstruction (self-modifxing code) For instructions n+I. ni+2, ..., rk Mhe latter case is usally hand led by

siinlv u~tairr it.Howver.effctie o,,erlapped execution in tle presence ki' RkANC If vircio areinemd a problem. Techniques such as prefetching both HBRANCH targets hmve shown littlepcrlorntanee/cost benefits. Lately, thle concept of dvlayved BRANCHI intlrictionis from mircroprogramrminghas been incorporated, with success, in t.OAD/.STORF architectums (see Section 3.4). Thei ide-a is to delaythe ef'fct of a BRANCH by one instruction. Thus, the instruction at el± I following it BRANCH instruction atn sal!sex 2td regardless of' \vhich wav the BRANC1H at n Loes. 011e can al\\ a\% s follow a BRANCHinstruction witli a N-Oinstruction to get thle old effect. Hioweve,-r, experientce hiis shc-. n that seventyperent of the time at useful instruction ch e put in that position.

* 3.3. Instruiction Buffers, Operand Caches and Pipelined Execution

The timec ia fetch instructions can be further reduced by prOV'(fing a Fast interr.E'In machinessuhas thc C'DC 60WX j-10) and the- Craly-I 137], the instruction buiffer is auni~nalrc.Jli loaded with n

instruction', in thie neighborhood of' the referenced instruction (relying on spati:tl 1o al ity in codereferences), w hene,.%(r the re fernced ins tivction is% found to be- ri sI-iing. To take ad%. antace of' instruction

*buffers, it is also ne.cessary to %;peed uip the operand fetch and( execute phases. 'Ihr Iii us uall v done byproviding pncrarl.d caches or huffe rx, and overl appmgo the operand fetch and cxecution phases5 Ofcourse, balancing the pipeline under these conditions miay require further pipeliniig of the ALU. Ifsucce'SsfLI these techniques cart redluce the machine cycle timeL to one-fourthl or one-lfllh tho cycle time of

air ui itp d pe ain i. I towever, oveil appcd execution of four to five ins tnictions in thle von Neum ann

A ck -.ki[0 in Section 1At, .i~sin ai rnuw iproceqs.. r inrg create spec-i l-o~shcm

% %'0%

._ '.

-10-

Memory

Meniorv SIde

Processor Side

t Local iL o Processor Register,

Memory i

I / I /

Cache

Figure 6: The von Neumann Processor trm Ga ski and Peir 120l)0

franiework presents some serious conceptual difficulties, as discusscd netI.

Designing a well-balanced pipeline requires that the time taken b, various pipeline stages be more orless equal, and that the "things", i e., instructions, entenng the pipe be independent of each other.Obviously, instructions of a program cannot be totally indcecnd,nt except in some special trivial cases.Instructions in a pipe are usually related in one of txo was. Instruction n produces data needed byinstruction n+k, or only the complete execution of instruction n determines the next instruction to beexecuted (the aforementioned BRANCH problem).

Limitations on hardware resources can also cause instructions to interfere with one another. Consider!he case when both instructions n and n+1 require an adder, but therc is only one of these in the machine.Obviously, one of the instructions must be deferred until the other is complete. A pipelined machine mustb e temporarily able to prevent a new instruction from entering the pipeline when there possibility ofinterference with the instructions already in the pipe. Detecting and quickly resolving these hazard( isxcr, difficult with ordinary instruction sets, e g, IBM .70. VAX II or Motorola 6X(M), due to theircomplexity.

N maior complication in pipelining complex instrucion, is the variahie anmount of time taken in each- , stage of instruction processing (refer to Figure 7). Operand kltch in the VAX is one such example:

determining the addressing mode for each operand requires a Lair amount of decoding, and actual fctchine.:an involve 0 to 2 memory references per operand. ('onsilering all pos,,Nible addressing mode

,Cnmbinations, an instruction may involve 0 to 6 rnemor\ referenices in addition to the instruction tichitsclf A pipeline design that can effectively tolerate such vanaiois i, close to impossible.

3.4. Load/Store ArchitecturesSeymour Cray, in the sixties, pioneered instruction sets ( I)(" M t. ('ra I i hich scparate instruct iN

into two disjoint classes. In one class arc instruction, hi' me" i.rt,i '; 'eI bet l , s e C , e knl\h' eh speed registers. In the other class are instructions hich opcraic on dat i the I rgicsters Init lclioll

- -

• % c%

%'

, . : " '- rO _ e, I& W .. .AlN "Mll~ l I m n m u m rn

Yetclit Inturnl E 4

Fetcli Operands [7

I ,

in..... ----

Execut SrEYJ[i I I-

Figure 7: Variah: (pcrand bct' TimeThs.i i liin lo

o the ,,cold Cass cannot access the memor, This ngid di.,tanlion . \:i ,,'.n r .. 'eu!ine

For each irstruction. it is trivial to see it a memorN rclcrcet ! N, P,' ' , k" r , e,-v r, the.memor, s~stem and the All may be vie. e is paiallel, norrtcr-, c.r M, In'trutrn

dispatches exactly one unit of work to either one pipe or the oit I'r. iH ;ic\ n ,Such architectures have come to be known as tIAD ( S)R. ar-Chl:C'1Lurc,:, aki 11cl144 M0c h', uilt

-- Reduced Instruction Set Computer (RISC) enthusiasts (the 11 N I1 NOI 4, Po-,tke', v's R 1, I , andStanford MIPS 1221 are prime examples). t.O,\I)STORE ,architectUres u;e th" ttmc hs,'..,n instructiondecoding and instruction dispatching for hazard detection and rtsoluiion (ce [igunR ). The dc, ign of the

- instruction pipeline is based on the principle that if an tntructior 1ctv Past P 'NI c i,, p p1 - .saC. it, should be ale to run to completion without incurring any previousl unault. I,,ted ard I

-i.)AD/STORE architectures are much better at tolerating latcncie-. ini t,., aIl cce,",': t1:1 other vonNeumann architectures. In order to explain this point, c will tir,,t 11,,Li ss a simp' lied nidel whichdetects and avoids hazards in a LOAD/SrrORE architecture sinmilar toi the Crav- Isume iere is a bitassociated with every reister to indicate that the contents f the rcgster arc ndi-rg.-ng ichiange. The bitcorresponding to register R is set the moment we dispatch an instn!:t on tatx, *ants to updateR. Following this, instructions are allowed to enter the pipelin, only it O:ev do",,' need to reference ormodify register R or other registers reserved in a similar way Whencvc: a valu: is stored in R. the

* reservation on R is removed, and if an instruction is waiting on R, it is altiw -d to proc-cd. This simplescheme works only if we assume that registers \Ahose values are needed by an imnwtion are read beforethe next instruction is dispatched, and that the ALU or the multiple functional uwit, within the ALU arepipelined to accept inputs as fast as the decode stage can suppl\ tleO. '[he dispatching ot an irstruction

can also x- held up because it may require a bus for storing results in a clock ccle vhen the bus isneeded hy another instruction in the pipeline. Whenever BRANC(' mstructions are: eci ountered. the

* pipeline is effectively held up until the branch taret has bcen decided.

Notice what w ill happen when an instruction to load the contents of some memory locatio;n M into someregister R is executed. Suppose that it takes k cycles to fetch sonlething from the mcnorv. It % ill he

"In, ecd, in the Crav-.1, functional tnits can accept an input every clock cycle and regislers are alkax s read ri one clock cyclefter an mi-.ru'tion is di,;patched from the !)cceder.

.'1 '''' .2 ""+ -- "' ."""' -" " ' "' . € ,' ."' -"' , ."" .'"" . 0"" .' ' '

+ - '"'' - "'"" "-> 2 ." -+- - _' ' -

0,"- - , " " % " "% " ", " " " - % '" % ' % ' " % " " " ,' " " "" . " -

Bik0Bank 1 ~"1 ,

Itri 0 il

I th

PrefetchB tuffers

* Decoders

v ~Serial izatIion

_________ Hazard

Ave id an ce

Furictional tIjjjt

Dispatch

Figure 8: Hiaitard Avoidance at thc hI n nn ot ord I cg

*i:to cK cute several instrucltion during thc,,c K i, 'maco of thcri- rlfer to rc 'iqcr1' . Ir.Li ausituation is hardly' diffecret from the ('ic if)i < p, lo N, loaded f-am sonic IMInut iiat

- ii;l. kc the Ficating Point multiplier. takes sevcrW c ',raluce thuli rcsuIl T hcsc gaps in the;pciic .o .:! he I urher reducod it' the compiler reorder,, -i ftrc ' h that instruict ions COnn1iTO

! :tV. aic- put a,, far as possible from instruchions pro(1ula i~i,!iW (l1_12.I Thoi, %ke notice thai machincsho'idtr high pipclining of' instructions can hide dm2rv'ur i itt ncie\ provided thirc i, Ioci

* :Y.~flamorli!- instructions'.

- b.i c~rdt it.5 oil poccesor register,,1s it 1., 111Y Ofn~ ot l~' namr al iii H I

0. rTI n~i~wn IC I 1 !I I

I ~ IOT1w, fl~rlllI'''4 SI1l. T 'd, lP' 'i~'c '-IC . ' ~ ,%

N.%p

eL...............................................

--II: 11 th. (A --w -as o--; \T . r.'v-'r'v-WrJn'yw -II[l j -r 'j~ r . ''4,

II

Iu rcI rIc I t no n ru c I ~ k )th~ r Ii J' I JI II \ Ir Irt h: - v d ti"LJ 1"' I[ I1 ~ N CI I I ' h iiC Nk)IC

Cu'i ' c nc Ia\ cW ALWern'I C! .1 1, T\ a' K I< 11 J' I* 0'

r Cr'( £TAV 1 C 11 . OtB1I I t Ii' 1-IkCd1 1

Soi- I N.II L I "C III IifC

111p Ic ,r I.a w rrrnn ha.'ard rco 1 iffOn ol]i it IN ~ fic X, . * I' rtI I

r'.,j'. u IW r1T'tcI0ons xchcrcvecr iic'.:i r\ hJcc ii, th'e n;t o .t

10. Ilk.. tO I Wt LIKI.ILnI tC

I LI I IC CO''' to1 [rc t~ Iiii fIl rugun0 31 111 I 1 l Ilk, c ~ k IN T I .

'22' 0 ii'd md I's tw t.iai 0 W. I' I,:' ,Il '' I I III II 'I ' tI , c 1

PC JC' I 1115 RISC 01j('hincS I o[ r ,L thox %i v iN varahlc -vit .. .4 . 2, . '

I1 I'l S'C nit-h-n '. w'i,' i-- i'rrx'J onT t h:is )f i -iL

tlr,,s'iri-! I, in tile ,:A ih, tile pipeline tops., Lui'-dilkrtr onle c,01 C Il;..'_ok :.& I,,i j,t4' /tT'd [(t tie iflic rt.ijnircd. Pm, ' uh mII Il kcT

4 ' , ('Ot

there can t-, either one or a y rx'nsali number of en iI-ro rc fer:;, K i ''C'. F-oreaipI c, min he('ax - I. no nic than t Our !ndcendcni atkrss I w 'ter~l 1, Ji. li , 1) 0 "

lek. It' [ile genecrated address cau~ces a hank . rr'.'ia, tht pipclnIIN '''pp VIV. KrT li IN

rcscrivcd Iii at icu)Lt three cxclcs.

1. )ADIYSf 'RE ar AChicrs, becausec of thecir SilillICPik nuf-'r'TIi ,. K >I 1 o;' nucinstruction', thaIn imachines wkith nliore C0ompl)cx r'T'tti LI T ' IIr ;.r. - ~ re a,,

* s\nchroiizitlon cost llovw'r. this N L.\:1T''acl T r~~'v.7 'i.o . 'wdrA

- ~~~possible hr simrier controln'?'rlin.

4I. Srichron ization. Methodls for \upoesn

- 4.1. Globalu Sched ii rig on Synchciorous imichi. ow"

1-I07 a Iittlly sx nclir'',noiis rnlultlIr'TKN tT'0C '1C !1 ,h)'N'. I .W K 1' 1011 J Ili *.I11': K ' I A . [

opela:irrols 'or e_-.cr\ cycle, on ever\, proces sor. All atlalIll\ _. 11 hcn" d ma etA Te ! I ~ Tf ~Tc~l'inrITmc -o (o11 dini~ a hori/ontallx m icio-irroerail > IiI!~a t' ' -I -%' r

! i~e M uCh colde rO'-cnei'atioii !Casihic anld 'Ur wlri I re-u Imii1> IT '4P(%. fl '111 riT[I ~ ~~i:;s''>uior, .nN -it processor>, Cy inec. iA Multifioxk KO'ipj'Itc-N \khlrh mie 1lise! oin prot-)oa1I : -5 , I"C I Ir\aill it of VA. ' 1 1 ;m ! 1( > ' !" 1 :1 1' 'I,' ILy wK.]V T KiI'Od to

'' 1')f 1 1_n ?I 004.1!, or \"LMI . (ja, hies beas ct ill,,ul-I aTh .''r ' ItIlplK'

111e in:R('iTlr- Wle TT2 LIciioIAl unit 'r prt-ccssinlg lemenr ''i 1 St111 'V I h kcv ni Iia\ini/ing

ile rh-c (ii reces,:, aiM TK'sI4I\ini jpotemi~Ll run -Illic' conflicts iii tile use of rc~KoirI-IN it compfle tie

rI'IecrL:iK(Nv and control transfers are ncrtc asin kISCahictre hut W4.1K. niultiplC'

'Ii' 41011 irc:'r oi'IrlpI t:' av hc'ing schiedulci inlstead~ of k'rilly olIC (;,, ';I tdie pvsSiity ofK'I 11I1ead r;111:1iig 1-111%al insructions inI parallel. such arcLhitectures, an' liighk appealing when one

rcatl/es tI lc fil tIt nrach inc's, avarlahhic now still essecntially decode l( nd H-patc'h iilstflk t ons one at a

14%1 *

~'C Lilai this techinique I" !tlctinr hi. LCto'

a ~ onl a small number A to! :: tp.. No\ ;ieA1C1i heSOII this iLeVCl.),,Cc r. tx. ornes intramcta It is un clear hov~*v~ t.h e iarn ic irage allocationi or

rctu r !ondewrninistuc and real-time constraints "il ii!,I\ no u , i eu

-i. 2 . Pintrupts and C~~I~e ontext Switchfmit.

.. \on NeCutlialn ma' hilie' are iabic (i i. i_: ierap-Nt>m~~til

- ' -'OF\ h'ied On such ma11,1h11Ne peCTili tt,. _.:rr 17M,-7_111711[ a,, Iteat> lol-

Ce.flotec, iir-upis a'e ra iter cki S 'r.:; ! .Q~rstate ik d

,I Pifie state-saJ\ irte nla\ fxe lorce! lh\ :11h 'at, o> -ii'h C aioi h0cUr_, Or It nmay )cc:ur expitetik. !. C. unldc! tu [ .. ;e :neuuncr. S !a ,it Ll;' Ner\

-t.cinor a suite of lc,,, coirple x Locs". l de; !0( o .:c state lw no napnsfh- '' a n to niote is that each interrupt "Wlo P&KII d inwuiv ni ti oI tiattic acros;s the

I iror\ interlace.

'L..OuIS discussion, "c concluded that L,1"ioe prOVQc'1 'KtC , 10 ' caus it provided a man:i"oeor latenc\ cost. In tr\ ing to solvc pwta n , to 1,'\h I svnc!r;,'ai cav

in aroos an interaction \A hich. we believe, is mor( ill i:. coiricihntl. Speci!Ica,1l\ IT! %\CE

*~~ ~ -n'' .mann processors, the 'ohs jous" synchro nization n1K' Cl-ia~n~liw'i n' ill: ot-iN \.sork ss clii xw ol infriiurit s\rnehroi iaton eCilts1 or s .i 1t1C flw:( 'o.1 statc A11iih must

(r \fljj Sidaiother As a. reduc,.ine- the o,,t ol fLi*,\onL makigntrut!CIITr111N entail increasing the Cost of memory01 LIMIcs.

is s4uch Xs the Xerox Al\ to 42, the Xerox lTr~it, J2 nillCt Symbtol ics' 3bW 11 1:_d te~hnique \klaLh masi\ be called mi, (,o;r, Y-v . ' Wtthing to al Io\v shifntz o0

'oxebs he /O dvie aapters. This is ac:comtpi 1' hV5 dtpliieating programimer-s nbl

~ier Aords, the proceessr ,tate. Thus in1n 011 I'llrctx I h prcce'zsor Car, tbe swilched,k.it hout causinrg an.% memory re terc.es, t a 1 eso sta[eO. TVhis dram iatical N

-t 0t proCcSing ceriain t\ pes of'C ents, the ase r~'ll Iinterrupts. As Far as \5ke kioskkI 1tedL the Idea of keepirnimultiple coritt :~ a a 'htipr ees',or setting twAith the possible

M 1to be d iscussCd in Section 4)L~ui iL . CJC :rdueSVnchron'/' u iti o,,i over

,h :call hold onI\r a 111ogle c\nkext. It W1a1 'Y 'V- Ci '7 Witi about adopting0 ihi> scheme to01 0!. ~I a tiortlot-il mecniory rceo'':i ,,, I

othis approach irm, 'iu Ii .. ra'osfi fv ha.c It a millI_in stq n~rie tr.;rh .iate te I.11tx jod. 1.i'A !Isel task

necesans tke cr t he Ill m. ,..ts tiihoi on ca ll tonl% havel.ItidepeIndentl coI1text>' V, 11t1Lou Colll' C; tIII Oin 'ot \l L17 unrssa

wciis and tIhe I It racompu)Itc'r

.. es.the niost conimo.lv 'LI;potlI . i . . . .i c1d,1 ' 'W 01

n IalC of a memiors, leat iiin A proce. h\ 1:":.c'i.~ ~ .~rtt o i 0.the other processor kec~sredurt tO set>. ' 'X :, Ltoct.teoe c1s ti

* a ~~~:. ~~. 'mi uimh ,mnchro'i i/alioi %ki S h .'i.i 2et(O. ' ik

J. P

all I is ps ofl yc hronh/atloll paradigmis mntino' ed c Arlier. Howevrcr. the s~ nchironi iation cost of usingsuch i Instruction can hesk verv high Fssernial I the processor that eseccutcs it g(os into a busv-wait

:1 o il does the processor g!et hiockcd. it eeeac xi~ T veflioiN references at cXCI)- iiistrflCtiofl1C~i until tire T~s ANt) St r insirac:tion is executed successfulk. I rflc;WI,ieiatOn ,)I i F.',i AND-Shi that

l\'rrit nion-bus\ %k aiti ng ip1xcontext \I, itchin1g inl the proc,,or xd ihil 3n' arc n: N. earik i c hep* ithc:r

It is possihlce to improve uponi the FENI-AND- SFF instruction in a1 11n u.1 iPrO CSs Or J n s ,ugg, xtld b\- te \t lr~-conputcr group 1-71. Their technique can be, ilL;Ntrated h\ the armc(1 ' %) <iP

1instruLction (all C oILuti in of' the RFPLA(T At)ti instruction). The institiction- 1re0lTrrs 11 -. Jre"''; e1d\AILR'. And "sOrk, Ni asol los. supploixe 1I,,0 processor,., I i ad j iCniKe5 , s,cnte ii M ,!;

* itsi~acti il~ trfl arncnt.0 (As I ndAJI respcti vely. After onentic-icill bcome A v y

1 Processors i and j %Ai I I rece ive,. respctIve ,r e ite I It c r (n A\ i -1.(i

(A) a' results. lndctermina~cs- is a dircct consequence of the race to updXate 11enlois' ''c'!

Anmc:~c ust ch(,osc beitss ccu a wide variety Of inilenIenCFtati0e !k- P 1! 0 - Onepo)ssibihtv iN th:,t the processor ni J il icnrpret the instructUin ssi .1 a cri( P-- i'k : -~n,

- \Vhi -ePT').be. such it "olut ionl does nect i nd much favor ee it' ,i __L-, 6c '.ahIe 11CnIeors- ~ ~ ~~~ Ir!Ic. c\scond! sc:heme implements tilt II ANtD Kt)P> in the nwmnr-t ia i :. * a -1 li~ve* hsnh% the CEDAR project I5 -j !'his t ~picall) results in a Nigniiic. iat redu Th;oj .' Aorks trafficl

-~~ bXcause a 'it111cs ol nliernur- ,tr,isactiouis from tI tc Tlcmnor,,'>, c: or.1tl ns- - TC s4 ~~~suggested - sthe V ~IA - I.lraconi jc r Lroup iiiplcenLs the instr, - i in i,- ic u t ei.'. i

This trnp~cnmntation calls f'or a coininoi,i packet communication iie"twlr(,. Cs I -)ci s r pr;o),ssOr\s* ~ 1 to1 ar1i polt memiorx - If two packcN collide, sa\ FF1 (11 AND) AIi) AA and I--!t iIf .ND) ,0)( A,X-, the*NXX tch e xtnacts the Xvalues v, and s -ornits a tec" packet [IEli H ANt) A lit)iA,X '-v , ,r trcis i: wi the

ntemiorX . and Ntores, the value of v - teporarily. When the mC11ors rduni the olk 'alne ol' location A.the NXS itch returns tss o values A and (A \ XV he min m i~prosemniti :N thati sorrte i- ncb rnizationNi tuatlions whlich would Ita~e taken () i nie cain he done in 0()g-,'r tim-e. It should hK- noted-. frowes-e.r,thait one esr c ncilNrce nic ria\ i soly s manyv asN 1o92n additions,. and implies Nuistntial hardware

*complhcxit% furtheri. the iss,.ue of processor idle time due to !ateccv lIxan ot been adiressed at all. In the.kor\st ease, the comiplexity of hardssare nras actually increase tiic I :ienc:\ ol going .brouo-i The sI. itch and

thusN :orniple tely overshadowk the, a IX-ant ij;,e of 'cornh nrg o\e r otlw r sin pit.rin leInrntat irns-

Thic simujlation, rc !ills re"porled hv >NYI 17 j show quasi-linear speedup onl the L" [r Acomputer (a shared-mernors\ machine \A ithi ordinry~r von \cumann processors\, employing FETCH AND Alt) ss'nchro~nizationl

for a lar-c \ ariets of Ncicitilic applications \\ c are not sure how to 1,1:Tpret these rasulo; without- ~kriossi ng rnian\ more Oetails o! theirsimiulaion model. - T Ikpossible inrterprer ''ons arc [the following.

I . Parallel branches of a computation hardlyv share any, data. thus,, the costlyX mn nual '- ai fusiorssnchroni,'ation is rarels needed in rea! aPplications.

2 The synchrotliiation cost ot using shared data can be aceptabl\% brouight downAr~ b)N nt~iCioustuse of cachabhloc/ion cachathle aninotations iii the soturce rf H:

6The seconld Point ii: a 1)cCM tict- arer a! cr reading the next sectionH.

4.4. Cac-he Coherence Mechanisnis

Whil hih-SticceSSI'lJ for reducing memory lafenc y in uniproce.ssors caches in a multiprocessor* setting itroduce a serious sy'nchroni/ation problem called cache ioherence. (ensier and [-cautner

110 (eiC ric the problem as follows: "A 'nernorY scheme is coherent if the value rrurned on a LOADnstruction ,. always the vailiu. given by the latest STORE instruction wvith the same addre sv It is easy to

* see that this may be difficult to achieve in multi process ing.

%

..................... . . . . .. . . . . . . . .

c~..r~~ca !txo-proc:essor system tighti ;A I a\rch' .5',, S , J ,1, o'VI .3che to which It hII Ph.,> SI u~ itone on each processor, and we know that the tasks aleC Lct '11 ' l .- niuniat-_hen:m mrsiuftid moemory cells. In thle absence of caches, this :,Atetu., c~~ *n~':Idt, ~r.H ~' c,; t 1 pn

LUthe stlared address is present in both caches. the 1hlI k. L"-' .11 read _,11I 'k ieft ii

j,: erseally changes caused by the other proces-,, I '~eh;uh&I~poe'r~

'Uf1LaIlCes that subsequent LOAL~s %A ill get the most rte1, ii" F lWS cdrt IFICU' tI 1 1*'ccdin terrv~mscof decrecased memior bandwidth.

\1 Xola'wrls to the cache coheence problcm center c:'0 r ' r(,,. os',Ifllon ~dhr lhorilsoL 1,! tim Oe p-ossi bi lity of cache incoherence. Gcrc rhll I N.s: ,*n , 1 ~,i id icaut Ic IjhICoI theC Cac1hed

is~t[, private or shared, read-only or read-write, etc., is ai\iu i ~c cm n i'. o~e\ er, thi1s:,l!e sonihotk has to be updated after cacni memory ,.ece ric'ijj -eitotit d"i's.'d',' arc gcerruih

rrhLexcept possibly in the domain of bus-orienited ."ftrroe o. lh so-ciled tnflpfl has:onuses the broadcasting capability of buses and nu c-;Ifr r trn all caches viten a processor

airpts a STORE to x. In such a system. at most one ' o~ ) ;'rolncngonatatminheso

- .'A 'I nd, therefore, system performance is going to [be a str,[j, tuoction of the sneopy bus' atbmlit\ to!anfle the coherence-maintaining traffic.

I possible to improve upon the above solution it sonle ., Ihoni, state int'orrnation ts kept Ai Ohitlhentry. Suppose entries are marked "shared" or "not)srie A processor can freel\ read sha,,red

btne ut an attempt to STORE into a shared entry imdelvcocsthaIt a~ddresS ito ippear oni thlehi' htentry' is then deleted from all the oiher COi( ' nrli eld "non-h-ared' in the

-or that had attempted the STOqfE. Similar action ,akes pfajc ,11cri the word to be N riticii iso'm the cache. Of course, the main memory must be Updatd before purging the private copN

_n\ ,:,:c. When the word to be read is missing from the 0 a)he 11ri snoopy bus May basec to firstc'n1'' p% privately held by some other cache it~t ~;m ic iete requesting cache. ]'he status ol

an; nt will be marked as shared in both caches. I-he adtix doe ol- keeping shared/non -shared%,kith every cache entry is that the snoop' i'u.., comes into ac~tion only on cache misses aind

aiuhrcd locations, as opposed to all LOADS and SP1)RF~s Even if' these solutions work-i'., bus-oriented multiprocessors are not of muchiitee to us because o' their obvlis

IIir a., we can tell, there are no known solutions to ca -he coherenice f-or non-bussed machines. Itckni reasonable that one needs to make cache.- ;iartall% ie.ilel t: the programnier by allowingy

F! Ark data (actually addresses) as shared or not sharco. !; . h~llton. instructions to flush an entry or(Ar entries from a cache have to be provided. ('ac~nicnn onl such machines is possible

* tin:cept of shared data is well integrated in thc high- level iJLLanuag or the programming miodel.h~e~o alkr) been proposed explicitly to interloA I loaikon fo)r writing or to by paste ah and

i iwfccssiT\'v) on a STORE, in either case, the perfomriatice goes down rapidly as thle machine is- ' I nieally, inl solving the latency problem \]a mnultig I cachecs, we have introduced the

- -- rii/ation problem of keeping caches coherent.

0. -,nrc tie. tb at, while not obvious, a direIct trade o~ t ol en ex\hetweeni dtCLream 're tb- or ad increasing the cachable or non-shared daYl~

r9J

2.-'

6"' VS 5.W> V VS~W' S 'h~.7V ~ ~ . ~ CUSv < 5 7V7#

5. 'Multi-Threaded Architectures

In order to, eduICe rnC1riio\ lAteic', Co'\[ It I' eh~l C cC : 1 tk !~hl', ; ' i MLile

ovcrl ippcd imtniorN requests. ii:proce ssor fiaio \ic'.k thr i, rc: .. ~ tnra

h Iav e t o he IIif lII p IIelne. We noc tC hat e m o f N\ ot 0) IoIC U r!i V N aw i _- , - a \

-little caipatbility for ripelining. \kith thc c\cepti0r' 0:Jfla\N rleteren.l eIrrV !1*i a

nci lIN hi'litadtiOn are lundamenICTtal:err \'uniannprocessor,, It List ohSCrve' inrLFcI ci1T 1-'

-i:1 mcmory rclcrencs can) CLk~ Ot Of Ordcr 11 h, ")c:r.'' J:rahCji tiro.rso~sin ust bc pr"i rl ded

One vk i% to overconi' the! lirt lccicv. is to idlav Cri I ~ T i

~saw inP the very loney instruction word archirct lUr!- 01Setionj I ~-~ovecrcome 1-y ne di ire a large rel ste:- se:t w i h ir'table rese,-,n:aroquircmon. ar somnewhat in con iffi Th sr tuat ron is 1'u u t!''

corn , m n ct ith each other. Support for chceap Cvcrr~a fP L

quick!\ and to ha'., a non-emptyr qIIuu Of tasks whlich a'lc ready W il (: 1by i nterlea\ Irig- muti jple threads of comiputationl and prey ine w, e( 'i ra, -

a~void busv-Aiits. Mlachines suppoil rrg multi ple threads and '~processes 10ok less and less like von Neumann machines, as the rev-1C 0e

In this section. we first discuss the erstwhile iDenelcor H.," _ ' OP ' iK .com merc ia!lly a ai lable m ulti -threaded computer. After that wc ieflv d-c .s , aee L:,, If> xs hiomay be reg'arde-d as an extreme example of mnachines with :-iultrplc iliia! . c: '- .0instruction constitutes an inclependent thread and onix norr-Suspenidcl Uirreeds ar,' ci -duicd to rcexecuted.

5.1. The Denelcor IIEP: A Step Beyond 'on Neumann ArcitecluresThebasc srucureof he IEP processor is shown in Figure q. The proccrslteot >~u~ sa

eight step pipeline. In parallel with the data path is a control loop whrchf irilitcs rl-- c= 'tk, xordsPSW's) of the processes whose threads are to be- interleaved for c\ec ution. The de!i J: a Ind tilc control

loop varies with the queue size, hat is never shorter thain eight pipe(- sterps Tlrr' 1-m'riu.m value Isintentional to allow the PSWA at tire head of the qu-rue to miii 1ate r inst n1S11in 1-ji 0 ci r'r urn acinr to thlehead of the queue until the instruction has co-npleted. If at 4ca&'1eieht Y'W s.rpro.senting eightprocesses, can be kept Inl Ohe qIIueu. the processor'\ pipelinte wil e; jai l this schlTIe -Is miiuch liketraditional pipel ining or iristructioris. hut w ith Pr inipori an:, Ii flk'renc. '-I te r!,-r- inst, La mtori!nriec

0 a~~rc I ikel v to be v caker hecre because adlaceri inst nlctions in the pij[ in are ~ I liih r-c- jrr t'.m m

- Thcrt, arc 21 -I5 registers inl cach procesor, each process has an ii Ic s WtPi.. th, acr Tcr atra

Ii 1cr-r -c~s. i e , inrcr-tli irad, communication ispossible '. a ti tc e -'i 'tcls h, o' c rl .lpp n register* alloc~ions he lIFP pro~ ides FITIttMPTY REFt 12 VH) hits onl eac reis, 1n ',il tti~t hits oit eaich

- ss oid in the data mcmor\. ',n instictiort 1ri,'Oiichlo, FIVT' Or RES ~i, i u IstN 1'( Iue' like a1No ) tP intrction ihe, programt counter of' if. )t-nccss, i t. PS W, %khich initi itthe fist ruct lol is niotincrcm~Ctcdi: PTe process c feCtively b u,. -wat but w~ithout blocking the proccssor. W len a process

*issues a tL \D) or 1;S')tOR inst ~uction, it is remioved fronm thle control loop and IS queued sc, -aratl\ in) theScedle- urictioi ('it 1SHY) v~ Iich also is~ ut-s the memirorN request. eqet lchaentsati sfied

b~ecas oIls01 i1 ilprolv).r t11, Ii., 1 1 t' sixrils resulIt ini rc,_I L,icIail 0 1 the PS W AI ith in the, StI s loop and also*ill rc\uiC of the :squ'> ire SF11 ;irjneles Up mniniorv rC1rmoIIs(es WiIi queue'kld FSW'. updateCs

rcgi~tcs as tr 11s1r i wi nriis the F-''s !I Ie c iritril loop)[

%

Control Loop

Opcode

Operands

Figure ): I aten, ol:ai 1 a

IiLP is capable up to a [vint of'u.n ~.i rii latency. At the same time it prnuY dk\ 1t ML i a

1:7 4 ,),ecence-bi~s in registers and main nlemor\ i :, ITi~ .pprnich i'he\ lb1, t,Tk.ca use there is a limit of one ouistdnim c. Id\ ih -roe. nf

'ilifl through Thared wrecisterN, can toc f~o~. '

*~ ~ w.... \snous impediment to thie soitv~arc J, 01C> 1K ,t! .11 fx " '"

I ~.r Though only X PSW', may bere;'wolIc to name all concurrent !asks of a n

I Ditalow Architectures

Aarchitctures [2, 15. 2 J. 23 rcprsent i . .

0 <. use dataflow graph,, as their inadoiiv ca hiiil miachine languagc\. .pccikf onI\ a pnlW~1i;

I'~ruuiie~for parallel and pipelined ce.'Ja:inaflow. graph for the expression j*bh ' . a'i' p4. ;

* . .~-. he fore the addition. hov~ o% er. the multi pt: w~n it i- . "'; Ir or c, an im

4 he advantage of this flexibilits hecornme ap, in t,1~ de A i

i,.' , Ail heccomi- a, ailahle ma-, not bek knm~i )LkI2 : \Or

i uiiO h)a may take longer than ca-niputatii'il Ix I! I:~ Nt, tLt h diffecrent operands ntt \,arN du' * nt~Of hec0 ai flow graphs do not forcc unnecessir\ ,'AN.r ~lauI

1.0) iCCOrijing to the avaflahilitx of the pri

0 i 2 ii r xci. Ut o nw ehanisni (it a dali 'A0 'k''i

Ar0- -

nxn Routing Network

a P E

/ " i

'Local Path

SW aiti ."- _ j

2:: I, l~ , ",.o, I

ucture ogra;;I-Structure Instruction Or', 1 r N

Storage L Fetch

- o

I ' ,I+ t

€.- !-'~~i gu re 10: The NI IT Taged-TokenI Dat a f1low '.a:lhi;•

'ALU

.,..(Mo Neurmann processor. We will hrnelly illustrate this using the il] Tlagged-Token arc hitecture /seet [ ure 1). Rather than following a Proe'ralm (,Funteur for the next instruction to Ie executed and then

etching operands for that instruction, a dataflow machine provides a low-level synchroniiationmechanism in the form of 1 4aiting-Matching section which dispatches only those instructions for whichdata are already available. This mechanism relies on tagging each datum with the address of theinstruction to which it belongs and the context in which the instruction is being executed. One can think')I the instruction address as replacing the program counter, and the context identifier replacing the framebase recister in traditional von Neumann architecture. It is the machine's job to match up data with thesanie tag and1 then to execute the denotedi instruction. In so doing, new data will be Produced, \k ith a nc\.

;7~~~~~~~~~~~.-.-..-..-.. ......... ...............-.-....-...-.-..-.....--. ,.,..-.v."- --,",- -. ?..-2--;-- ----

icwang th uco I-." t rl sJ. 1! tu,.Cci~'c; n

::h .agc han the size 01 the register array in at von e,1) cnat i'h': oe'v d.ik 1 Cesois; ton-biclcking: given that die opcrarW d ' a!- j, k. to a.r I i _k _ :sFon(1i11L

T*'.,inr 'ain he executed without further s% nchrornv/:.i'o')

~to tho waiting,-matching section whil Lis n1 'i rd r . h'in 01

o7,the MIT T-Aoe-,red -Token machine pmv ;de A~ii !~i i 1~~l's t1)t)( ''' g~rrie. Each word of l-s~nucturesac S K.d_' j ,U to \heter the

't.full or has pe-nding read 1~u:2.f I''e 'eo 01 a

~1 crof at data structure with the conx.;:.i -sr of lhaZ Jo a'*-<'I the-'cl to n.inipulate 1-structre stor-.. ohfs ax 1re' 'o to' rs ~r t raee.

i.JW Jfc contenits of the 1hword ot 1 \A. ' ~ord.ia, o, twrc concerns dictate that a wucd Wcx .%~ r h it. ..wi Jltok The

p.esartreats all 1-structure L)WranIhJL1 It [li'~.~ .te ll( tn ieeuted, a packet containing the tag of the o ' 'n o f the -c~ :ristrutiori is

COCO' to he proper address, possibly in a 6is.n w in~c ~ atal memor\momP eT'A quire waiting if tiv' dc'c, is not n-1--vit 'p - Oi th. he~euli ma~i he rkturned main'

uci,, nL 'iater. The key is that une instruction il ;R Al iU , I. C bcSUS[enC110 1ut. rig dlh timne.1 .eher, processing of other instructions may continue iimncuia clx aitcr initumton ot WCe operation.

tmemory, responses with waiting Instructions i. uiore via tags in the waiting-matching section-

* One adxantage of tagging each datum is that data from different contexts can be mixed frec'. in thev)l'. 't i n e Kecution pipeline. ThusiLr1i lee :c! vi V oy~!sc ' fet' l

cv-aLarctinlatency and. -ninimize the 7 Jc o.x .~ a ' oei"M -i ro isuso ht cxNc Ie most higl, pi)x'.cd cr: processor cannot miatch

ii 1, ' x'illtv of a dataflow processor in this regzard. A more comrplete discussion of dataflow machines :-. x .I th , cpqx of'this paper. At, overview of 0x: t()r-0m 1',, ~l Kc Dtto~

- I 'nc cani be found in [6]. A deeper undersiar.dir Ja.iuv :;.i~incs c.an he gkutil from 2~1~tnalbeit slightly dated, details of the ",ciJ i, iii ar1e.oti1 ,cl are gixeri iin il! and [51,

presented the loss of' performance due to ''.ridIai't.- 'v.ind wait'rti iat

V 1, tx~o fundamental issues in Ithe de ijni, a7o.o.' larteif c petident of the techno!ogy ifrne . 'Iaftc. ~oii .

presented it as such, these issues are also incenct fi, 'i21c c prog r.ti nm r.,- ide) UxCdo '~~'l''iocsor. If a multiprcvcessor. i : built otl: 7 m.mto i rpt.'os hn~'i toiin

(i;je to latency and synch~ronization ,vi ! .htc .c n ax

-), sIF reduction or dataflow programmingiit 101 I SC1Ila.ci

I IpOSSIhle to modify a von Neumann processor to niake it more witahlc i at building block for a* - 2,1,I:! iiiachine? In our opinion the answer is a qua! i fic ' teI( tx!o irirxn' I a ch:iric ntic\,

* .... 1 Wprocessor are ,plit-phase memnory oper~itt:-1 '2- Ja 0'I tI'ahift . :.' pitt a;1' (Il Intat onT'a~:si nstruc'tions. or whaitever tie scheul in'' m 'h o it in r if- nirlr ,o c''

nail/: Onn His in 0i1 '''.i' are e-vc; IN W '' a iuA ''''' xp

h] .) f owever, the rmore 'on,,tirrentlx active !h,,' 0' Of ,',l alxneh.e he T-r-'r i, the'ten~i for hardware-suppxirtcd svuTcbtoniatiani yilat' '' .;ttlt,-i !.". ifit otliciN 1 ' Ire ,ictix'elx

,!~~,'rl~~ hisi orl! thes 0T' (1'. 11''u v I >y t I' x r j

r)r C, 0i rrl* 5 'r,

% %

'Ile"'cst appc ii of von Neumna-nn plrocessors is thait theN are v41delv available anrd familiar. There is a

1CnJCn)C\ it) extrplate these facts into a belieft, v.J on Neuiao pro,-cs,,ors aire 'simple" and efficient.A\ tccliio1 all \ Nound case, can be made that well designed o~Ncunin proces sors, ai- indeed very

IIlIcn cin i sequential codes and reqluire less mremiory rundi. : th than datatlow, processors.th.c~cr.le e Itic ie nI. ot sequential threads disappears fast it ther ir( to( m anx int-rpn or if"

i,!1n,- of the processor dueC 1o laten~cv or .lata-dcpendenit haiard!s increases. 11,1paj Ulr Ir, 31 isi M C 'I;Zcatin 11 d atatlov, arch itctu res Ix ic wII k impro\ e the c ticinc% Of 1!, Nc ITH raiCced-l Okenl,;;lhiteCtIr'- on seunilCodes A ithout sacn ticinLe an of its dat il'h :~inaes We can aSu IC therc a,! r thai none of' these changes are tantam ount to i ntroducint, a pro cramn :ow iLc: iii the daiallo~k

- ~o Lik of space s c have not discussed the effect of multi -threaded architcciucs oin the csnliiin jLlnguage Issues. It i', important to realize that compiling ItO primitive dataflow, operatrs is a inuc Psimpler ta~.k than compiling into cooperating sequential thireads. Since t-he cost of inter-process;,c0I1M1LIniCAtiOP In a von Neumann setting is in uch gtreater th1an the cost of- cni-i.i-tioo- within aipro css, ithere is a preferred process or 'tri"size on a given architecture. Furthe-nnrire, phicen lit of

-- s\ nchironi/ition instructions in a sequential code requires careful plannino e iau , n iTR-0nacoo to Wait*for a s\ nchronii/xion1 event may experience very different waiting periods in, differcra lcati6ons in the*-proigrami Thu. even for a given, grain si:'.e, it is dif-icult to decompose a prourarn eTj i ml .Dataflow

graphs. on the other hand, provide a uniform view of inter- and intra-proceduralO .Nnch-11,r i1;tion anJ* communication, and as noted earlier, only specify a partial order to utiforceci dqe-Cndcr- ic'among III,* -instructions of aprogram. Though it is very difficult to offer aquarititat~ve nudcaycrte , e nelieve that an Id

Nouveau compiler to generate code for a niult-hreaded \von Ncumann compult "ill b significantlymore complex than the current compiler 1411 wihich generates fine gi am dataflow graphs for the MITTagged-Token dataiflow machine. Thtis dataflow computers, in addition to pi-oviding solutions to thefundamental hardware issues raised in this paper, also have comIpiler- tchnology to eXplOIL their fullU potenitial.

Acknowledgment

The authors w ish to thank David Culler for valuable discussions; on much of' the suic' matter oIf thispaiper. particulart\ Load/Store architectures anid the structure of' the Cray machines. Memrbers of theComputaition Str-uctures Group have developed mnan\ iools, withiotit which the anai\ sis of the Simple code'Nsould have ben-- impossible. InI particular, we would like to thank Ken Trauh for the ID Compiler andIDavid Culler and [)inarte Mlorais for GITA, This paper has benefited fromn nume-irous discussions with

- pe~ople both inside and outside MIT. We wvish to thank Nata'lie Tarbet, Ke,, Fraub. D--Od Culler. Vinod

Ka Ki hail a nd R ish i N. ur Ni Kh i I fo r s u gges t ion s to i mpirove ti s mnan usc ript.

-'FS

11 *7 71 - . _.W

References

\-vind and R. E. Bryant. Design Considerations for a Partial Equation Machine. Proccedings of'>.'c~~CComputer Information Exchange Meeting, Lawrence Lifv'rr ore Laboratory, Livermore, CA,

;optemhcr. 1979. pp. 94-102.

.Arvind and D. E. Culler. "Datafiow Architectures". Annual k'vitcws ?f Computer ScienceU 1(1986),

3. Arvind, D. E. Culler, R. A. lannucci, V. Kathail, K. Pingali, and R F ['fiomias. The Tagged Tokenoataflow Architecture. Internal report. (including architectural revisions of October, 1983).

4. Arvind and K. P. Gostelow. "The U-Interpreter'. C'omputer 15. 2 f'etiruar. 19 8 42-19.

5. Arvind and R. A. laninucci. Instruction Set Definition for a Ta"',ed-'Token Data Hlow Machine.Computation Structures Group Memo 2 12-3. Laboratorv for Computer Science, MIT, Cambridge, Mass.,

* . (Uunbrid Lo, MA 02 139, December, 198 1.

6. Arvind and R. S. Nikhil. Executing a Program on the I"1 Tagged-Token Dataflow Architecture.Proc. PARLE, (Parallel Architectures and Languages ilurope1 . Iindhoven, The Ncthcrlands, June, 1987.

7. Block, E. The Engineering Design of the STRETCH Computer. Proceedings of the EJCC, 1959, pp.48-59.

8. Buehrer, R. and K. Ekanadham. Dataflow Principles in Multi -processor SN stems. ETH, Zurich, and0 Research Division, Yorktown Heights, IBM Corporation. July, 1986.

9. Burks, A., H. H. Goldstine, and J. von Neumann. "Preliminary Discussion of the Logical Design of anElectronic Instrument, Part 2". Datamation 8, 10 (October 1962). 30-41.

10. Censier, L. M. and P. Feautrier. "A New Solution to the Coherence Problems in MulticacheSv.'.IEEE Transactions on Computers C-27, 12 (IDecemhcr 1978, It 1l2-l111.

11. Clack, C. and Pevton-Jones, S. L. The Four-Stroke Reduction Engine. Proceedings of the 1986.. ', Conference on Lisp and Functional Programming. Associattion I-r Computing Machinery, August,

-~~*pp 2 2o- 232.

12. 1 eic', W. P.. C. P. I end ri ckson, and T. E, RudyN. T]he S I NII (:'ode. I nte nial Report57,~ Lawrence Livermore Laboratory, LNMIivr!,re.VA I ChrUMaR. 1978.

*I.;. ).irl ington, J. and M. Reeve. ALICE: A NMulti-ProtcN'tr Reduction Macine for the Parallel*~~~ . ... u n Applicative Languages. Proceedings of tho:) I ( Onlicrence on Functional Programming

I- . i ingL du, nd Com putler A rchi tcctu re. Po rt smoInuth. N\I . 1 1 pp 56

14. 1)crirt. 13 LeIcture Notes in Ci~npit rI' *'ic( . iw ' ~ cr"Ion ol a Data Flo~L.- 1 .; T~ g In Prii Lm ii % I\ uatr /' p ' . . / :dr 1(a Pr> 'i.,r(lflhli(21(n,

- l~c~i15. J B DiaaFlo,-x Supcr ,,nriOLr\ 1 ', .:. 1I \..P) 41%-50 18(,

- 10. 1 .k j P' J C (1hu. \ H r~A % ' i IIw : '-.1 \V C \F'SNseI I* Pr- '.. .. iilC\ ' the' JHCJ ' .1)9 pp ',I)A

1. I 11cr. J \ ( !tlicl'. C P Rkwx \ I i, \1 ler A., WI l n

w-- i.. I\l lrir.\r ~ . rn~hI'ceig

ii .- i,,~,Ii ,,(,,,.nII,,.lr Ill-

* -23-

18. Ellis, J. R.. Bulldog: a Compiler for VLIW Architectures. The MIT Press, 1986.

19. Fisher, J. A. Very Long Instruction Word Architectures and the ELI-512. Proc. of the 10 1h,International Symposium on Computer Architecture, IEEE Computer Society, June, 1983.

20. Gajski, D. D. & J-K. Peir. "Essential Issues in Multiprocessor Systems". Computer 18, 6 (June1985), 9-27.

21. Gurd, J. R., C. C. Kirkham, and I. Watson. "The Manchester Prototype Dataflow Computer".Communications of ACM 28, 1 (January 1985), 34-52.

22. Hennessey, J. L. "VLSI Processor Architecture". IEEE Transactions on Computers C-33, 12, -. (December 1984), 1221-1246.

23. Hiraki, K., S. Sekiguchi, and T. Shimada. System Architecture of a Dataflow Supercomputer.Computer Systems Division, Electrotechnical Laboratory, Japan, 1987.

24. lannucci, R. A. A Dataflow / von Neuamnn Hybrid Architecture. Ph.D. Th., Dept. of ElectricalEngineering and Computer Science, MIT, Cambridge, Mass., (in preparation) 1987.

25. Jordan, H. F. Performance Measurement on HEP - A Pipelined MIMD Computer. Proceedings ofthe 10th Annual International Symposium On Computer Architecture, Stockholm, Sweden, June, 1983,pp. 207-212.

* •26. Kuck, D., E. Davidson, D. Lawrie, and A. Sameh. "Parallel Supercomputing Today anQ the CedarApproach". Science Magazine 231 (February 1986), 967-974.

27. Lampson, B. W. and K. A. Pier. A Processor for a High-Performance Personal Computer. XeroxPalo Alto Research Center, January, 1981.

28. Li, Z. and W. Abu-Sufah. A Technique for Reducing Synchronization Overhead in Large ScaleMultiprocessors. Proc. of the 12th, International Symposium on Computer Architecture, June, 1985, pp.284-291.

29. Moon, D. A. Architecture of the Symbolics 3600. Proceedings of the 12th Annual InternationalSymposium On Computer Architecture, Boston, June, 1985, pp. 76-83.

30. Nikhil, R. S., K. Pingali, and Arvind. Id Nouveau. Computation Structures Group Memo 265,Laboratory for Computer Science, MIT, Cambridge, Mass., Cambridge, MA 02139, July, 1986.31. Papadopoulos, G. M. Implementation of a General Purpose Dataflow Multiprocessor. Ph.D. Th.,Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, Mass., (in preparation) 1987.

32. Patterson, D. A. "Reduced Instruction Set Computers". Communications of ACM 28, 1 (January1985), 8-21.

33. Pfister, G. F., W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe,E. A. Melton, V. A. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3):Introduction and Architecture. Procecdings of the 1985 International Conference on Parallel Processing,Institute of Electrical and Electronics Engineers, Piscataway, N. J., 08854, August, 1985, pp. 764-771.

* 34. Radin, G. The 801 Minicomputer. Proceedings of the Symposium on Architectural Support forProgramming Languages and Operating Systems, ACM, March, 1982.

35. Rau, B., D. Glaeser, and E. Greenwalt. Architectural Support for the Efficient Generation of Code- . for Horizontal Architectures. Proceedings of the Symposium on Architectural Support for Programming

Languages and Operating Systems, March, 1982. Same as Computer Architecture News 10,2 andSIGPLAN Notices 17,4.

-24

36. Rettberg, R., C. Wyman, D. Hunt, M. Hoffman, P. Carvev b. H ' . 2ark, .. Vv1. r:x:I'Development of a Voice Funnel System: Design Repolt. 40'98, Boit -. , inck arid N,_ v rima Inc, August,

1979.

3'. Russell, R. M. "The CRAY- I Computer System". Communicanron- ,fACM 2,' ( i r, 1978.63-72.

38. Seitz. C. M. "The Cosmic Cube". Communications.'f IACM 2S, 1 Uanuar 1985), 22-33

39. Smith, B. J. A Pipelined, Shared Resource MIMD Computer. Proceedings of the 1978 lr.ternatioaflConference on Parallel Processing, 1978, pp. 6-8.

40. Thornton. J. E. Parallel Operations in the Control Data 4f(tX). r,.ccd:ngs oi tic J'.;J -, 1964, pp33-39.

41. Traub. K. R. A Compiler for the MIT Tagged-Token Dataflow Architecture S.M. Thcsis.Technical Report 370, Laboratory for Computer Science, MIT, Cambridge, Mass., Cambridge, MA02139, AUGUST, 1986.

42. ALTO: A Personal Computer System - Hardware Manual. Xerox Palo Alto Rec:arch r.,nter, 13aloAlto, California, 94304, 1979.

0

.

, . . . . . -.-. .-;. - . . . . - ., . . . .. . . -, - - - ,. - .....- ,-.,.

CFIILITR I L' I N L1 I

2 Copies:-:7?>tion ProcessinQ Techniques Office

7c vancued Research Projects Acencysn 1-oulevard

VA 22209

ffce of Naval Research 2 Conie's323 h Quincy Street~.ri~n VA 22217

Dr. R. Grafton, Code 433

DietoCode 2627 6 CopiesS1Research Laborator-Y

~i Zn r- =, DC 20375

Dcfe7nse Technical Information Center 12 Cop:iesCancron Stationieya: ri, 22314

N~ir~1Science Foundation 2 ComiesSor Comnutina Activities

80?' C. Street, N\.w.Vi~~:.onDC 210550

Attn: roacDirector

D.L.F. Poyce, Code 38 1 Copy

R~ escarclh Department

G. Honper, US NP) Copy

oa r 17nnt of the N av y~asfl~~ton, )C2034

% "

®r

SF

/174TF/*

9~ %

Documents

-A191 TMG FUNDAMENTAL ISSUES IN NULTIPROCESSIMG(U