Parallel and Distributed Systems


``... means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.''

Moore attributed this doubling rate to exponential behavior of die sizes, finer minimum dimensions, and ``circuit and device cleverness''.

In 1975, he revised this law as follows:

``There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions.''

He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate.

If one is to buy into Moore's law, the question still remains - how does one translate transistors into useful OPS (operations per second)?

The logical recourse is to rely on parallelism, both implicit and explicit.

Most serial (or seemingly serial) processors rely extensively on implicit parallelism.

We focus in this class, for the most part, on explicit parallelism.

The Memory/Disk Speed Argument: While clock rates of high-end processors have increased at roughly 40% per year over the past decade, memory and disk access times have improved far more slowly, making the memory system an increasingly important bottleneck.

The Data Communication Argument: As networks have matured, the Internet itself has become a viable computing platform, a view exploited by applications such as SETI@home and Folding@home.


In many other applications (typically databases and data mining) the volume of data is such that it cannot be moved.

Any analyses on this data must be performed over the network using parallel techniques.

Scope of Parallel Computing Applications

Parallelism finds applications in very diverse domains, for different motivating reasons.

    These range from improved application performance to cost considerations.

Applications in Engineering and Design: Design of airfoils (optimizing lift, drag, stability), internal combustion engines (optimizing charge distribution, burn), high-speed circuits (layouts for delays and capacitive and inductive effects), and structures (optimizing structural integrity, design parameters, cost, etc.).

Design and simulation of micro- and nano-scale systems.

Process optimization, operations research.

Scientific Applications: Functional and structural characterization of genes and proteins.

Advances in computational physics and chemistry have explored new materials, the understanding of chemical pathways, and more efficient processes.

Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes.

Weather modeling, mineral prospecting, flood prediction, etc., are other important applications.

Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.

Commercial Applications: Some of the largest parallel computers power Wall Street!

Data mining and analysis for optimizing business and marketing decisions.

Large scale servers (mail and web servers) are often implemented using parallel platforms.

Applications such as information retrieval and search are typically powered by large clusters.

    Applications in Computer Systems


Network intrusion detection, cryptography, and multiparty computations are some of the core users of parallel computing techniques.

Embedded systems increasingly rely on distributed control algorithms.

A modern automobile consists of tens of processors communicating to perform complex tasks for optimizing handling and performance.

Conventional structured peer-to-peer networks impose overlay networks and utilize algorithms directly from parallel computing.

Organization and Contents of this Course: Fundamentals: This part of the class covers basic parallel platforms, principles of algorithm design, group communication primitives, and analytical modeling techniques.

Parallel Programming: This part of the class deals with programming using message passing libraries and threads.

Parallel Algorithms: This part of the class covers basic algorithms for matrix computations, graphs, sorting, discrete optimization, and dynamic programming.

    ===============xxxxxxxxxxx===============

Parallel Computing Platforms

Topic Overview: Implicit Parallelism: Trends in Microprocessor Architectures; Limitations of Memory System Performance; Dichotomy of Parallel Computing Platforms; Communication Model of Parallel Platforms; Physical Organization of Parallel Platforms; Communication Costs in Parallel Machines; Messaging Cost Models and Routing Mechanisms.


Implicit Parallelism: Trends in Microprocessor Architectures

Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.

Pipelining and Superscalar Execution: Pipelining overlaps various stages of instruction execution to achieve performance. At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched. This is akin to an assembly line for the manufacture of cars. Pipelining, however, has several limitations. The speed of a pipeline is eventually limited by the slowest stage.

For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors). However, in typical program traces, every 5-6th instruction is a conditional jump! This requires very accurate branch prediction. The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed. One simple way of alleviating these bottlenecks is to use multiple pipelines. The question then becomes one of selecting these instructions.

Superscalar Execution: An Example


Example of a two-way superscalar execution of instructions.


In the above example, there is some wastage of resources due to data dependencies.

The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.

Superscalar Execution: Scheduling of instructions is determined by a number of factors:

• True Data Dependency: The result of one operation is an input to the next.

Due to limited parallelism in typical instruction traces, dependencies, or the inability of the scheduler to extract parallelism, the performance of superscalar processors is eventually limited. Conventional microprocessors typically support four-way superscalar execution.

Very Long Instruction Word (VLIW) Processors: The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design. To address this issue, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently. These instructions are packed and dispatched together, and thus the name very long instruction word. This concept was used with some commercial success in the Multiflow Trace machine (circa 1984). Variants of this concept are employed in the Intel IA64 processors.

Very Long Instruction Word (VLIW) Processors: Considerations: Issue hardware is simpler. The compiler has a bigger context from which to select co-scheduled instructions. Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative. Branch and memory prediction is more difficult. VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical. Typical VLIW processors are limited to 4-way to 8-way parallelism.

Limitations of Memory System Performance: The memory system, and not processor speed, is often the bottleneck for many applications. Memory system performance is largely captured by two parameters, latency and bandwidth. Latency is the time from the issue of a memory request to the time the data is available at the processor. Bandwidth is the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency: It is very important to understand the difference between latency and bandwidth. Consider the example of a fire hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. Once the water starts flowing, the rate at which the hydrant delivers water (in gallons per second) is the bandwidth of the system. If you want immediate response from the hydrant, it is important to reduce latency. If you want to fight big fires, you want high bandwidth.

Memory Latency: An Example


Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (and no caches), fetching one word per access. Every memory request then stalls the processor for 100 cycles, so a computation such as a dot product, which performs one floating point operation per memory access, proceeds at only one FLOP every 100 ns, regardless of the processor's peak rating.


    "ata reuse is critical for cache performance.

Impact of Memory Bandwidth: Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units.

Memory bandwidth can be improved by increasing the size of memory blocks. The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

Impact of Memory Bandwidth: Example

Consider the same setup as before, except that in this case the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:

• Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.

• This is because a single memory access fetches four consecutive words in the vector. Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.

Impact of Memory Bandwidth: It is important to note that increasing block size does not change the latency of the system. Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks. In practice, such wide buses are expensive to construct. In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

Impact of Memory Bandwidth

The above examples clearly illustrate how increased bandwidth results in higher peak computation rates. The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference). If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example. Consider the following code fragment:

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

The code fragment sums the columns of the matrix b into the vector column_sum.

Impact of Memory Bandwidth: Example. The vector column_sum is small and easily fits into the cache.

The matrix b is accessed in column order.


The strided access results in very poor performance.

Impact of Memory Bandwidth: Example

We can fix the above code as follows:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order, and performance can be expected to be significantly better.
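To make the effect concrete, here is a minimal, self-contained C sketch (not part of the original notes; the matrix size, repetition count, and use of clock() are arbitrary choices) that times both traversal orders over the same matrix b:

#include <stdio.h>
#include <time.h>

#define N    1000
#define REPS 100      /* repeat to get a measurable time (illustrative) */

static double b[N][N];
static double column_sum[N];

int main(void) {
    int i, j, r;
    clock_t t0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            b[i][j] = i + j;

    /* Column-order traversal: strided accesses to b (poor spatial locality). */
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < N; i++) {
            column_sum[i] = 0.0;
            for (j = 0; j < N; j++)
                column_sum[i] += b[j][i];
        }
    printf("column-order: %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* Row-order traversal: consecutive accesses to b (good spatial locality). */
    t0 = clock();
    for (r = 0; r < REPS; r++) {
        for (i = 0; i < N; i++)
            column_sum[i] = 0.0;
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                column_sum[i] += b[j][i];
    }
    printf("row-order:    %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    return 0;
}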

Memory System Performance: Summary. The series of examples presented in this section illustrates the following concepts:

• Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.

• The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.

• Memory layouts and appropriately organizing computation can make a significant impact on spatial and temporal locality.

Alternate Approaches for Hiding Memory Latency: Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of three possible ways:


Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.


• we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;

• we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or

• we access a whole bunch of pages in one go - amortizing the latency across the various accesses.

The first approach is called prefetching, the second multithreading, and the third corresponds to spatial locality in accessing memory words.

Multithreading for Latency Hiding: A thread is a single stream of control in the flow of a program.

We illustrate threads with a simple example:

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

Multithreading for Latency Hiding: Example. In the code, the first instance of this function accesses a pair of vector elements and waits for them. In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on. After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation. In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a computation.
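One way to make the create_thread idea concrete on a conventional system is with POSIX threads. The sketch below is only an illustration under that assumption - the matrix a, the vector b, and the per-row dot_product thread function are stand-ins defined here so the example is self-contained (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define N 4          /* number of rows / threads (illustrative) */
#define M 8          /* row length (illustrative)               */

static double a[N][M], b[M], c[N];

struct task { int row; };              /* argument handed to each thread */

static void *dot_product(void *arg) {
    int i = ((struct task *)arg)->row;
    double sum = 0.0;
    for (int j = 0; j < M; j++)        /* each thread works on one row */
        sum += a[i][j] * b[j];
    c[i] = sum;
    return NULL;
}

int main(void) {
    pthread_t tid[N];
    struct task t[N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = i + 1;
    for (int j = 0; j < M; j++)
        b[j] = 1.0;

    /* One thread per dot-product: the analogue of create_thread(...). */
    for (int i = 0; i < N; i++) {
        t[i].row = i;
        pthread_create(&tid[i], NULL, dot_product, &t[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < N; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}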

Multithreading for Latency Hiding: The execution schedule in the previous example is predicated upon two assumptions: the memory system is capable of servicing multiple outstanding requests, and the processor is capable of switching threads at every cycle. It also requires the program to have an explicit specification of concurrency in the form of threads. Machines such as the HEP and Tera rely on multithreaded processors that can switch the context of execution in every cycle. Consequently, they are able to hide latency effectively.

Prefetching for Latency Hiding: Misses on loads cause programs to stall.


Why not advance the loads so that by the time the data is actually needed, it is already there? The only drawback is that you might need more space to store the advanced loads. However, if the advanced loads are overwritten, we are no worse off than before.

Tradeoffs of Multithreading and Prefetching: Multithreading and prefetching are critically impacted by the memory bandwidth. Consider the following example:

• Consider a computation running on a machine with a 1 GHz clock, a 4-word cache line, single cycle access to the cache, and 100 ns latency to DRAM.


A typical SIMD architecture (a) and a typical MIMD architecture (b).

SIMD Processors: Some of the earliest parallel computers, such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1, belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an ``activity mask'', which determines whether a processor should participate in a computation or not.

Conditional Execution in SIMD Processors


MIMD Processors: In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.

SIMD-MIMD Comparison: SIMD computers require less hardware than MIMD computers (a single control unit). However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles. Not all applications are naturally suited to SIMD processors. In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.

Communication Model of Parallel Platforms: There are two primary forms of data exchange between parallel tasks - accessing a shared data space and exchanging messages.

Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. Platforms that support messaging are also called message passing platforms or multicomputers.

Shared-Address-Space Platforms: Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space.


Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.


If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

NUMA and UMA Shared-Address-Space Platforms

The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance. Programming these platforms is easier since reads and writes are implicitly visible to other processors. However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).


Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.


Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non-cache-coherent shared-address-space machines.

Shared-Address-Space vs. Shared-Memory Machines: It is important to note the difference between the terms shared address space and shared memory. We refer to the former as a programming abstraction and to the latter as a physical machine attribute. It is possible to provide a shared address space using a physically distributed memory.

Message-Passing Platforms: These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.
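As an illustration of the send/receive style (not taken from these notes), the minimal MPI program below ships one integer from process 0 to process 1; it assumes an MPI installation and is built with mpicc and run with two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data owned by process 0   */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);     /* explicit message exchange */
    }

    MPI_Finalize();
    return 0;
}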

Message Passing vs. Shared Address Space Platforms: Message passing requires little hardware support, other than a network. Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Physical Organization of Parallel Platforms: We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.


Architecture of an Ideal Parallel Computer: What does concurrent write mean, anyway?

• Common: write only if all values are identical.
• Arbitrary: write the data from a randomly selected processor.
• Priority: follow a predetermined priority order.
• Sum: write the sum of all data items.

Physical Complexity of an Ideal Parallel Computer: Processors and memories are connected via switches. Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp). Clearly, for meaningful values of p and m, a true PRAM is not realizable.


Interconnection Networks: Switches map a fixed number of inputs to outputs. The total number of ports on a switch is the degree of the switch. The cost of a switch grows as the square of its degree, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.

Interconnection Networks: Network Interfaces: Processors talk to the network via a network interface. The network interface may hang off the I/O bus or the memory bus. In a physical sense, this distinguishes a cluster from a tightly coupled multicomputer. The relative speeds of the I/O and memory buses impact the performance of the network.

Network Topologies: A variety of network topologies have been proposed and implemented. These topologies trade off performance for cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Network Topologies: Buses: Some of the simplest and earliest parallel machines used buses.

All processors access a common bus for exchanging data. The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium. However, the bandwidth of the shared bus is a major bottleneck. Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

Network Topologies: Buses


Bus-based interconnects: (a) with no local caches; (b) with local memory/caches.

Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Crossbars


A completely non-blocking crossbar network connecting p processors to b memory banks.

A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.


Network Topologies: Crossbars: The cost of a crossbar of p processors grows as O(p²). This is generally difficult to scale for large values of p. Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.

Network Topologies: Multistage Networks: Crossbars have excellent performance scalability but poor cost scalability. Buses have excellent cost scalability, but poor performance scalability. Multistage interconnects strike a compromise between these extremes.

Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network.


Network Topologies: Multistage Omega Network

One of the most commonly used multistage interconnects is the Omega network. This network consists of log p stages, where p is the number of inputs/outputs. At each stage, input i is connected to output j if:

    j = 2i              for 0 ≤ i ≤ p/2 − 1
    j = 2i + 1 − p      for p/2 ≤ i ≤ p − 1

Network Topologies: Multistage Omega Network

Each stage of the Omega network implements a perfect shuffle as follows:

    A perfect shuffle interconnection for eight inputs and outputs.

Network Topologies: Multistage Omega Network: The perfect shuffle patterns are connected using 2 × 2 switches.

The switches operate in two modes - crossover or pass-through.


Two switching configurations of the 2 × 2 switch: (a) pass-through; (b) cross-over.

Network Topologies: Multistage Omega Network

A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An omega network has p/2 · log p switching nodes, and the cost of such a network grows as O(p log p).

Network Topologies: Multistage Omega Network - Routing: Let s be the binary representation of the source and d be that of the destination processor.


The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; otherwise, it switches to crossover. This process is repeated for each of the log p switching stages. Note that this is not a non-blocking switch.
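The routing rule above is easy to simulate. The sketch below is an illustration only (it assumes p is a power of two and simply prints each switch setting); it walks a message from source s to destination d through the log p stages:

#include <stdio.h>

/* Route a message through an Omega network with `stages` = log2(p) stages.
   At stage k the switch compares the k-th most significant bit of the
   source and destination addresses: equal bits -> pass-through,
   different bits -> crossover. */
static void omega_route(unsigned s, unsigned d, int stages) {
    for (int k = stages - 1; k >= 0; k--) {
        unsigned sbit = (s >> k) & 1u;
        unsigned dbit = (d >> k) & 1u;
        printf("stage %d: %s\n", stages - 1 - k,
               sbit == dbit ? "pass-through" : "crossover");
    }
}

int main(void) {
    omega_route(2u, 7u, 3);   /* 010 -> 111 in an 8-input network */
    return 0;
}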

Network Topologies: Multistage Omega Network - Routing

An example of blocking in the omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

Network Topologies: Completely Connected Network: Each processor is connected to every other processor. The number of links in the network scales as O(p²). While the performance scales very well, the hardware complexity is not realizable for large values of p. In this sense, these networks are static counterparts of crossbars.


Network Topologies: Completely Connected and Star Connected Networks

Example of (a) a completely-connected network of eight nodes; and (b) a star connected network of nine nodes.

Network Topologies: Star Connected Network

Every node is connected only to a common node at the center. The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck. In this sense, star connected networks are static counterparts of buses.

Network Topologies: Linear Arrays, Meshes, and k-d Meshes

In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring. A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. A further generalization to d dimensions has nodes with 2d neighbors. A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Linear Arrays


Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Network Topologies: Two- and Three-Dimensional Meshes

Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) a 3-D mesh with no wraparound.

Network Topologies: Hypercubes and their Construction


Construction of hypercubes from hypercubes of lower dimension.

Network Topologies: Properties of Hypercubes

The distance between any two nodes is at most log p. Each node has log p neighbors. The distance between two nodes is given by the number of bit positions at which the two nodes differ.
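Equivalently, the distance between two hypercube nodes is the Hamming distance of their labels; a small illustrative C helper:

#include <stdio.h>

/* Distance between hypercube nodes = number of differing bit positions. */
static int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int d = 0;
    while (x) {          /* count the set bits of a XOR b */
        d += x & 1u;
        x >>= 1;
    }
    return d;
}

int main(void) {
    printf("%d\n", hypercube_distance(0u, 7u));  /* 000 vs 111 -> 3 */
    return 0;
}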

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Network Topologies: Tree Properties: The distance between any two nodes is no more than 2 log p.


Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant called a fat tree fattens the links as we go up the tree. Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks

Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and of a hypercube is log p, and that of a completely connected network is O(1).

Bisection Width: The minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and of a tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.

Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
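For a few of these topologies, the formulas quoted above translate directly into code; the helper below is an illustrative sketch (p is assumed to be a power of two for the hypercube):

#include <stdio.h>

/* log2 of a power of two, by counting shifts. */
static int log2i(int p) { int d = 0; while ((1 << d) < p) d++; return d; }

/* Diameter and bisection width for three of the topologies quoted above. */
static void linear_array(int p, int *diam, int *bis)    { *diam = p - 1;    *bis = 1;         }
static void hypercube(int p, int *diam, int *bis)       { *diam = log2i(p); *bis = p / 2;     }
static void fully_connected(int p, int *diam, int *bis) { *diam = 1;        *bis = p * p / 4; }

int main(void) {
    int d, b, p = 64;   /* p is an illustrative machine size */
    linear_array(p, &d, &b);
    printf("linear array:         diameter %d, bisection width %d\n", d, b);
    hypercube(p, &d, &b);
    printf("hypercube:            diameter %d, bisection width %d\n", d, b);
    fully_connected(p, &d, &b);
    printf("completely connected: diameter %d, bisection width %d\n", d, b);
    return 0;
}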


Evaluating Static Interconnection Networks

Table: diameter, bisection width, arc connectivity, and cost (number of links) of static network topologies - completely-connected, star, complete binary tree, linear array, 2-D mesh (with and without wraparound), hypercube, and wraparound k-ary d-cube.

Evaluating Dynamic Interconnection Networks

Table: diameter, bisection width, arc connectivity, and cost (number of links) of dynamic network topologies - crossbar, omega network, and dynamic tree.



    Cache Coherence in Multiprocessor Systems

Interconnects provide basic mechanisms for data transfer. In the case of shared address space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network. The underlying technique must provide some guarantees on the semantics. This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule.

When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems: (a) invalidate protocol; (b) update protocol for shared variables.



Cache Coherence: Update and Invalidate Protocols: If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead. If two processors make interleaved tests and updates to a variable, an update protocol is better.

Both protocols suffer from false sharing overheads (two words that are not shared, but lie on the same cache line). Most current machines use invalidate protocols.

Maintaining Coherence Using Invalidate Protocols: Each copy of a data item is associated with a state.

One example of such a set of states is shared, invalid, and dirty. In the shared state, there are multiple valid copies of the data item (and therefore, an invalidate would have to be generated on an update). In the dirty state, only one copy exists and therefore, no invalidates need to be generated. In the invalid state, the data copy is invalid, and therefore a read generates a data request (and associated state changes).
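One way to picture these states is the toy transition function below. It is a simplified sketch of an invalidate protocol for a single cache line, not the exact protocol of the figure, and it ignores the bus transactions that would accompany each transition:

#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_WRITE } event_t;

/* Next state of one cache's copy of a line, for a simplified invalidate
   protocol: a local write makes the copy DIRTY (other copies would be
   invalidated); a remote write invalidates it; a read of an INVALID copy
   fetches the data and makes it SHARED. */
static line_state next_state(line_state s, event_t e) {
    switch (e) {
    case LOCAL_READ:   return (s == INVALID) ? SHARED : s;
    case LOCAL_WRITE:  return DIRTY;
    case REMOTE_WRITE: return INVALID;
    }
    return s;
}

int main(void) {
    line_state s = INVALID;
    s = next_state(s, LOCAL_READ);    /* INVALID -> SHARED  */
    s = next_state(s, LOCAL_WRITE);   /* SHARED  -> DIRTY   */
    s = next_state(s, REMOTE_WRITE);  /* DIRTY   -> INVALID */
    printf("final state: %d\n", s);
    return 0;
}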

Maintaining Coherence Using Invalidate Protocols

State diagram of a simple three-state coherence protocol.


Maintaining Coherence Using Invalidate Protocols

Example of parallel program execution with the simple three-state coherence protocol.

Snoopy Cache Systems: How are invalidates sent to the right processors? In snoopy caches, there is a broadcast medium that listens to all invalidates and read requests and performs appropriate coherence operations locally.


A simple snoopy bus based cache coherence system.

Performance of Snoopy Caches: Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic. If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local. If processors read and update data at the same time, they generate coherence requests on the bus - which is ultimately bandwidth limited.

Directory Based Systems: In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation. Why not send coherence requests to only those processors that need to be notified? This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.
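A directory entry can be pictured as a presence bit-vector plus a global state, roughly as in the sketch below; the field names and the 64-processor limit are illustrative assumptions, not part of the notes:

#include <stdint.h>
#include <stdio.h>

/* One directory entry per cache line: which of up to 64 processors
   hold a copy (presence vector) and the line's global state. */
typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE } dir_state;

typedef struct {
    uint64_t  presence;   /* bit i set => processor i has a copy */
    dir_state state;
} dir_entry;

/* On a write by processor p, the directory tells us exactly which
   processors need an invalidate: every presence bit except p's. */
static uint64_t invalidate_targets(const dir_entry *e, int p) {
    return e->presence & ~(1ull << p);
}

int main(void) {
    dir_entry e = { (1ull << 0) | (1ull << 3), SHARED_STATE };
    printf("invalidate mask: 0x%llx\n",
           (unsigned long long)invalidate_targets(&e, 0));  /* bit 3 only */
    return 0;
}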

Directory Based Systems


Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

Performance of Directory Based Schemes: The need for a broadcast medium is replaced by the directory. The additional bits to store the directory may add significant overhead. The underlying network must be able to carry all the coherence requests. The directory is a point of contention; therefore, distributed directory schemes must be used.

Communication Costs in Parallel Machines: Along with idling and contention, communication is a major overhead in parallel programs. The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.

Message Passing Costs in Parallel Computers: The total time to transfer a message over a network comprises the following:

• Startup time (ts): Time spent at the sending and receiving nodes (executing the routing algorithm, programming routers, etc.).

• Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.

• Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.

Store-and-Forward Routing: A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. The total communication cost for a message of size m words to traverse l communication links is

    tcomm = ts + (m tw + th) l

In most platforms, th is small and the above expression can be approximated by

    tcomm = ts + m l tw


Routing Techniques


Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.

Packet Routing: Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by

    tcomm = ts + l th + tw m

where the factor tw also accounts for overheads in packet headers.

Cut-Through Routing: Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence. A tracer message first programs all intermediate routers. All flits then take the same route. Error checks are performed on the entire message, as opposed to individual flits. No sequence numbers are needed.

Cut-Through Routing: The total communication time for cut-through routing is approximated by

    tcomm = ts + l th + tw m

This is identical to packet routing; however, tw is typically much smaller.

Simplified Cost Model for Communicating Messages


The cost of communicating a message between two nodes l hops away using cut-through routing is given by

    tcomm = ts + l th + tw m

In this expression, th is typically much smaller than ts and tw. For this reason, the second term can often be ignored, particularly when m is large, giving the simplified model tcomm = ts + tw m.


Embedding a Linear Array into a Hypercube: A linear array (or ring) of 2^d nodes can be embedded into a d-dimensional hypercube by mapping node i of the array onto node G(i, d) of the hypercube. The function G is called the binary reflected Gray code (RGC).
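The RGC has a compact closed form, G(i) = i XOR (i >> 1). The illustrative sketch below uses it to embed a ring of 2^d nodes and checks that ring neighbors land on hypercube neighbors (labels differing in exactly one bit):

#include <stdio.h>

static unsigned gray(unsigned i) { return i ^ (i >> 1); }   /* binary reflected Gray code */

static int popcount(unsigned x) { int c = 0; while (x) { c += x & 1u; x >>= 1; } return c; }

int main(void) {
    const int d = 3, p = 1 << d;       /* ring of 8 nodes, 3-D hypercube */
    for (int i = 0; i < p; i++) {
        unsigned u = gray(i), v = gray((i + 1) % p);
        /* dilation 1: consecutive ring nodes map to hypercube neighbors */
        printf("%d -> %u, differs from next in %d bit(s)\n", i, u, popcount(u ^ v));
    }
    return 0;
}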


(a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.

Embedding a Mesh into a Hypercube: A 2^r × 2^s wraparound mesh can be mapped to a 2^(r+s)-node hypercube by mapping node (i, j) of the mesh onto node G(i, r − 1) || G(j, s − 1) of the hypercube (where || denotes concatenation of the two Gray codes).

Embedding a Mesh into a Hypercube


(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.

Once again, the congestion, dilation, and expansion of the mapping is 1.

Embedding a Mesh into a Linear Array: Since a mesh has more edges than a linear array, we will not have an optimal congestion/dilation mapping. We first examine the mapping of a linear array into a mesh and then invert this mapping. This gives us an optimal mapping (in terms of congestion).

Embedding a Mesh into a Linear Array: Example


(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

Embedding a Hypercube into a 2-D Mesh: Each √p-node subcube of the hypercube is mapped to a √p-node row of the mesh. This is done by inverting the linear-array to hypercube mapping. This can be shown to be an optimal mapping.

Embedding a Hypercube into a 2-D Mesh: Example

Case Studies: The IBM Blue Gene Architecture

The hierarchical architecture of Blue Gene.


Case Studies: The Cray T3E Architecture

Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Case Studies: The SGI Origin 3000 Architecture


Architecture of the SGI Origin 3000 family of servers.

Case Studies: The Sun HPC Server Architecture


Architecture of the Sun Enterprise family of servers.