
18-742 Lecture 1: Multiprocessor Computer Architecture

Spring 2005
Prof. Babak Falsafi
http://www.ece.cmu.edu/~ece742

Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Reinhardt, Smith, and Singh of University of Illinois, Carnegie Mellon University, University of Wisconsin, Duke University, University of Michigan, and Princeton University.

[Figure: Alpha 21364 shared-memory multiprocessor]


Homework 0

• Due by this Friday
• My way to learn about you: your name and computer architecture background

No homeworks will be graded until Homework 0 is handed in!


Readings

• Reader 1
– The book will arrive later; there will be copies of Chapter 1 outside of A302 after class
– Chapter 9 of H,J&S will be on the web


18-742 Class Info

• Instructor: Professor Babak Falsafi
– URL: http://www.ece.cmu.edu/~babak
• Research interests:
– Bridging the processor/memory performance gap
– Nanoscale CMOS computer systems
– Analytic & simulation tools for design evaluation
• TA: Brian Gold ([email protected])
• Administrative Assistant: Matt Koeske (koeske@ece)
• Class info:
– URL: http://www.ece.cmu.edu/~ece742
» Username & password provided in class
– Newsgroup: cmu.ece.class.ece742 (use nntp.ece.cmu.edu)


18-742 Class Meeting Time

• Notice!
– Each lecture is 80 minutes long
– Class meets 3:00pm-4:20pm on MW
– Office hours: Mon. 2:00-3:00 & Fri. 3:00-4:00 pm
– Brian's office hours: Wed. 11:00-12:00 & Thu. 3:30-4:30


Who Should Take 18-742?

• Graduate students (MS/IMB/PhD)
1. Computer architects to be
2. Computer system designers
• Required background
– 18-741 (Advanced Computer Architecture)
• Course expectations:
– Heavily discussion oriented
– Will loosely follow text (Culler & Singh)
– With emphasis on cutting-edge issues/research
– Students will review a list of research papers (almost weekly)
– Independent research project (a major component)
– Individual feedback upon request (come to office hours!)


18-742: Components

• Text
– Parallel Computer Architecture: A Hardware/Software Approach
– Recommended: Readings in Computer Architecture
• Readers + review
– List of papers: classic + state of the art
– Short written review per paper
• Homework (individual write-up, group discussion okay)
• Project
– Mostly original research
– Groups of two
• Heavy use of web page for interaction!
– www.ece.cmu.edu/~ece742


Grading

• Grade breakdown

Project: 30%
Homework: 20%
Exam 1: 25%
Exam 2: 25%

• Participation + Discussion count toward the Project grade


Grading (Cont.)

• No late homeworks, no kidding
• Group studies are encouraged
• Group discussions are encouraged
• All homeworks must be the result of individual work (identify the persons you discussed the problem with)
• There is no tolerance for academic dishonesty. Please refer to the University Policy on cheating and plagiarism. Discussion and group studies are encouraged, but all submitted material must be the student's individual work (or, in the case of the project, individual group work).


Computer Architecture Curriculum

18-447 — Introduction to computer architecture
18-741 — Advanced computer architecture
18-742 — Multiprocessor architecture
18-743 — Power-aware architecture
18-744 — Microarchitecture design and implementation
18-747 — Advanced topics

• Related areas:
» Circuits: VLSI, digital circuit design, CAD
» Systems: compilers, OS, database systems, networks, embedded computing, fault-tolerant computing
» Evaluation: queuing theory, analysis of variance, confidence intervals, etc.


Computer Architecture Activity

Computer Architecture Lab at Carnegie Mellon (CALCM) @ www.ece.cmu.edu/CALCM

Send mail to calcm-list-request@ece with body: subscribe calcm-list

Seminars:
– CALCM weekly seminar
– LCS/SDI/Intel weekly seminar
– CSSI weekly seminar
– Computer systems lecture series


Outline

• Motivation & Applications

• Theory, History, & A Generic Parallel Machine

• UMA, NUMA, Examples, & More History

• Programming Models
– Shared Memory
– Message Passing
– Data Parallel

• Issues in Programming Models
– Function: naming, operations, & ordering
– Performance: latency, bandwidth, etc.


Motivation

• 18-741 is about computers with one processor

• This course uses N processors in a computer to get:
– Higher throughput via many jobs in parallel
– Improved cost-effectiveness (e.g., adding 3 processors may yield 4X throughput for 2X system cost)
– Lower latency from shrink-wrapped software (e.g., databases and web servers today, but more tomorrow)
– Lower latency through parallelizing your application (but this is hard)

• Need faster than today's microprocessor?
– Wait for tomorrow's microprocessor
– Use many microprocessors in parallel


Applications: Science and Engineering

• Examples
– Weather prediction
– Evolution of galaxies
– Oil reservoir simulation
– Automobile crash tests
– Drug development
– VLSI CAD
– Nuclear BOMBS!
• Typically model physical systems or phenomena
• Problems are 2D or 3D
• Usually require "number crunching"
• Involve "true" parallelism


Applications: Commercial

• Examples
– On-line transaction processing (OLTP)
– Decision support systems (DSS)
– "App servers"
• Involve data movement, not much number crunching
– OLTP has many small queries
– DSS has fewer, larger queries
• Involve throughput parallelism
– Inter-query parallelism for OLTP
– Intra-query parallelism for DSS


Applications: Multi-media/home

• Examples
– Speech recognition
– Data compression/decompression
– 3D graphics
• Will become ubiquitous
• Involves everything (crunching, data movement, true parallelism, and throughput parallelism)


In Theory

• Sequential
– Time to sum n numbers? O(n)
– Time to sort n numbers? O(n log n)
– What model? RAM
• Parallel
– Time to sum? Tree for O(log n)
– Time to sort? Non-trivially O(log n)
– What model?
» PRAM [Fortune & Wyllie, STOC '78]
» P processors in lock-step
» One memory (e.g., CREW for concurrent read, exclusive write)

[Figure: a sorting network alternating compare-exchange steps 1<->2 and 3<->4 with step 2<->3]
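To make the tree concrete, here is a minimal C sketch of the O(log n) summation (my illustration; the slides give no code). Each pass combines disjoint pairs, so on a PRAM with n/2 processors every pass is one time step and there are log2(n) passes:

#include <stdio.h>

/* Tree summation: log2(n) passes; each pass's adds touch disjoint
   pairs and could all run in parallel on a PRAM.
   n is assumed to be a power of two. */
double tree_sum(double *x, int n) {
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];   /* disjoint pairs: no conflicts */
    return x[0];
}

int main(void) {
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%f\n", tree_sum(x, 8));  /* prints 36.000000 */
    return 0;
}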


Perfection: the PRAM Model

• Parallel RAM
• Fully shared memory
• Unit latency
• No memory contention (unrestricted bandwidth)
• Data placement unimportant

[Figure: processors reach main memory through an interconnection network, with unit latency and no contention anywhere]


Perfection is NOT Achievable

• Latencies grow as the system size grows

• Bandwidths are restricted by memory organizations and interconnection networks

• Dealing with reality leads to the division between UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access)


But in Practice, How Do You

• Name a datum (e.g., all say a[i])?

• Communicate values?

• Coordinate and synchronize?

• Select processing node size (from a few-bit ALU to a PC)?

• Select number of nodes in system?


Historical View

[Figure: three machine organizations, each built from processor (P), memory (M), and I/O (IO) nodes, differing in where the parts join]

Join at:         Program with:
I/O (network)    Message Passing
Memory           Shared Memory
Processor        (Dataflow/Systolic), Single-Instruction Multiple-Data (SIMD) ==> Data Parallel


Historical View, cont.

• Machine --> Programming Model
– Join at network, so program with message passing model
– Join at memory, so program with shared memory model
– Join at processor, so program with SIMD or data parallel
• Programming Model --> Machine
– Message-passing programs on message-passing machine
– Shared-memory programs on shared-memory machine
– SIMD/data-parallel programs on SIMD/data-parallel machine
• But
– Isn't hardware basically the same? Processors, memory, & I/O?
– Why not have a generic parallel machine and program with the model that fits the problem?


A Generic Parallel Machine

• Separation of programming models from architectures

• All models require communication

• Node with processor(s), memory, communication assist

[Figure: Node 0 through Node 3, each with processor P, cache $, memory Mem, and communication assist CA, connected by an interconnect]


Today’s Parallel Computer Architecture

• Extension of traditional computer architecture to support communication and cooperation

– Communications architecture

[Figure: layers of the communications architecture]
– Programming model (user level): multiprogramming, shared memory, message passing, data parallel
– Communication abstraction: library and compiler at user level; operating system support at system level
– Below the hardware/software boundary: communication hardware on the physical communication medium


Simple Problem

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

• How do I make this parallel?


Simple Problem

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]

• Split the loops
» Independent iterations

for i = 1 to N
  A[i] = (A[i] + B[i]) * C[i]
for i = 1 to N
  sum = sum + A[i]
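As a hedged illustration of the split (OpenMP is my choice of API here; the slides prescribe none), the first loop parallelizes directly since its iterations are independent, while the second becomes a reduction:

#include <stdio.h>

#define N 16

int main(void) {
    float A[N], B[N], C[N], sum = 0;
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }

    #pragma omp parallel for                   /* independent iterations */
    for (int i = 0; i < N; i++)
        A[i] = (A[i] + B[i]) * C[i];

    #pragma omp parallel for reduction(+:sum)  /* parallelized sum */
    for (int i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %f\n", sum);                 /* (1+2)*3 * 16 = 144 */
    return 0;
}

Compile with cc -fopenmp; without the flag the pragmas are ignored and the code runs sequentially, which is exactly the split-loop version above.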

• Data flow graph?


Data Flow Graph

[Figure: data flow graph. For each i, A[i] and B[i] feed a + node; that sum and C[i] feed a * node; the * outputs are combined by a chain of + nodes into the final sum.]

2 + N-1 cycles to execute on N processors (all N adds fire in one step, all N multiplies in the next, and the serial accumulation chain takes N-1 more).
What assumptions?


Partitioning of Data Flow Graph

[Figure: the same data flow graph partitioned across processors; a global synch separates the fully parallel (A[i] + B[i]) * C[i] stage from the final accumulation of sum]


Programming Model

• Historically, Programming Model == Architecture
• Thread(s) of control that operate on data
• Provides a communication abstraction that is a contract between hardware and software (a la ISA)

Current models:
• Shared Memory
• Message Passing
• Data Parallel (Shared Address Space)
• (Data Flow)


Shared (Physical) Memory

• Communication, sharing, and synchronization with store / load on shared variables

• Must map virtual pages to physical page frames

• Consider OS support for good mapping

[Figure: per-process virtual address spaces for P0 through Pn, each with a private portion (P0 private, ..., Pn private) and a shared portion; loads and stores to the shared portion resolve to common physical addresses in the machine physical address space]
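A minimal sketch of such store/load communication, assuming POSIX mmap and fork (not from the slides): both processes' virtual pages map to the same physical frame, so a store by one is seen by a load in the other.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* One shared page: both processes' virtual pages map to the same frame. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0;
    if (fork() == 0) {      /* child: communicate with a plain store */
        *shared = 42;
        return 0;
    }
    wait(NULL);             /* parent: load after the child has stored */
    printf("loaded %d from shared memory\n", *shared);
    return 0;
}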


Shared (Physical) Memory on Generic MP

[Figure: the generic four-node machine with the physical address space partitioned across nodes: Node 0 holds addresses 0 to N-1, Node 1 holds N to 2N-1, Node 2 holds 2N to 3N-1, and Node 3 holds 3N to 4N-1]

Keep private data and frequently used shared data on same node as computation


Return of The Simple Problem

private int i, my_start, my_end, mynode;
shared float A[N], B[N], C[N], sum;
for i = my_start to my_end
  A[i] = (A[i] + B[i]) * C[i]
GLOBAL_SYNCH;
if (mynode == 0)
  for i = 1 to N
    sum = sum + A[i]

• Can run this on any shared memory machine
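For instance, a runnable sketch with POSIX threads (an assumption; the slide names no API): GLOBAL_SYNCH becomes a pthread barrier, and NPROC and the initial values are illustrative.

#include <pthread.h>
#include <stdio.h>

#define N 16
#define NPROC 4

static float A[N], B[N], C[N], sum;   /* shared data */
static pthread_barrier_t barrier;

static void *worker(void *arg) {
    long mynode = (long)arg;
    int my_start = mynode * (N / NPROC);
    int my_end = my_start + N / NPROC;
    for (int i = my_start; i < my_end; i++)
        A[i] = (A[i] + B[i]) * C[i];  /* independent iterations */
    pthread_barrier_wait(&barrier);   /* GLOBAL_SYNCH */
    if (mynode == 0)                  /* node 0 does the serial sum */
        for (int i = 0; i < N; i++)
            sum += A[i];
    return NULL;
}

int main(void) {
    pthread_t t[NPROC];
    for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
    pthread_barrier_init(&barrier, NULL, NPROC);
    for (long p = 0; p < NPROC; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < NPROC; p++)
        pthread_join(t[p], NULL);
    printf("sum = %f\n", sum);        /* (1+2)*3 * 16 = 144 */
    return 0;
}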


Message Passing Architectures

• Cannot directly access memory on another node

• IBM SP-2, Intel Paragon

• Cluster of workstations

[Figure: the generic four-node machine again, but every node's memory spans only local addresses 0 to N-1; there is no shared physical address space]


Message Passing Programming Model

• User level send/receive abstraction
– Local buffers (x, y), processes (Q, P), and tag (t)
– Naming and synchronization

[Figure: process P executes Send x, Q, t and process Q executes Recv y, P, t; the two match on process and tag, copying buffer x in P's local address space into buffer y in Q's]


The Simple Problem Again

int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;
for i = 1 to N/P
  A[i] = (A[i] + B[i]) * C[i]
  sum = sum + A[i]
if (mynode != 0)
  send(sum, 0);
if (mynode == 0)
  for i = 1 to P-1
    recv(tmp, i)
    sum = sum + tmp

• Send/Recv communicates and synchronizes
• P processors
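The same SPMD program sketched with MPI (my choice of library; the slide's send/recv are abstract). Each of the P ranks owns N/P elements and ships its partial sum to rank 0:

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 4                      /* N/P elements per node */

int main(int argc, char **argv) {
    int mynode, P;
    float A[NLOCAL], B[NLOCAL], C[NLOCAL], sum = 0, tmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    for (int i = 0; i < NLOCAL; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
    for (int i = 0; i < NLOCAL; i++) {
        A[i] = (A[i] + B[i]) * C[i];
        sum += A[i];                  /* local partial sum */
    }
    if (mynode != 0) {                /* send my partial sum to node 0 */
        MPI_Send(&sum, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    } else {                          /* node 0 collects P-1 partials */
        for (int src = 1; src < P; src++) {
            MPI_Recv(&tmp, 1, MPI_FLOAT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += tmp;
        }
        printf("sum = %f\n", sum);
    }
    MPI_Finalize();
    return 0;
}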


Separation of Architecture from Model

• At the lowest level, SM sends messages
– HW is specialized to expedite read/write messages

• What programming model / abstraction is supported at user level?

• Can I have shared-memory abstraction on message passing HW?

• Can I have message passing abstraction on shared memory HW?


Data Parallel

• Programming Model
– Operations are performed on each element of a large (regular) data structure in a single step
– Arithmetic, global data transfer

• Processor is logically associated with each data element

• Early architectures directly mirrored programming model

– Many bit-serial processors

• Today we have FP units and caches on microprocessors

• Can support data parallel model on SM or MP architecture


The Simple Problem Strikes Back

Assuming we have N processors:
A = (A + B) * C
sum = global_sum(A)

• Language supports array assignment
• Special HW support for global operations
• CM-2: bit-serial processors
• CM-5: 32-bit SPARC processors
– Message Passing and Data Parallel models
– Special control network

• Chapter 11…..
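As a sketch of how global_sum can be supported without special hardware, MPI's collective MPI_Allreduce plays in software the role the CM-5's control network played in hardware (this mapping is my illustration, not from the slides):

#include <mpi.h>
#include <stdio.h>

#define NLOCAL 4

int main(int argc, char **argv) {
    float A[NLOCAL], B[NLOCAL], C[NLOCAL], local = 0, sum;
    MPI_Init(&argc, &argv);
    for (int i = 0; i < NLOCAL; i++) { A[i] = 1; B[i] = 2; C[i] = 3; }
    for (int i = 0; i < NLOCAL; i++) {   /* A = (A + B) * C, elementwise */
        A[i] = (A[i] + B[i]) * C[i];
        local += A[i];
    }
    /* Every rank contributes its partial sum; every rank gets the total. */
    MPI_Allreduce(&local, &sum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    printf("sum = %f\n", sum);
    MPI_Finalize();
    return 0;
}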


Data Flow Architectures

[Figure: the data flow graph for the simple problem again. Execute the data flow graph directly; no control sequencing.]


Data Flow Architectures

• Explicitly represent data dependencies (Data Flow Graph)

• No artificial constraints, like sequencing instructions!
– Early machines had no registers or cache
• Instructions can "fire" when operands are ready
– Remember Tomasulo's algorithm
• How do we know when operands are ready?
• Matching store
– Large associative search!
• Later machines moved to coarser grain (threads)
– Allowed registers and cache for local computation
– Introduced messages (with operations and operands)
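To make the firing rule concrete, here is a toy C sketch of dataflow execution (entirely my illustration): each node counts arriving operand tokens and fires when the count reaches its arity, much like a reservation station whose tag match succeeds.

#include <stdio.h>

typedef struct {
    char op;            /* '+' or '*' */
    int needed;         /* operands required before the node can fire */
    int arrived;        /* operand tokens received so far */
    float operand[2];
    int consumer;       /* node receiving our result; -1 = final output */
} Node;

static Node graph[5];

/* Deliver a token to a node; fire the node once all operands are there. */
static void send_token(int dst, float value) {
    if (dst < 0) { printf("result: %f\n", value); return; }
    Node *n = &graph[dst];
    n->operand[n->arrived++] = value;
    if (n->arrived == n->needed) {    /* the "matching store" succeeds */
        float r = (n->op == '+') ? n->operand[0] + n->operand[1]
                                 : n->operand[0] * n->operand[1];
        send_token(n->consumer, r);   /* fire: forward the result token */
    }
}

int main(void) {
    /* Graph for (A[0]+B[0])*C[0] + (A[1]+B[1])*C[1]: */
    float A[2] = {1, 2}, B[2] = {3, 4}, C[2] = {5, 6};
    graph[0] = (Node){'+', 2, 0, {0, 0}, 1};   /* A[0]+B[0] -> node 1 */
    graph[1] = (Node){'*', 2, 0, {0, 0}, 4};   /* ... * C[0] -> node 4 */
    graph[2] = (Node){'+', 2, 0, {0, 0}, 3};   /* A[1]+B[1] -> node 3 */
    graph[3] = (Node){'*', 2, 0, {0, 0}, 4};   /* ... * C[1] -> node 4 */
    graph[4] = (Node){'+', 2, 0, {0, 0}, -1};  /* final sum -> output  */
    /* Tokens may arrive in any order; execution is driven purely by data. */
    send_token(0, A[0]); send_token(0, B[0]); send_token(1, C[0]);
    send_token(2, A[1]); send_token(2, B[1]); send_token(3, C[1]);
    return 0;  /* prints result: 56.000000 */
}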