18-742 Lecture 1: Multiprocessor Computer Architecture
Spring 2005
Prof. Babak Falsafi
http://www.ece.cmu.edu/~ece742

Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Reinhardt, Smith, and Singh of University of Illinois, Carnegie Mellon University, University of Wisconsin, Duke University, University of Michigan, and Princeton University.
Alpha 21364 Shared-Memory Multiprocessor
18-742 (C) 2005 Babak Falsafi from Adve, Falsafi, Hill, Lebeck, Reinhardt, Smith & Singh
Homework 0
• Due by this Friday
• My way to learn about you
• Learn your name and computer architecture background

No homeworks graded until homework 0 is handed in!!
Readings
• Reader 1
  – The book will arrive later. There will be copies of chapter 1 outside of A302 after class
  – Chapter 9 of H,J&S will be on the web
18-742 Class Info
• Instructor: Professor Babak Falsafi
  – URL: http://www.ece.cmu.edu/~babak
• Research interests:
  – Bridging the processor/memory performance gap
  – Nanoscale CMOS computer systems
  – Analytic & simulation tools for design evaluation
• TA: Brian Gold ([email protected])
• Administrative Assistant: Matt Koeske (koeske@ece)
• Class info:
  – URL: http://www.ece.cmu.edu/~ece742
    » Username & password provided in class
  – Newsgroup: cmu.ece.class.ece742 (use nntp.ece.cmu.edu)
18-742 Class Meeting Time
• Notice!
  – Each lecture is 80 minutes long
  – Class meets 3:00pm - 4:20pm on MW
  – Office hours: Mon. 2:00-3:00 & Fri. 3:00-4:00 pm
  – Brian’s office hours: Wed. 11:00-12:00 & Thur. 3:30-4:30
Who Should Take 18-742?
• Graduate students (MS/IMB/PhD)
  1. Computer architects to be
  2. Computer system designers
• Required background
  – 18-741 (Advanced Computer Architecture)
• Course expectations:
  – Heavily discussion oriented
  – Will loosely follow the text (Culler & Singh)
  – Emphasis on cutting-edge issues/research
  – Students will review a list of research papers (almost weekly)
  – Independent research project (a major component)
  – Individual feedback upon request (come to office hours!)
18-742: Components
• Text
  – Parallel Computer Architecture: A Hardware/Software Approach
  – Recommended: Readings in Computer Architecture
• Readers + Review
  – List of papers: classic + state of the art
  – Short written review per paper
• Homework (individual write-up, group discussion okay)
• Project
  – Mostly original research
  – Groups of two
• Heavy use of the web page for interaction!
  – www.ece.cmu.edu/~ece742
Grading
• Grade breakdown
  – Project: 30%
  – Homework: 20%
  – Exam 1: 25%
  – Exam 2: 25%
• Participation + Discussion count toward the Project grade
Grading (Cont.)
• No late homeworks, no kidding
• Group studies are encouraged
• Group discussions are encouraged
• All homeworks must be the result of individual work (identify the persons you discussed the problem with)

• There is no tolerance for academic dishonesty. Please refer to the University Policy on cheating and plagiarism. Discussion and group studies are encouraged, but all submitted material must be the student's individual work (or, in the case of the project, individual group work).
Computer Architecture Curriculum
18-447 — Introduction to computer architecture
18-741 — Advanced computer architecture
18-742 — Multiprocessor architecture
18-743 — Power-aware architecture
18-744 — Microarchitecture design and implementation
18-747 — Advanced topics

• Related areas:
  » Circuits: VLSI, digital circuit design, CAD
  » Systems: compilers, OS, database systems, networks, embedded computing, fault-tolerant computing
  » Evaluation: queuing theory, analysis of variance, confidence intervals, etc.
Computer Architecture Activity
Computer Architecture Lab at Carnegie Mellon (CALCM) @ www.ece.cmu.edu/CALCM

Send mail to calcm-list-request@ece with body: subscribe calcm-list

Seminars
  – CALCM weekly seminar
  – LCS/SDI/Intel weekly seminar
  – CSSI weekly seminar
  – Computer systems lecture series
Outline
• Motivation & Applications
• Theory, History, & A Generic Parallel Machine
• UMA, NUMA, Examples, & More History
• Programming Models
  – Shared Memory
  – Message Passing
  – Data Parallel
• Issues in Programming Models
  – Function: naming, operations, & ordering
  – Performance: latency, bandwidth, etc.
Motivation
• 18-741 is about computers with one processor

• This course uses N processors in a computer to get
  – Higher throughput via many jobs in parallel
  – Improved cost-effectiveness (e.g., adding 3 processors may yield 4X throughput for 2X system cost)
  – Lower latency from shrink-wrapped software (e.g., databases and web servers today, but more tomorrow)
  – Lower latency through parallelizing your application (but this is hard)

• Need faster than today’s microprocessor?
  – Wait for tomorrow’s microprocessor
  – Use many microprocessors in parallel
Applications: Science and Engineering
• Examples
  – Weather prediction
  – Evolution of galaxies
  – Oil reservoir simulation
  – Automobile crash tests
  – Drug development
  – VLSI CAD
  – Nuclear BOMBS!
• Typically model physical systems or phenomena
• Problems are 2D or 3D
• Usually require “number crunching”
• Involve “true” parallelism
Applications: Commercial
• Examples
  – On-line transaction processing (OLTP)
  – Decision support systems (DSS)
  – “app servers”
• Involve data movement, not much number crunching
  – OLTP has many small queries
  – DSS has fewer large queries
• Involve throughput parallelism
  – Inter-query parallelism for OLTP
  – Intra-query parallelism for DSS
Applications: Multi-media/home
• Examples
  – Speech recognition
  – Data compression/decompression
  – 3D graphics
• Will become ubiquitous
• Involves everything (crunching, data movement, true parallelism, and throughput parallelism)
In Theory
• Sequential
  – Time to sum n numbers? O(n)
  – Time to sort n numbers? O(n log n)
  – What model? RAM
• Parallel
  – Time to sum? Tree for O(log n)
  – Time to sort? Non-trivially O(log n)
  – What model?
    » PRAM [Fortune & Wyllie, STOC ’78]
    » P processors in lock-step
    » One memory (e.g., CREW for concurrent read, exclusive write)
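The O(log n) tree sum can be sketched as a level-by-level reduction: on a PRAM, all of the pairwise additions within one level would execute in a single parallel step. A minimal sketch (the name `tree_sum` is ours, not from the slides; the loop simulates one parallel step per iteration):

```python
# Tree reduction: sum n numbers in O(log n) parallel steps.
# On a PRAM each level's additions run simultaneously; here each
# iteration of the while loop stands in for one parallel step.
def tree_sum(values):
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:          # pad odd-length levels
            level.append(0)
        # one parallel step: pairwise sums across the whole level
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

print(tree_sum(range(1, 9)))  # 36, computed in log2(8) = 3 steps
```

With 8 inputs the sequential loop needs 7 additions in a row; the tree needs the same 7 additions but only 3 dependent steps.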
Perfection: the PRAM Model
• Parallel RAM
• Fully shared memory
• Unit latency
• No memory contention (unrestricted bandwidth)
• Data placement unimportant

(Figure: processors connected to main memory through an interconnection network, with unit latency and no contention anywhere.)
Perfection is NOT Achievable
• Latencies grow as the system size grows
• Bandwidths are restricted by memory organizations and interconnection networks
• Dealing with reality leads to a division between
  – UMA: Uniform Memory Access
  – NUMA: Non-Uniform Memory Access
But in Practice, How Do You
• Name a datum (e.g., all say a[i])?
• Communicate values?
• Coordinate and synchronize?
• Select processing node size (few-bit ALU to a PC)?
• Select number of nodes in system?
Historical View
(Figure: three ways to build a multiprocessor from processor (P), memory (M), and I/O nodes, classified by where the nodes join and how you program them:
  – Join at I/O (network) --> program with Message Passing
  – Join at Memory --> program with Shared Memory
  – Join at Processor (Dataflow/Systolic, Single-Instruction Multiple-Data (SIMD)) --> program with Data Parallel)
Historical View, cont.
• Machine --> Programming Model
  – Join at network, so program with message passing model
  – Join at memory, so program with shared memory model
  – Join at processor, so program with SIMD or data parallel
• Programming Model --> Machine
  – Message-passing programs on message-passing machine
  – Shared-memory programs on shared-memory machine
  – SIMD/data-parallel programs on SIMD/data-parallel machine
• But
  – Isn’t hardware basically the same? Processors, memory, & I/O?
  – Why not have a generic parallel machine and program with the model that fits the problem?
A Generic Parallel Machine
• Separation of programming models from architectures
• All models require communication
• Node with processor(s), memory, communication assist
(Figure: four nodes (Node 0 through Node 3), each containing processor(s) (P), cache ($), memory (Mem), and a communication assist (CA), connected by an interconnect.)
Today’s Parallel Computer Architecture
• Extension of traditional computer architecture to support communication and cooperation
– Communications architecture
(Figure: layers of the communication architecture. At user level, the programming model: multiprogramming, shared memory, message passing, or data parallel. Below it, the communication abstraction, implemented at system level by library, compiler, and operating system support. Below the hardware/software boundary, the communication hardware and the physical communication medium.)
Simple Problem
for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
• How do I make this parallel?
Simple Problem
for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

• Split the loops
  » Independent iterations

for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]

for i = 1 to N
    sum = sum + A[i]
• Data flow graph?
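The split above separates a parallelizable map from a reduction: the first loop's iterations are independent, while the second carries a dependence through sum. A minimal Python sketch (the name `simple_problem` is ours):

```python
# First loop: independent iterations -- every A[i] could be computed
# by a different processor. Second loop: a reduction over A, which
# carries a dependence through the running sum.
def simple_problem(A, B, C):
    A = [(a + b) * c for a, b, c in zip(A, B, C)]  # parallelizable map
    return A, sum(A)                               # sequential reduction

A, total = simple_problem([1, 2], [3, 4], [5, 6])
print(A, total)  # [20, 36] 56
```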
Data Flow Graph
(Figure: data flow graph for the simple problem. For each i, A[i] and B[i] feed a + node; its result and C[i] feed a * node; the four products are then combined by a chain of + nodes into sum.)
2 + N-1 cycles to execute on N processors. What assumptions?
Partitioning of Data Flow Graph
(Figure: the same data flow graph partitioned across processors, with a global synchronization before the final combination of partial sums.)
Programming Model
• Historically, Programming Model == Architecture
• Thread(s) of control that operate on data
• Provides a communication abstraction that is a contract between hardware and software (a la ISA)

Current Models
• Shared Memory
• Message Passing
• Data Parallel (Shared Address Space)
• (Data Flow)
Shared (Physical) Memory
• Communication, sharing, and synchronization with store / load on shared variables
• Must map virtual pages to physical page frames
• Consider OS support for good mapping
(Figure: machine physical address space. Each processor P0..Pn maps the private portion of its virtual address space to its own private physical pages (P0 Private, P1 Private, ...), and the shared portion to common physical addresses, accessed with ordinary loads and stores.)
Shared (Physical) Memory on Generic MP
(Figure: the generic machine with four nodes on an interconnect; the physical address space is partitioned so that Node 0 holds addresses 0..N-1, Node 1 holds N..2N-1, Node 2 holds 2N..3N-1, and Node 3 holds 3N..4N-1.)

Keep private data and frequently used shared data on the same node as the computation.
Return of The Simple Problem
private int i, my_start, my_end, mynode;
shared float A[N], B[N], C[N], sum;

for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
GLOBAL_SYNCH;
if (mynode == 0)
    for i = 1 to N
        sum = sum + A[i]
• Can run this on any shared memory machine
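The shared-memory pseudocode above can be sketched with Python threads standing in for nodes: all threads update the shared array A in place, a barrier plays the role of GLOBAL_SYNCH, and node 0 alone computes the sum afterward. The names (`worker`, `NODES`) and sizes are illustrative, not from the slides:

```python
# Shared-memory version of the simple problem: each thread ("node")
# owns a contiguous chunk of the shared array A and updates it in
# place; the barrier is the GLOBAL_SYNCH; node 0 does the reduction.
import threading

N, NODES = 8, 2
A = [float(i) for i in range(N)]   # shared data
B = [1.0] * N
C = [2.0] * N
result = [0.0]
barrier = threading.Barrier(NODES)

def worker(mynode):
    my_start = mynode * (N // NODES)
    my_end = my_start + N // NODES
    for i in range(my_start, my_end):   # each node's slice of the map
        A[i] = (A[i] + B[i]) * C[i]
    barrier.wait()                      # GLOBAL_SYNCH
    if mynode == 0:                     # only node 0 runs the reduction
        result[0] = sum(A)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(NODES)]
for t in threads: t.start()
for t in threads: t.join()
print(result[0])  # 72.0
```

The barrier matters: without it, node 0 could read elements of A that another node has not yet updated.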
Message Passing Architectures
• Cannot directly access memory on another node
• IBM SP-2, Intel Paragon
• Cluster of workstations
(Figure: the same four-node machine, but each node's memory holds its own private addresses 0..N-1; no node can directly address another node's memory.)
Message Passing Programming Model
• User level send/receive abstraction
  – Local buffer (x,y), process (Q,P) and tag (t)
  – Naming and synchronization
(Figure: process P executes Send x, Q, t; process Q executes Recv y, P, t. The send out of local address x in P's address space is matched by process and tag to the receive into local address y in Q's address space.)
The Simple Problem Again
int i, my_start, my_end, mynode;
float A[N/P], B[N/P], C[N/P], sum;

for i = 1 to N/P
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]
if (mynode != 0)
    send(sum, 0);
if (mynode == 0)
    for i = 1 to P-1
        recv(tmp, i)
        sum = sum + tmp
• Send/Recv communicates and synchronizes
• P processors
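The message-passing pseudocode can be sketched the same way, with one mailbox queue per node standing in for the network: each node holds its own private slice of A (nothing is shared), non-zero nodes send their partial sum to node 0, and node 0 blocks on P-1 receives. This is a hedged simulation, not MPI; the names (`node`, `inbox`) are ours:

```python
# Message-passing version: each "node" owns private data only.
# Queues stand in for the interconnect; a blocking get() gives
# recv its synchronizing behavior.
import threading, queue

P, N = 4, 16
inbox = [queue.Queue() for _ in range(P)]   # one mailbox per node
result = [0.0]

def node(mynode):
    # private data: this node's N/P elements of A, B, C
    A = [float(mynode * (N // P) + i) for i in range(N // P)]
    B = [1.0] * (N // P)
    C = [2.0] * (N // P)
    partial = 0.0
    for i in range(N // P):
        A[i] = (A[i] + B[i]) * C[i]
        partial += A[i]
    if mynode != 0:
        inbox[0].put(partial)               # send(sum, 0)
    else:
        for _ in range(P - 1):
            partial += inbox[0].get()       # recv(tmp, i): blocks until sent
        result[0] = partial

threads = [threading.Thread(target=node, args=(n,)) for n in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print(result[0])  # 272.0
```

Note how the blocking receive gives both communication and synchronization in one operation, exactly the point of the slide.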
Separation of Architecture from Model
• At the lowest level SM sends messages
  – HW is specialized to expedite read/write messages
• What programming model / abstraction is supported at user level?
• Can I have shared-memory abstraction on message passing HW?
• Can I have message passing abstraction on shared memory HW?
Data Parallel
• Programming Model
  – Operations are performed on each element of a large (regular) data structure in a single step
  – Arithmetic, global data transfer
• A processor is logically associated with each data element
• Early architectures directly mirrored the programming model
  – Many, bit-serial processors
• Today we have FP units and caches on microprocessors
• Can support the data parallel model on SM or MP architecture
The Simple Problem Strikes Back
Assuming we have N processors:

A = (A + B) * C
sum = global_sum(A)

• Language supports array assignment
• Special HW support for global operations
• CM-2: bit-serial processors
• CM-5: 32-bit SPARC processors
  – Message Passing and Data Parallel models
  – Special control network
• Chapter 11…
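The data-parallel style above, whole-array operations in single logical steps, can be imitated in plain Python; list comprehensions stand in for the per-element processors, and `global_sum` is a stand-in for the hardware-supported global operation (on a CM-5 the control network would do this), not a real API:

```python
# Data-parallel sketch: each statement is one whole-array step.
def global_sum(A):
    # stand-in for the hardware global operation (e.g., CM-5 control net)
    return sum(A)

A = [1.0, 2.0, 3.0, 4.0]
B = [1.0] * 4
C = [2.0] * 4

A = [(a + b) * c for a, b, c in zip(A, B, C)]  # A = (A + B) * C, one step
total = global_sum(A)                          # sum = global_sum(A)
print(A, total)  # [4.0, 6.0, 8.0, 10.0] 28.0
```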
Data Flow Architectures
(Figure: the data flow graph for the simple problem again, this time executed directly by the machine.)

Execute the Data Flow Graph; no control sequencing.
Data Flow Architectures
• Explicitly represent data dependencies (Data Flow Graph)
• No artificial constraints, like sequencing instructions!
  – Early machines had no registers or cache
• Instructions can “fire” when operands are ready
  – Remember Tomasulo’s algorithm
• How do we know when operands are ready?
• Matching store
  – Large associative search!
• Later machines moved to coarser grain (threads)
  – Allowed registers and cache for local computation
  – Introduced messages (with operations and operands)
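The firing rule and the matching store can be illustrated with a toy interpreter: each two-operand instruction fires as soon as its second operand arrives, and a dictionary plays the role of the associative matching store. The graph encoding and all names here are illustrative, not any real machine's ISA:

```python
# Toy dataflow interpreter. Tokens carry (destination node, value);
# a node fires when both of its operands have arrived. The `waiting`
# dict is a tiny stand-in for the matching store.
import operator

ops = {"+": operator.add, "*": operator.mul}

def run(graph, tokens):
    """graph: {node: (op, dest_node)}; dest None marks the result."""
    waiting = {}                 # matching store: first operand per node
    while tokens:
        node, value = tokens.pop()
        if node is None:         # token reached the output
            return value
        if node not in waiting:  # partner not yet arrived: store and wait
            waiting[node] = value
        else:                    # both operands present: instruction fires
            op, dest = graph[node]
            tokens.append((dest, ops[op](waiting.pop(node), value)))

# (A[0] + B[0]) * C[0] as a two-node graph: add0 feeds mul0
graph = {"add0": ("+", "mul0"), "mul0": ("*", None)}
print(run(graph, [("add0", 1.0), ("add0", 3.0), ("mul0", 5.0)]))  # 20.0
```

There is no program counter anywhere: execution order is determined entirely by token arrival, which is the point of the slide (and why the matching store's associative search became the bottleneck at scale).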