CISC 662 Graduate Computer Architecture
Lecture 1 - Introduction Michela Taufer
http://www.cis.udel.edu/~taufer/courses
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture: A Quantitative Approach, 4th edition.
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley).
2
Course Overview
3
CISC662: Information
Instructor: Michela Taufer
Office: 406 Smith Hall
Office Hours: TR 10:45AM – 11:45AM or by appointment
TA: Richard Burns (Office Hours: TBD)
Class: TR 3:30PM – 4:45PM
Text: Computer Architecture: A Quantitative Approach, Fourth Edition (2006)
Web page: http://www.cis.udel.edu/~taufer/courses
Lectures available on the course webpage 24 hours before class
Mailing list: http://gcl.cis.udel.edu/mailman/listinfo/cisc662010_fa09
Email: Instructor: [email protected] or [email protected]
4
CISC 662 Course Focus: Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century
[Diagram: computer architecture at the intersection of Technology, Programming Languages, Operating Systems, History, Applications, Interface Design (ISA), Measurement & Evaluation, Parallelism, and Compilers]
Computer Architecture:
• Instruction Set Design
• Organization
• Hardware/Software Boundary
5
Tentative Topics Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Ed., 2006
Tentative Schedule:
• 2.5 weeks: Fundamentals of Computer Architecture, Instruction Set Architecture
• 1.5 weeks: Pipelining
• 3.0 weeks: Instruction-Level Parallelism
• 1.5 weeks: Multiprocessors and Thread-Level Parallelism
• 3.0 weeks: Memory and Memory Hierarchy
6
Grading • Grade based on:
• Homework assignments
• Midterm exam
• Final exam
• Reading assignments and participation in class (questions and question discussion)
7
Participation • Complete reading and homework assignments on time • Print and review slides before coming to class
– Slides will be available 24 hours before the class starts
8
Getting Help • Course webpage at http://cis.udel.edu/~taufer/courses
– Copies of lectures and project assignments
– Clarifications to assignments, deadlines
– Syllabus and class schedule including deadlines
– User: cisc662student – Password: Study4Fun!
• Discussions through the mailing list
– Clarifications to assignments, general discussion
– Send e-mail to all with: [email protected]
– You MUST register for the list at: http://gcl.cis.udel.edu/mailman/listinfo/cisc662010_fa09
• Personal help
– Benefit from office hours
9
Cheating • What is cheating?
– Sharing code: either by copying, retyping, looking at, or supplying a copy of a file.
• What is NOT cheating? – Helping others use systems or tools. – Helping others with high-level design issues. – Helping others debug their code.
• Penalty for cheating: – Removal from course with failing grade.
10
Concepts in Architecture
11
What’s Inside a Computer?
[Diagram: CPU (instruction decoder, ALU, clock), memory hierarchy (cache, main memory, disk), and input/output units with an I/O controller]
12
What Does Each Unit Do?
[Diagram: the CPU (instruction decoder, ALU, clock) executes a+b=c; operands a and b are fetched from memory (cache, main memory, disk); the result c is sent through the I/O controller to the I/O units to print c]
13
What is “Computer Architecture”?
[Diagram: levels of abstraction, from Applications down through Compiler, Operating System, Firmware, Instruction Set Architecture, Instr. Set Proc. / I/O system, Datapath & Control, Digital Design, Circuit Design, Layout & fab, to Semiconductor Materials]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
14
Technology constantly on the move!
• All major manufacturers have announced and/or are shipping multi-core processor chips
• Intel is talking about 80 cores in the not-too-distant future
• 3-dimensional chip technology
– Sandwiches of silicon
– “Through-vias” for communication
• Number of transistors/die keeps increasing
– Intel Core 2: 65nm, 291 million transistors!
– Intel Pentium D 900: 65nm, 376 million transistors!
[Image: Intel Core Duo]
15
Moore’s Law
• “Cramming More Components onto Integrated Circuits” – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles every 18 months
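The doubling claim is easy to turn into arithmetic. A minimal sketch; the 1971 Intel 4004's roughly 2,300 transistors and the 30-year horizon are illustrative inputs, not from the slide:

```python
def transistors(start_count, years, doubling_period_years=1.5):
    """Project transistor count assuming doubling every 18 months."""
    return start_count * 2 ** (years / doubling_period_years)

# Starting from ~2,300 transistors (Intel 4004, 1971), project 30 years out:
projected = transistors(2_300, 30)
print(f"{projected:,.0f}")  # on the order of a few billion transistors
```

Thirty years is twenty doubling periods, so the projection is start_count × 2^20, which is why the curve outruns every other technology trend on these slides.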
16
Computer Architecture’s Changing Definition
• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISA appropriate for compilers
• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2000s: Multi-core design, on-chip networking, parallel programming paradigms, power reduction
• 2010s: Computer Architecture Course: Self adapting systems? Self organizing structures? DNA Systems/Quantum Computing?
17
The Instruction Set: a Critical Interface
[Diagram: the instruction set as the interface between software and hardware]
• Properties of a good abstraction – Lasts through many generations (portability) – Used in many different ways (generality) – Provides convenient functionality to higher levels – Permits an efficient implementation at lower levels
18
Instruction Set Architecture ... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
– Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
19
Computer Architecture is an Integrated Approach
• What really matters is the functioning of the complete system
– hardware, runtime system, compiler, operating system, and application
– In networking, this is called the “End to End argument”
• Computer architecture is not just about transistors, individual instructions, or particular implementations
– E.g., Original RISC projects replaced complex instructions with a compiler + simple instructions
• It is very important to think across all hardware/software boundaries
– New technology ⇒ New Capabilities ⇒ New Architectures ⇒ New Tradeoffs
– Delicate balance between backward compatibility and efficiency
20
Elements of an ISA • Set of machine-recognized data types
– bytes, words, integers, floating point, strings, . . .
• Operations performed on those data types – Add, sub, mul, div, xor, move, ….
• Programmable storage – regs, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
– Literal, reg., absolute, relative, reg + offset, …
• Format (encoding) of the instructions – Op code, operand fields, …
21
Example: MIPS R3000
Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs r1–r31 (r0 = 0); 32 x 32-bit FP regs (paired DP); HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary
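The “Format?” question above can be made concrete with the standard MIPS R-type layout: op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6). A minimal sketch that packs the fields into one 32-bit word:

```python
def encode_rtype(op, rs, rt, rd, shamt, funct):
    """Pack MIPS R-type fields into a 32-bit word: op|rs|rt|rd|shamt|funct."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# ADD $3, $1, $2: R-type, opcode 0, funct 0x20
word = encode_rtype(0, 1, 2, 3, 0, 0x20)
print(f"0x{word:08X}")  # 0x00221820
```

Every instruction being exactly 32 bits on a word boundary is what makes fetch and decode so simple on this machine.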
22
ISA vs. Computer Architecture • Old definition of computer architecture
= instruction set design – Other aspects of computer design called implementation – Insinuates implementation is uninteresting or less challenging
• Our view is computer architecture >> ISA • Architect’s job much more than instruction set
design; technical hurdles today more challenging than those in instruction set design
• Since instruction set design not where action is, some conclude computer architecture (using old definition) is not where action is
– We disagree on conclusion – Agree that ISA not where action is (ISA in CA:AQA 4/e appendix)
23
Computer Architecture Topics
[Diagram of topic areas:]
• Instruction Set Architecture
• Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy (L1 cache, L2 cache, DRAM): addressing, protection, exception handling; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies
• Input/Output and Storage (disks, WORM, tape): RAID
• VLSI
• Network Communication with other processors
24
Computer Architecture Topics
[Diagram: processors (P) with memories (M) connected by an interconnection network with switches (S)]
• Topologies, routing, bandwidth, latency, reliability
• Network interfaces
• Shared memory, message passing, data parallelism
• Processor-Memory-Switch
Multiprocessors, Networks and Interconnections
25
Fundamental Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction
[Diagram: processor (regs, F.U.s) connected to memory (program, data) across the von Neumann bottleneck]
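The cycle above can be sketched as an interpreter loop. The three-instruction toy ISA here is hypothetical, chosen only to make the fetch, decode, operand fetch, execute, result store, and next-instruction steps visible:

```python
# Hypothetical toy ISA: each instruction is a tuple (opcode, operands...).
def run(program, memory):
    regs = [0] * 4
    pc = 0
    while pc < len(program):
        op, *args = program[pc]           # instruction fetch + decode
        if op == "load":
            regs[args[0]] = memory[args[1]]   # operand fetch from memory
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]  # execute
        elif op == "store":
            memory[args[1]] = regs[args[0]]   # result store
        pc += 1                           # next instruction
    return memory

mem = {0: 2, 1: 3, 2: 0}
run([("load", 0, 0), ("load", 1, 1), ("add", 2, 0, 1), ("store", 2, 2)], mem)
print(mem[2])  # 5
```

Note that both the program and the data live in the same memory dictionary here, which is exactly the shared path the von Neumann bottleneck refers to.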
26
Pipelined Instruction Execution
[Diagram: four instructions in program order overlapped over clock cycles 1–7; each instruction passes through Ifetch, Reg, ALU, DMem, and Reg stages, advancing one stage per cycle]
27
Limits to Pipelining
• Maintain the von Neumann “illusion” of one instruction at a time execution
• Hazards prevent next instruction from executing during its designated clock cycle
– Structural hazards: attempt to use the same hardware to do two different things at once
– Data hazards: Instruction depends on result of prior instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
• Power: Too many things happening at once ⇒ Melt your chip!
– Must disable parts of the system that are not being used – Clock Gating, Asynchronous Design, Low Voltage Swings, …
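A rough way to see the cost of data hazards is to count cycles in a toy model. The sketch below assumes a 5-stage pipeline with no forwarding and charges a fixed 2-cycle stall for a RAW dependence on the immediately preceding instruction; real penalties depend on pipeline depth and forwarding paths:

```python
# Toy model: 5-stage pipeline, no forwarding. A RAW dependence on the
# immediately preceding instruction costs a fixed number of stall cycles.
def pipeline_cycles(instrs, depth=5, stall_per_hazard=2):
    """instrs: list of (dest_reg, src_regs). Returns total cycles."""
    cycles = depth + len(instrs) - 1      # ideal: fill time + 1 instr/cycle
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] in cur[1]:             # RAW hazard on previous result
            cycles += stall_per_hazard
    return cycles

# add r1,...; add r2,r1,... (dependent); add r3,... (independent)
print(pipeline_cycles([(1, ()), (2, (1,)), (3, ())]))  # 7 ideal + 2 stall = 9
```

Structural and control hazards would add further terms; the point is only that each hazard turns an ideal one-instruction-per-cycle schedule into something slower.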
28
Progression of ILP • 1st generation RISC - pipelined
– Full 32-bit processor fit on a chip => issue almost 1 IPC » Need to access memory 1+x times per cycle
– Floating-Point unit on another chip – Cache controller a third, off-chip cache – 1 board per processor multiprocessor systems
• 2nd generation: superscalar – Processor and floating point unit on chip (and some cache) – Issuing only one instruction per cycle uses at most half – Fetch multiple instructions, issue couple
» Grows from 2 to 4 to 8 … – How to manage dependencies among all these instructions? – Where does the parallelism come from?
• VLIW (Very Long Instruction Word) – Expose some of the ILP to compiler, allow it to schedule
instructions to reduce dependences
29
Modern ILP • Dynamically scheduled, out-of-order execution
– Current microprocessors fetch 10s of instructions per cycle – Pipelines are 10s of cycles deep ⇒ many 10s of instructions in execution at once
• What happens: – Grab a bunch of instructions, determine all their dependences,
eliminate dep’s wherever possible, throw them all into the execution unit, let each one move forward as its dependences are resolved
– Appears as if executed sequentially – On a trap or interrupt, capture the state of the machine
between instructions perfectly
• Huge complexity – Complexity of many components scales as n² (issue width) – Power consumption big problem
30
Have we reached the end of ILP? • Multiple processors easily fit on a chip • Every major microprocessor vendor
has gone to multithreading – Thread: locus of control, execution context – Fetch instructions from multiple threads at once,
throw them all into the execution unit – Intel: hyperthreading – Concept has existed in high performance computing
for 20 years (or is it 40? CDC6600)
• Vector processing – Each instruction processes many distinct data – Ex: MMX
• Raise the level of architecture – many processors per chip
[Image: Tensilica configurable processor]
31
The Memory Abstraction
• Association of <name, value> pairs – typically named as byte addresses – often values aligned on multiples of size
• Sequence of Reads and Writes • Write binds a value to an address • Read of addr returns most recently written
value bound to that address
[Diagram: memory interface signals — address (name), command (R/W), write data, read data, done]
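The <name, value> abstraction maps naturally onto a sparse dictionary; a minimal sketch:

```python
# The <name, value> abstraction: a write binds a value to an address,
# a read returns the most recently written value for that address.
class Memory:
    def __init__(self):
        self._cells = {}           # sparse map: address -> value
    def write(self, addr, value):
        self._cells[addr] = value  # bind value to name
    def read(self, addr):
        return self._cells[addr]   # most recent binding wins

m = Memory()
m.write(0x1000, 42)
m.write(0x1000, 99)     # rebinding overwrites the old value
print(m.read(0x1000))   # 99
```

This "read returns the last write" contract is trivial for one processor; the multiprocessor slides later show why it becomes hard to maintain.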
32
Processor-DRAM Memory Gap (latency)
[Chart: performance (log scale, 1 to 1000) vs. year, 1980–2000. µProc performance improves 60%/yr (2X/1.5yr); DRAM improves 9%/yr (2X/10 yrs). The processor-memory performance gap grows 50% per year.]
33
Levels of the Memory Hierarchy
Capacity / access time / cost, and staging transfer unit per level (circa 1995 numbers):
• CPU registers: 100s of bytes, << 1 ns; instr. operands, 1-8 bytes, managed by program/compiler
• Cache: 10s-100s of KBytes, ~1 ns, $1s/MByte; blocks, 8-128 bytes, managed by cache controller
• Main memory: MBytes, 100-300 ns, < $1/MByte; pages, 512-4K bytes, managed by OS
• Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.001/MByte; files, MBytes, managed by user/operator
• Tape: infinite capacity, sec-min access, $0.0014/MByte
Upper levels are faster; lower levels are larger.
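A standard way to reason about this hierarchy is average memory access time, AMAT = hit time + miss rate × miss penalty. The numbers below are illustrative assumptions, not the slide's:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative (assumed) numbers: 1 ns cache hit, 5% miss rate,
# 100 ns main-memory miss penalty.
print(amat(1.0, 0.05, 100.0))  # 6.0 ns on average
```

Even a 5% miss rate makes the average access 6x the hit time, which is why the hierarchy's miss rates matter far more than raw capacities.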
34
The Principle of Locality • The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality: – Temporal Locality (Locality in Time): If an item is referenced, it will tend to
be referenced again soon (e.g., loops, reuse) – Spatial Locality (Locality in Space): If an item is referenced, items whose
addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Last 30 years, HW relied on locality for speed
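A tiny direct-mapped cache model makes both kinds of locality measurable; the cache geometry below (64 lines, 16-byte blocks) is an assumption for illustration:

```python
# Minimal sketch: count misses in a tiny direct-mapped cache to show why
# sequential (spatially local) access beats large-stride access.
def misses(addresses, num_lines=64, block=16):
    tags = [None] * num_lines
    miss_count = 0
    for addr in addresses:
        block_no = addr // block          # which memory block this byte is in
        line = block_no % num_lines       # direct-mapped placement
        if tags[line] != block_no:        # miss: fetch the whole block
            tags[line] = block_no
            miss_count += 1
    return miss_count

seq = misses(range(0, 4096))               # stride-1: one miss per 16-byte block
strided = misses(range(0, 4096 * 16, 16))  # stride = block size: every access misses
print(seq, strided)  # 256 4096
```

The stride-1 walk pays one miss per block and then gets 15 free hits from spatial locality; the strided walk touches a new block every time and gets none.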
35
Memory Abstraction and Parallelism • Maintaining the illusion of sequential access to memory
across a distributed system • What happens when multiple processors access the same
memory at once? – Do they see a consistent picture?
• Processing and processors embedded in the memory?
[Diagram: processors P1 … Pn, each with a cache ($), connected to memories (Mem) through an interconnection network, shown in two organizations]
Deadlines and News
News:
• 08-20-09: Syllabus is now available
• 08-20-09: Password for slide and lab web pages was given in class
Lecture Topics:
• 09-03-09: Lec02 - Performance and ISAs
• 09-01-09: Lec01 - Course presentation and assessment test (today)
Reading Assignments:
• 09-01-09: Chapter 1 and Appendix B of the book
Deadlines:
• 09-07-09 (11:59PM): Questions 1 on Chapter 1
36