Qiang Xu CUHK, Fall 2013
Part.1 .1
ENGG 5101
Advanced Computer Architecture
Lecture 01 - Introduction XU, Qiang (Johnny) 徐強
Course General Information ¨ Instructor: Qiang Xu
* http://www.cse.cuhk.edu.hk/~qxu * Office hours: 1-3pm, Tuesday
¨ Course Info * http://www.cse.cuhk.edu.hk/engg5101
¨ TA: Zelong Sun * [email protected]
¨ Check student/faculty expectations on teaching and learning (available on course webpage)
Course Objective ¨ Learn the organizational paradigms that determine
the capabilities, performance, power consumption and reliability of computer systems * The what, the how, and more importantly, the why * Processor microarchitecture * Memory hierarchies and cache coherence
¨ Focus on parallel organization and design, e.g., superscalar/VLIW and multiprocessors
¨ Learn how to read and evaluate research papers
Prerequisites
¨ Basic courses in * Digital Design * Hardware Organization/Computer Architecture
References
¨ Reference Books
  * M. Dubois, M. Annavaram, P. Stenstrom, Parallel Computer Organization and Design, Cambridge, 2012.
  * J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2012.
¨ Papers listed on the course webpage
Acknowledgement: Some slides adapted from reference slides of these books
Course Structure and Grading Scheme
¨ Lectures: * 2 weeks review of basic concepts and scalar processor * 2 weeks on advanced single-core processor * 1 week on memory systems * 1 week on power and reliability * 3 weeks on multiprocessor systems * 1 week on future trends
¨ What do you need to do? * homework assignments – 20% * 2 research essays (by group) – 30% * midterm and final exam – 50% * final exam grade must exceed 50 (out of 100) to pass!
This course is NOT for everyone!!!
What’re your Choices for Computing?
[Figure: energy efficiency (in MOPS/mW, ranging from 0.1–1 up to 100–1000) plotted against flexibility (application scope, from none to fully flexible): hardwired custom logic is the most efficient but least flexible, followed by configurable/parameterizable hardware, then domain-specific processors (e.g., GPU, DSP), and finally the fully flexible microprocessor]
Layered View of Computer Systems
Von Neumann Architecture
John von Neumann, “the last of the great mathematicians”
Alan Turing, “Father of Computer Science and AI”
Below the Program
¨ High-level language program (in C)
  swap (int v[], int k)
  {
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  }
¨ Assembly language program (for MIPS)
  swap: sll $2, $5, 2
        add $2, $4, $2
        lw  $15, 0($2)
        lw  $16, 4($2)
        sw  $16, 0($2)
        sw  $15, 4($2)
        jr  $31
¨ Machine (object) code (for MIPS) 000000 00000 00101 0001000010000000 000000 00100 00010 0001000000100000
. . .
[The C compiler translates the high-level program into assembly (one statement to many instructions, one-to-many); the assembler translates assembly into machine code (one-to-one)]
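Restoring the braces that the text extraction mangled, the high-level routine compiles as plain C:

```c
/* swap from the slide: exchange v[k] and v[k+1] */
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}
```

Calling swap(a, 1) on {10, 20, 30} leaves the array as {10, 30, 20}; the MIPS listing is one compiler's translation of exactly this body.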
Input Device Inputs Object Code
[Figure: a processor (control + datapath) connected to memory and to devices (input, output, network); the input device loads the seven machine instructions of swap into the system]
Object Code Stored in Memory
[Figure: the same diagram; the seven-instruction object code now resides in memory]
Processor Fetches an Instruction
[Figure: the same processor/memory/devices diagram]
Processor fetches an instruction from memory
Control Decodes the Instruction
[Figure: the fetched instruction word 000000 00100 00010 0001000000100000 now inside the processor]
Control decodes the instruction to determine what to execute
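As a sketch of what "decode" means here (a minimal illustration assuming the standard MIPS R-type field positions; the helper names are ours, not from the slide), the control logic extracts bit fields from the 32-bit word. For 000000 00100 00010 0001000000100000 (0x00821020) this yields rs = 4, rt = 2, rd = 2, funct = 0x20, i.e. add $2, $4, $2:

```c
#include <stdint.h>

/* MIPS R-type field layout: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
unsigned mips_op(uint32_t w)    { return (w >> 26) & 0x3Fu; }
unsigned mips_rs(uint32_t w)    { return (w >> 21) & 0x1Fu; }
unsigned mips_rt(uint32_t w)    { return (w >> 16) & 0x1Fu; }
unsigned mips_rd(uint32_t w)    { return (w >> 11) & 0x1Fu; }
unsigned mips_shamt(uint32_t w) { return (w >> 6)  & 0x1Fu; }
unsigned mips_funct(uint32_t w) { return w & 0x3Fu; }
```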
Datapath Executes the Instruction
[Figure: for the instruction 000000 00100 00010 0001000000100000, the datapath adds the contents of Reg #4 to the contents of Reg #2 and puts the result in Reg #2]
Datapath executes the instruction as directed by control
What Happens Next?
[Figure: the same diagram, annotated with the repeating Fetch → Decode → Execute cycle]
Processor Fetches the Next Instruction
[Figure: the same processor/memory/devices diagram]
Processor fetches the next instruction from memory
How does it know which location in memory to fetch from next?
Output Data Stored in Memory
[Figure: memory now holds the program's output data words]
At program completion the data to be output resides in memory
Output Device Outputs Data
[Figure: the output device reads the result data out of memory]
What Differentiates Various Computer Architectures?
¨ The conceptual design and fundamental operational structure of a computer system * Instruction set architecture (ISA)
» Programming model of a processor » Instructions, data types, registers, addressing modes, etc. » Not many ISAs survive over the years
* Microarchitecture » How to implement the ISA at a high level » Pipelining, cache, branch prediction, superscalar, out-of-order execution, register renaming, multi-this, multi-that, etc.
Modern PC Architecture
Modern Smartphone Architecture
From: TI website
Generic Parallel Compute Architecture
Moore’s Law for CPUs and DRAMs
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Main driver: device scaling ...
Secondary driver: Wafer size
Intel Core i7 Processor
45 nm technology, 18.9 mm × 13.6 mm, 0.73 billion transistors, 2008
Intel Core i7 Processor
Highest Clock Rate of Intel Processors
» Due to process improvements » Deeper pipeline » Circuit design techniques
What if the exponential increase had kept up? Why not?
What will happen??
Power Density (if Increasing Clock Rate Exponentially as Before)
[Figure: power density (W/cm²) vs. year for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 to the Pentium processor and P6, extrapolated from 1970 to 2010 on a log scale from 1 to 10,000 W/cm², crossing the levels of a hot plate, a nuclear reactor, and a rocket nozzle. Courtesy, Intel]
Power density too high to keep junctions at low temp
POWER is the King Now!
¨ Total power = dynamic power + static power
  P_dynamic = α · C · V² · f
  P_static = V · I_sub ≈ V · e^(−K·Vt/T)
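The dynamic term is easy to probe numerically; a minimal sketch (the function name and sample values below are illustrative assumptions, not figures from the slide):

```c
/* Dynamic power P = alpha * C * V^2 * f, with activity factor alpha,
   switched capacitance C (farads), supply voltage V (volts),
   and clock frequency f (Hz). */
double p_dynamic(double alpha, double c, double v, double f)
{
    return alpha * c * v * v * f;
}
```

With alpha = 0.2, C = 1 nF, V = 1 V, f = 3 GHz this gives 0.6 W; halving both V and f cuts the power by 8×, which is why voltage/frequency scaling is the main lever on the dynamic term.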
Hitting “Power Wall” - Go for Multi-Core
P. Gargini Intel Developer’s Forum 2005
Parallel Computing for Higher Performance
¨ Classes of parallelism: * Instruction-Level Parallelism (ILP)
» Pipelining, Superscalar, VLIW, EPIC * Data-Level Parallelism (DLP)
» Vector architectures, GPU, SIMD extension for multimedia * Thread-Level Parallelism
» SMT, Multiprocessor * Request-Level Parallelism
» Warehouse-scale computing
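Thread-level parallelism can be sketched with POSIX threads (an illustrative array-sum workload with helper names of our own, not an example from the slides): two workers each sum half of an array, and joining them is the synchronization point before combining the partial results.

```c
#include <pthread.h>

/* Thread-level parallelism sketch: two POSIX threads each sum half
   of a shared array; main combines the two partial sums. */
#define N 1000
static int data[N];

struct range { int lo, hi; long sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

long parallel_sum(void)
{
    struct range a = {0, N / 2, 0}, b = {N / 2, N, 0};
    pthread_t t1, t2;
    for (int i = 0; i < N; i++)
        data[i] = i;                      /* fill with 0..N-1 */
    pthread_create(&t1, NULL, partial_sum, &a);
    pthread_create(&t2, NULL, partial_sum, &b);
    pthread_join(t1, NULL);               /* wait for both workers */
    pthread_join(t2, NULL);
    return a.sum + b.sum;                 /* combine partial sums */
}
```

Even this toy shows the overheads the course will discuss: thread creation, the join synchronization, and the serial combining step.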
Amdahl’s Law
¨ Lessons learned * Focus on the common case in design!! * The law of diminishing returns
¨ In practice, super-linear speedup is observed in some rare cases; how is that possible?
[Diagram: without E, a fraction F of the execution time can be enhanced and 1−F cannot; with E applied, the F portion shrinks to F/S]
Speedup = 1 / ((1 − F) + F/S) < 1 / (1 − F)
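The formula is easy to probe numerically; a minimal sketch (the function name and sample numbers are ours):

```c
/* Amdahl's law: overall speedup when a fraction F of the execution
   time is sped up by factor S; the untouched (1 - F) part bounds
   the achievable gain at 1 / (1 - F). */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With F = 0.9, S = 10 yields only about 5.26×, and no S, however large, can push past 1/(1 − F) = 10×: the law of diminishing returns.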
Gustafson’s Law ¨ When more cores are available, the workloads also grow
  * Start with the execution time on the parallel machine with P processors: TP = s + p
    » s is the time taken by the serial code and p the time taken by the parallel code
  * The execution time on a single-core processor would then be: T1 = s + pP
  * Let F = p/(s+p). Then SP = T1/TP = (s + pP)/(s + p) = 1 − F + FP = 1 + F(P − 1)
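The closed form can be checked directly (a sketch; the function name is ours):

```c
/* Gustafson's law: scaled speedup S_P = 1 + F*(P - 1), where
   F = p/(s + p) is the parallel fraction of the P-processor run. */
double gustafson_speedup(double f, int p)
{
    return 1.0 + f * (p - 1);
}
```

With s = 1 and p = 9 (so F = 0.9) on P = 100 processors, SP = (1 + 9·100)/10 = 90.1 = 1 + 0.9·99; unlike Amdahl's fixed-workload bound, the speedup keeps growing with P because the problem grows too.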
Challenges in Parallel Computing
¨ Parallel computing has existed for decades, but it has been mainstream (even in your phone) for only a few years. Why?
  * The design of parallel architectures is difficult, but it is not a road blocker
  * Parallel programming is hard!!!
    » The shift to multicore would not have happened if there were alternatives for performance improvement that kept the programming model unchanged
Why is Parallel Programming Hard?
¨ Programmers need to find the parallelism in the algorithm
  * The good news is that emerging workloads usually have large data-level parallelism
¨ Programmers need to manage parallel overheads (e.g., communication and synchronization)
¨ Programmers often need to deal with the memory system explicitly
  * To run efficiently, a program should work on local data whenever possible
¨ Some of these difficulties may be hidden in libraries, compilers and high-level languages, but there is a long way to go
Memory Systems
¨ Growing gap between processor and memory speed, the so-called “Memory Wall”!
¨ One wants a memory system that is big, fast and cheap at the same time, how?
[Figure: processor vs. DRAM performance over time; DRAM improves at only a 1.07 compound growth rate]
Memory wall = memory_cycle / processor_cycle
  * In 1990 it was about 4 (25 MHz, 150 ns); it grew exponentially to about 200 by 2002 and has tapered off since then
Memory Hierarchy

[Figure: the memory hierarchy, from on-chip components (RegFile, Instr Cache, Data Cache, ITLB, DTLB), through the Second Level Cache (SRAM) and eDRAM, to Main Memory (DRAM) and Secondary Memory (Disk), all driven by the processor's control and datapath]
Speed (cycles): ½’s  1’s  10’s  100’s  1,000’s
Size (bytes):  100’s  10K’s  100K’s  G’s  100G’s

¨ By taking advantage of the principle of locality!!
  * Present the user with as much memory as is available in the cheapest technology
  * At the speed offered by the fastest technology
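The principle of locality can be sketched in C (illustrative code, not from the slide): both loops below compute the same sum, but the row-major walk touches consecutive addresses (spatial locality) while the column-major walk strides a full row between accesses and, on a real machine, misses the cache far more often.

```c
/* Locality sketch: same matrix, same sum, very different access
   patterns.  Row-major traversal is stride-1 and cache-friendly;
   column-major traversal is stride-COLS and cache-hostile. */
#define ROWS 512
#define COLS 512
static int m[ROWS][COLS];

void fill(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = i + j;
}

long sum_row_major(void)              /* stride-1 accesses */
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

long sum_col_major(void)              /* stride-COLS accesses */
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Timing the two on real hardware typically shows a large gap, which is exactly why a program should "work on local data whenever possible."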
What Memory Wall, Indeed? ¨ Although still a big problem, the processor/memory speed gap stopped growing around 2002.
  * Growing on-chip cache sizes also mitigate the latency problem
¨ With multicore, it is now the memory bandwidth wall!
From: Sandia National Lab.
Memory bandwidth is constrained by the limited IC pin count and I/O power.
Yet Another Challenge
¨ Hardware is NOT error-free over its lifetime, and this problem is exacerbated by scaling!!!
  * Toyota blamed soft errors for its sudden acceleration problem
  * Burn-in testing is less effective
  * Higher random failure rates
  * Faster wear-out
Engineering Design is about Tradeoff!
[Figure: the design sits at the center of tradeoffs among performance, power, cost, reliability/availability, and security]
¨ This course is about how to achieve a better tradeoff at the architecture level
  * It used to be “Stupid, it’s performance”
  * Power is now often weighted more heavily than performance, especially for battery-powered systems
  * Reliability is becoming a first-class citizen
What Will It Be in the Next 10 Years?
¨ We drop the ball??
  * Core counts and transistor counts stabilize at a certain point
¨ A 100-billion-transistor chip with 1000 cores??
¨ New process technology invented for mainstream adoption??
¨ Domain-specific computing with lots of accelerators??