Qiang Xu CUHK, Fall 2013
Part.1 .1
ENGG 5101
Advanced Computer Architecture
Lecture 01 - Introduction XU, Qiang (Johnny) 徐強
Course General Information ¨ Instructor: Qiang Xu
* http://www.cse.cuhk.edu.hk/~qxu * Office hours: 1-3pm, Tuesday
¨ Course Info * http://www.cse.cuhk.edu.hk/engg5101
¨ TA: Zelong Sun * [email protected]
¨ Check student/faculty expectations on teaching and learning (available on course webpage)
Course Objective ¨ Learn the organizational paradigms that determine
the capabilities, performance, power consumption and reliability of computer systems * The what, the how, and more importantly, the why * Processor microarchitecture * Memory hierarchies and cache coherence
¨ Focus on parallel organization and design, e.g., superscalar/VLIW and multiprocessors
¨ Learn how to read and evaluate research papers
Prerequisites
¨ Basic courses in * Digital Design * Hardware Organization/Computer Architecture
References
¨ Reference Books
  * M. Dubois, M. Annavaram, P. Stenstrom, Parallel Computer Organization and Design, Cambridge, 2012.
  * J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Morgan Kaufmann, 2012.
¨ Papers listed on the course webpage
Acknowledgement: Some slides adapted from reference slides of these books
Course Structure and Grading Scheme
¨ Lectures: * 2 weeks review of basic concepts and scalar processor * 2 weeks on advanced single-core processor * 1 week on memory systems * 1 week on power and reliability * 3 weeks on multiprocessor systems * 1 week on future trends
¨ What do you need to do? * homework assignments – 20% * 2 research essays (by group) – 30% * midterm and final exam – 50% * final exam grade must exceed 50 (out of 100) to pass!
This course is NOT for everyone!!!
What’re your Choices for Computing?
[Figure: energy efficiency (in MOPS/mW, ranging from 0.1–1 up to 100–1000) plotted against flexibility (application scope, from none to fully flexible): hardwired custom logic is the most efficient but least flexible, followed by configurable/parameterizable hardware, then domain-specific processors (e.g., GPU, DSP), and finally the fully flexible microprocessor]
Layered View of Computer Systems
Von Neumann Architecture
John von Neumann, “the last of the great mathematicians”
Alan Turing, “Father of Computer Science and AI”
Below the Program
¨ High-level language program (in C)
  swap (int v[], int k)
  {
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  }
¨ Assembly language program (for MIPS)
  swap: sll $2, $5, 2
        add $2, $4, $2
        lw  $15, 0($2)
        lw  $16, 4($2)
        sw  $16, 0($2)
        sw  $15, 4($2)
        jr  $31
¨ Machine (object) code (for MIPS) 000000 00000 00101 0001000010000000 000000 00100 00010 0001000000100000
. . .
[The C compiler translates the high-level program into assembly (one statement to many instructions, one-to-many); the assembler translates assembly into machine code (one-to-one)]
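Restoring the braces that the text extraction mangled, the high-level routine compiles as plain C:

```c
/* swap from the slide: exchange v[k] and v[k+1] */
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}
```

Calling swap(a, 1) on {10, 20, 30} leaves the array as {10, 30, 20}; the MIPS listing is one compiler's translation of exactly this body.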
Input Device Inputs Object Code
[Figure: a processor (control + datapath) connected to memory and to devices (input, output, network); the input device loads the seven machine instructions of swap into the system]
Object Code Stored in Memory
[Figure: the same diagram; the seven-instruction object code now resides in memory]
Processor Fetches an Instruction
[Figure: the same processor/memory/devices diagram]
Processor fetches an instruction from memory
Control Decodes the Instruction
[Figure: the fetched instruction word 000000 00100 00010 0001000000100000 now inside the processor]
Control decodes the instruction to determine what to execute
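As a sketch of what "decode" means here (a minimal illustration assuming the standard MIPS R-type field positions; the helper names are ours, not from the slide), the control logic extracts bit fields from the 32-bit word. For 000000 00100 00010 0001000000100000 (0x00821020) this yields rs = 4, rt = 2, rd = 2, funct = 0x20, i.e. add $2, $4, $2:

```c
#include <stdint.h>

/* MIPS R-type field layout: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
unsigned mips_op(uint32_t w)    { return (w >> 26) & 0x3Fu; }
unsigned mips_rs(uint32_t w)    { return (w >> 21) & 0x1Fu; }
unsigned mips_rt(uint32_t w)    { return (w >> 16) & 0x1Fu; }
unsigned mips_rd(uint32_t w)    { return (w >> 11) & 0x1Fu; }
unsigned mips_shamt(uint32_t w) { return (w >> 6)  & 0x1Fu; }
unsigned mips_funct(uint32_t w) { return w & 0x3Fu; }
```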
Datapath Executes the Instruction
[Figure: for the instruction 000000 00100 00010 0001000000100000, the datapath adds the contents of Reg #4 to the contents of Reg #2 and puts the result in Reg #2]
Datapath executes the instruction as directed by control
What Happens Next?
[Figure: the same diagram, annotated with the repeating Fetch → Decode → Execute cycle]
Processor Fetches the Next Instruction
[Figure: the same processor/memory/devices diagram]
Processor fetches the next instruction from memory
How does it know which location in memory to fetch from next?
Output Data Stored in Memory
[Figure: memory now holds the program's output data words]
At program completion the data to be output resides in memory
Output Device Outputs Data
[Figure: the output device reads the result data out of memory]
What Differentiates Various Computer Architectures?
¨ The conceptual design and fundamental operational structure of a computer system * Instruction set architecture (ISA)
» Programming model of a processor » Instructions, data types, registers, addressing modes, etc. » Not many ISAs survive over the years
* Microarchitecture » How to implement the ISA at a high level » Pipelining, cache, branch prediction, superscalar, out-of-order execution, register renaming, multi-this, multi-that, etc.
Modern PC Architecture
Modern Smartphone Architecture
From: TI website
Generic Parallel Compute Architecture
Moore’s Law for CPUs and DRAMs
From: “Facing the Hot Chips Challenge Again”, Bill Holt, Intel, presented at Hot Chips 17, 2005.
Main driver: device scaling ...
Secondary driver: Wafer size
Intel Core i7 Processor
45 nm technology, 18.9 mm × 13.6 mm, 0.73 billion transistors, 2008
Intel Core i7 Processor
Highest Clock Rate of Intel Processors
» Due to process improvements » Deeper pipeline » Circuit design techniques
What if the exponential increase had kept up? Why not?
What will happen??
Power Density (if Increasing Clock Rate Exponentially as Before)
[Figure: power density (W/cm²) vs. year for Intel processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 to the Pentium processor and P6, extrapolated from 1970 to 2010 on a log scale from 1 to 10,000 W/cm², crossing the levels of a hot plate, a nuclear reactor, and a rocket nozzle. Courtesy, Intel]
Power density too high to keep junctions at low temp
POWER is the King Now!
¨ Total power = dynamic power + static power
  P_dynamic = α · C · V² · f
  P_static = V · I_sub ≈ V · e^(−K·Vt/T)
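The dynamic term is easy to probe numerically; a minimal sketch (the function name and sample values below are illustrative assumptions, not figures from the slide):

```c
/* Dynamic power P = alpha * C * V^2 * f, with activity factor alpha,
   switched capacitance C (farads), supply voltage V (volts),
   and clock frequency f (Hz). */
double p_dynamic(double alpha, double c, double v, double f)
{
    return alpha * c * v * v * f;
}
```

With alpha = 0.2, C = 1 nF, V = 1 V, f = 3 GHz this gives 0.6 W; halving both V and f cuts the power by 8×, which is why voltage/frequency scaling is the main lever on the dynamic term.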
Hitting “Power Wall” - Go for Multi-Core
P. Gargini Intel Developer’s Forum 2005
Parallel Computing for Higher Performance
¨ Classes of parallelism: * Instruction-Level Parallelism (ILP)
» Pipelining, Superscalar, VLIW, EPIC * Data-Level Parallelism (DLP)
» Vector architectures, GPU, SIMD extension for multimedia * Thread-Level Parallelism
» SMT, Multiprocessor * Request-Level Parallelism
» Warehouse-scale computing
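Thread-level parallelism can be sketched with POSIX threads (an illustrative array-sum workload with helper names of our own, not an example from the slides): two workers each sum half of an array, and joining them is the synchronization point before combining the partial results.

```c
#include <pthread.h>

/* Thread-level parallelism sketch: two POSIX threads each sum half
   of a shared array; main combines the two partial sums. */
#define N 1000
static int data[N];

struct range { int lo, hi; long sum; };

static void *partial_sum(void *arg)
{
    struct range *r = arg;
    r->sum = 0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

long parallel_sum(void)
{
    struct range a = {0, N / 2, 0}, b = {N / 2, N, 0};
    pthread_t t1, t2;
    for (int i = 0; i < N; i++)
        data[i] = i;                      /* fill with 0..N-1 */
    pthread_create(&t1, NULL, partial_sum, &a);
    pthread_create(&t2, NULL, partial_sum, &b);
    pthread_join(t1, NULL);               /* wait for both workers */
    pthread_join(t2, NULL);
    return a.sum + b.sum;                 /* combine partial sums */
}
```

Even this toy shows the overheads the course will discuss: thread creation, the join synchronization, and the serial combining step.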
Amdahl’s Law
¨ Lessons learned * Focus on the common case in design!! * The law of diminishing returns
¨ In practice, super-linear speedup is observed in some rare cases; how is that possible?
[Diagram: without E, a fraction F of the execution time can be enhanced and 1−F cannot; with E applied, the F portion shrinks to F/S]
Speedup = 1 / ((1 − F) + F/S) < 1 / (1 − F)
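The formula is easy to probe numerically; a minimal sketch (the function name and sample numbers are ours):

```c
/* Amdahl's law: overall speedup when a fraction F of the execution
   time is sped up by factor S; the untouched (1 - F) part bounds
   the achievable gain at 1 / (1 - F). */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With F = 0.9, S = 10 yields only about 5.26×, and no S, however large, can push past 1/(1 − F) = 10×: the law of diminishing returns.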
Gustafson’s Law ¨ When more cores are available, the workloads also grow
  * Start with the execution time on the parallel machine with P processors: TP = s + p
    » s is the time taken by the serial code and p the time taken by the parallel code
  * The execution time on a single-core processor would then be: T1 = s + pP
  * Let F = p/(s+p). Then SP = T1/TP = (s + pP)/(s + p) = 1 − F + FP = 1 + F(P − 1)
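The closed form can be checked directly (a sketch; the function name is ours):

```c
/* Gustafson's law: scaled speedup S_P = 1 + F*(P - 1), where
   F = p/(s + p) is the parallel fraction of the P-processor run. */
double gustafson_speedup(double f, int p)
{
    return 1.0 + f * (p - 1);
}
```

With s = 1 and p = 9 (so F = 0.9) on P = 100 processors, SP = (1 + 9·100)/10 = 90.1 = 1 + 0.9·99; unlike Amdahl's fixed-workload bound, the speedup keeps growing with P because the problem grows too.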
Challenges in Parallel Computing
¨ Parallel computing has existed for decades, but it has been mainstream (even in your phone) for only a few years. Why?
  * The design of parallel architectures is difficult, but it is not a road blocker
  * Parallel programming is hard!!!
    » The shift to multicore would not have happened if there were alternatives for performance improvement that kept the programming model unchanged
Why is Parallel Programming Hard?
¨ Programmers need to find the parallelism in the algorithm
  * The good news is that emerging workloads usually have large data-level parallelism
¨ Programmers need to manage parallel overheads (e.g., communication and synchronization)
¨ Programmers often need to deal with the memory system explicitly
  * To run efficiently, a program should work on local data whenever possible
¨ Some of these difficulties may be hidden in libraries, compilers and high-level languages, but there is a long way to go
Memory Systems
¨ Growing gap between processor and memory speed, the so-called “Memory Wall”!
¨ One wants a memory system that is big, fast and cheap at the same time, how?
[Figure: processor vs. DRAM performance over time; DRAM improves at only a 1.07 compound growth rate]
Memory wall = memory_cycle / processor_cycle
  * In 1990 it was about 4 (25 MHz, 150 ns); it grew exponentially to about 200 by 2002 and has tapered off since then
Memory Hierarchy

[Figure: the memory hierarchy, from on-chip components (RegFile, Instr Cache, Data Cache, ITLB, DTLB), through the Second Level Cache (SRAM) and eDRAM, to Main Memory (DRAM) and Secondary Memory (Disk), all driven by the processor's control and datapath]
Speed (cycles): ½’s  1’s  10’s  100’s  1,000’s
Size (bytes):  100’s  10K’s  100K’s  G’s  100G’s

¨ By taking advantage of the principle of locality!!
  * Present the user with as much memory as is available in the cheapest technology
  * At the speed offered by the fastest technology
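The principle of locality can be sketched in C (illustrative code, not from the slide): both loops below compute the same sum, but the row-major walk touches consecutive addresses (spatial locality) while the column-major walk strides a full row between accesses and, on a real machine, misses the cache far more often.

```c
/* Locality sketch: same matrix, same sum, very different access
   patterns.  Row-major traversal is stride-1 and cache-friendly;
   column-major traversal is stride-COLS and cache-hostile. */
#define ROWS 512
#define COLS 512
static int m[ROWS][COLS];

void fill(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = i + j;
}

long sum_row_major(void)              /* stride-1 accesses */
{
    long s = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

long sum_col_major(void)              /* stride-COLS accesses */
{
    long s = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}
```

Timing the two on real hardware typically shows a large gap, which is exactly why a program should "work on local data whenever possible."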
What Memory Wall, Indeed? ¨ Although still a big problem, the processor/memory speed gap stopped growing around 2002.
  * Growing on-chip cache sizes also mitigate the latency problem
¨ With multicore, it is now the memory bandwidth wall!
From: Sandia National Lab.
Memory bandwidth is constrained by the limited IC pin count and I/O power.
Yet Another Challenge
¨ Hardware is NOT error-free over its lifetime, and this problem is exacerbated by scaling!!!
  * Toyota blamed soft errors for its sudden acceleration problem
  * Burn-in testing is less effective
  * Higher random failure rates
  * Faster wear-out
Engineering Design is about Tradeoff!
[Figure: the design sits at the center of tradeoffs among performance, power, cost, reliability/availability, and security]
¨ This course is about how to achieve a better tradeoff at the architecture level
  * It used to be “Stupid, it’s performance”
  * Power is now often weighted more heavily than performance, especially for battery-powered systems
  * Reliability is becoming a first-class citizen
What Will It Be in the Next 10 Years?
¨ We drop the ball??
  * Core counts and transistor counts stabilize at a certain point
¨ A 100-billion-transistor chip with 1000 cores??
¨ New process technology invented for mainstream adoption??
¨ Domain-specific computing with lots of accelerators??