
Advanced Computer Architecture
Week 1: Introduction

ECE 154B
Dmitri Strukov

Outline

• Course information
• Trends (in technology, cost, performance) and issues

Course organization

• Class website: http://www.ece.ucsb.edu/~strukov/ece154bSpring2017/home.htm

• Instructor office hours: by appointment

• Teaching Assistant: Ms. Nan Wu
  location: Trailer 699, Room 104
  office hours: Wed 2:00 pm – 3:00 pm, Fri 3:00 pm – 4:00 pm
  email: nanwu@umail.ucsb.edu

Textbook

• Computer Architecture: A Quantitative Approach, John L. Hennessy and David A. Patterson, Fifth Edition, Morgan Kaufmann, 2012, ISBN: 978-0-12-383872-8

• Modern Processor Design: Fundamentals of Superscalar Processors, John Paul Shen and Mikko H. Lipasti, Waveland Press, 2013, ISBN: 978-1-47-860783-0

• Digital Design and Computer Architecture, David Harris and Sarah Harris, 2nd Ed., 2012

Class topics and tentative schedule

• Computer fundamentals (historical trends, performance metrics) – 1 week

• Memory hierarchy design – 2 weeks

• Instruction level parallelism (static and dynamic scheduling, speculation) – 2 weeks

• Data level parallelism (vector, SIMD and GPUs) – 2 weeks

• Thread level parallelism (shared‐memory architectures, synchronization and cache coherence) – 2 weeks

• Warehouse-scale computers – 1 week, time permitting

Ultimate goal of the class

• To get intuition behind main techniques for improving performance

• To understand advanced microprocessors such as

‐ ARM Cortex A8
‐ Intel Core i7
‐ Tesla GPU

5-STAGE MIPS PIPELINE

This is what you are supposed to know!

This is what we (hopefully) learn in this class!

Grading

• Projects: 100 % (done in pairs – find lab partner ASAP)

– Verilog design of toy ARM microprocessor – 4 projects total (2 weeks each starting this week)

• 5-stage pipelined MIPS
• Simple cache
• Branch predictor
• Multi-issue + more advanced cache

– Assignment for this/next week:
• Review 5-stage MIPS
• Review Verilog (see Ch 4 from Harris & Harris and labs in ECE154A)

Course prerequisites

• ECE 154A or equivalent

Trends in Computing Technology (with Brief Intro on Chip Economics)

A bit of history: ENIAC ‐ Electronic Numerical Integrator And Computer, 1946

Historical Progress

1946: ENIAC (Electronic Numerical Integrator And Computer)

• Area:
  – 140 m^2 floor area (30 tons)
  – ~17,000 vacuum tubes, 7,200 crystal diodes, 1,500 relays, 70,000 resistors, 10,000 capacitors, 5M hand-soldered joints
• Performance:
  – multiplication of two 10-digit numbers in 2 ms

2011: High-performance microprocessor

• Chip area:
  – 100-400 mm^2 (for multi-core)
• Board area:
  – 200 cm^2; improvement of 10^4
• Performance:
  – 64-bit multiply in a few ns; improvement of 10^6

Crossroads: Conventional Wisdom in Computer Architecture

Credit: SBU, M. Dorojevets

Computer trends: Performance of a (single) processor

The next series of questions is centered around understanding this important graph


Question

• Q1: What is the performance shown on the figure, and how do we define it?
  ‐ A1a: Performance is typically related to how fast a certain task can be executed, i.e. it is the reciprocal of execution time:

    Performance = 1 / ExecTime
    ExecTime = IC * CCT * <CPI>

    where IC is the instruction count, CCT is the clock cycle time, and <CPI> is the average number of cycles per instruction.

    • Wall clock time: includes all system overheads
    • CPU time: only computation time

  ‐ A1b: There are many different performance metrics today because uPs are used in many different applications
  ‐ What kind of metrics?
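As a rough illustration of the two relations in A1a, the following Python sketch evaluates ExecTime = IC * CCT * <CPI> and its reciprocal. The instruction count, average CPI, and clock frequency are made-up values chosen only for illustration, and the function names are hypothetical.

```python
def exec_time(ic, cpi, cct):
    """ExecTime = IC * CCT * <CPI>; in seconds when cct is in seconds."""
    return ic * cct * cpi


def performance(ic, cpi, cct):
    """Performance is the reciprocal of execution time."""
    return 1.0 / exec_time(ic, cpi, cct)


# Assumed example values: 1e9 instructions, average CPI of 1.5, 2 GHz clock.
ic = 1_000_000_000
cpi = 1.5
cct = 1.0 / 2e9  # clock cycle time in seconds

t = exec_time(ic, cpi, cct)
print(f"ExecTime = {t:.3f} s")
print(f"Performance = {performance(ic, cpi, cct):.3f} runs of the task per second")
```

Halving any one of the three factors halves the execution time, which is why the later questions ask which factor each architectural change affects.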


Measuring Performance

• Typical performance metrics:
  – Execution time (or latency)
  – Throughput
    • Q2: How is throughput related to latency?
      – A2: In general these are two different concepts. Throughput can be improved by providing more parallelism, but it can also be improved by reducing latency. For example, with no parallelism throughput is inversely proportional to latency.
  – Energy
    • Q3: Is the energy metric the same as the power consumption metric?
      – A3: Power = energy / time, so in general the two are the same metric only when execution time is the same.

• The typical way to measure performance is to run a benchmark (i.e. a collection of applications representative of what the tested hardware will run):
  – Kernels (e.g. matrix multiply)
  – Toy programs (e.g. sorting)
  – Synthetic benchmarks (e.g. Dhrystone)
  – Benchmark suites (e.g. SPEC06fp, TPC-C)

• Speedup of X relative to Y:
  – Execution time_Y / Execution time_X
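The metrics on this slide can be tried out numerically. The sketch below is only an illustration: the two machines X and Y and their execution times and power draws are invented numbers, not measurements, and the helper names are hypothetical.

```python
def speedup(exec_time_x, exec_time_y):
    """Speedup of X relative to Y = Execution time_Y / Execution time_X."""
    return exec_time_y / exec_time_x


def energy(power_watts, exec_time_s):
    """Energy = power * time, rearranged from Power = energy / time."""
    return power_watts * exec_time_s


# Hypothetical machines running the same task.
time_x, power_x = 2.0, 50.0  # seconds, watts
time_y, power_y = 5.0, 30.0  # seconds, watts

print("Speedup of X relative to Y:", speedup(time_x, time_y))
print("Energy used by X (J):", energy(power_x, time_x))
print("Energy used by Y (J):", energy(power_y, time_y))

# With no parallelism, throughput is the reciprocal of latency.
print("Throughput of X (tasks/s):", 1.0 / time_x)
```

In this made-up case X draws more power than Y but finishes sooner and uses less energy for the task, which is the distinction A3 points at.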

Bandwidth vs. Latency

• Bandwidth or throughput
  – Total work done in a given time
  – 10,000-25,000X improvement for processors
  – 300-1200X improvement for memory and disks

• Latency or response time
  – Time between start and completion of an event
  – 30-80X improvement for processors
  – 6-8X improvement for memory and disks


Questions:

• Reasons behind performance improvement?
• Q4: Why was it improving originally (from ~1978 to ~1984 on the figure)?
  – A4: Moore’s law and the resulting increase in clock frequency

CMOS improvements:
• Transistor density: 4x / 3 yrs
• Die size: 10-25% / yr

Scaling with Feature Size (Dennard scaling)

Let’s 1) scale all the dimensions of the transistors and wires down by a factor of s, and 2) scale the supply voltage V down by a factor of s (together with the threshold voltage Vth).

Then:
• Density: ~ s^2
• Logic gate capacitance Cgate (traditionally the dominating parasitic): ~ 1/s
• Saturation current ION: ~ 1/s
• Gate delay Tgate: ~ Cgate V / ION = 1/s
• Clock frequency: ~ s, i.e. it is inversely proportional to gate delay. Clock cycle time is typically ten or more logic gate delays.

See, e.g., page 124 of Digital Integrated Circuits by Jan Rabaey et al., 2nd edition
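The scaling relations above can be checked by plugging in a value of s. This is a minimal sketch that simply encodes the proportionalities listed on the slide; the function name and the choice s = 1.4 (roughly one technology generation) are assumptions made for illustration.

```python
def dennard_scaling(s):
    """Relative change of each quantity when dimensions and voltages shrink by s."""
    voltage = 1.0 / s
    c_gate = 1.0 / s
    i_on = 1.0 / s
    gate_delay = c_gate * voltage / i_on  # Tgate ~ Cgate * V / ION ~ 1/s
    return {
        "density": s ** 2,
        "voltage": voltage,
        "c_gate": c_gate,
        "i_on": i_on,
        "gate_delay": gate_delay,
        "clock_frequency": 1.0 / gate_delay,  # ~ s
    }


# Assumed scaling factor for one technology generation.
for name, factor in dennard_scaling(1.4).items():
    print(f"{name:16s} x{factor:.2f}")
```

With s = 1.4 the density roughly doubles while the gate delay drops by about 30%, so the clock frequency can rise by the same factor s.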

Frequency Scaling with Feature Size

• If s is the scaling factor, then density scales as s^2

• Voltage V: 1/s

• Logic gate capacitance C (traditionally dominating): ~ 1/s

• Saturation current ION : ~ 1/s

• Gate delay: ~ CV/ION = 1/s


Question:

• Q5: Reasons behind further performance improvement?
• What happened in 1986?
  – A5: CISC to RISC, which enabled additional architectural improvements (see next slide)

Review: Dimensions of ISA
(1) Class of ISA: register-memory vs load-store
(2) Memory addressing: byte addressable
(3) Addressing modes (how operands and memory addresses are specified): register, immediate, displacement, indirect, indexed, absolute
(4) Types and sizes of operands: byte, half-word, word
(5) Operations: data transfer, arithmetic/logical, control, and floating point
(6) Control flow instructions: conditional branches, unconditional jumps, returns
(7) Encoding an ISA: variable versus fixed length


Question:
• Reasons behind performance improvement?
• What happened in 1986?
  – CISC to RISC
  – Q6: How are these terms affected by this move, and in particular what terms in the performance equation are affected by pipelining?
  ‐ A6:

    Design                        Inst count   CPI                                 CCT
    Single cycle (SC)             1            1                                   1
    Multi cycle (MC)              1            N ≥ CPI > 1 (closer to N than 1)
    Multi cycle pipelined (MCP)   1            > 1                                 > 1/N

ExecTime = IC * CCT * <CPI>
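To see how the table entries combine, the sketch below plugs sample numbers into ExecTime = IC * CCT * <CPI>, normalized to the single-cycle design. All numeric values are assumptions chosen to satisfy the inequalities in the table (N = 5, a multi-cycle CPI of 4, a pipelined CPI of 1.2); the multi-cycle clock cycle time is not given in the table and is assumed here to equal the pipelined one.

```python
def exec_time(ic, cpi, cct):
    """ExecTime = IC * CCT * <CPI>, normalized to the single-cycle design."""
    return ic * cct * cpi


N = 5           # assumed number of steps per instruction
CCT = 0.25      # assumed clock cycle time, slightly above the ideal 1/N

designs = {
    "single cycle (SC)": exec_time(ic=1, cpi=1.0, cct=1.0),
    "multi cycle (MC)": exec_time(ic=1, cpi=4.0, cct=CCT),             # CPI close to N
    "multi cycle pipelined (MCP)": exec_time(ic=1, cpi=1.2, cct=CCT),  # CPI slightly above 1
}

for name, t in designs.items():
    print(f"{name:30s} relative ExecTime = {t:.2f}")
```

Under these assumed numbers the multi-cycle design is no faster than single cycle, because the shorter cycle is cancelled by the higher CPI, while the pipelined design keeps the short cycle and brings CPI back toward 1.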

Question:

• Pipelining improves performance (reduces cycles per instruction relative to a multi-cycle processor without pipelining) by overlapping instructions
• One kind of instruction level parallelism (ILP)
• Q7: Problems with improving ILP?
• What are the problems in pipelines?
  – A7: Clock cycle is determined by the slowest component
    » What is typically the slowest component? Memory
  – A7: Data and control hazards (pipeline stalls and flushes)
• Further improvement in ILP?
  – A7: Limited parallelism in ILP

“Memory Wall” problem

• DRAM access (main memory) could take hundreds of cycles
• Memory hierarchy to the rescue, to alleviate the problem
  – Will spend much time later in class reviewing advanced techniques for reducing effective access time to main memory

Bandwidth and Latency

Log-log plot of bandwidth and latency milestones

Performance Milestones
• Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10 Mb/s, 100 Mb/s, 1000 Mb/s, 10000 Mb/s (16x, 1000x)
• Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

CPU high, Memory low (“Memory Wall”)

Bandwidth is much easier to improve – why?


ILP techniques

Summary of Trends in Technology (so far)

• Integrated circuit technology (slowing to a halt)
  – Transistor density: 35%/year
  – Die size: 10-20%/year
  – Integration overall: 40-55%/year
• DRAM capacity: 25-40%/year (slowing to a halt)
• Flash capacity: 50-60%/year (some life with 3D NAND)
  – 15-20X cheaper/bit than DRAM
• Magnetic disk technology: 40%/year (slowing)
  – 15-25X cheaper/bit than Flash
  – 300-500X cheaper/bit than DRAM

Computer Trends: Performance of a (Single) Processor

The area of a high-performance chip has always been on the order of a cm^2. Why?

Question:
• Q8: Why did the die size grow by only 10% / year?
  – Performance of a single processor could be improved by using more hardware (larger cache, more sophisticated branch prediction, etc.)

Drawing a single-crystal Si ingot from the furnace... Then, slice it into wafers and pattern them...

8” MIPS64 R20K wafer (564 dies)

Trends in Cost

• Cost driven down by learning curve
  – Yield
• DRAM: price closely tracks cost
• Microprocessors: price depends on volume
  – 10% less for each doubling of volume

IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost = Wafer cost / (Dies per wafer * Die yield)

Final test yield: fraction of packaged dies which pass the final testing stage

Die yield: fraction of good dies on a wafer

Integrated Circuits Costs

Defects per unit area = 0.016-0.057 defects per square cm (2010)
N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
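The cost relations and the parameters above can be combined into a small calculator. The slides do not show the yield formula itself, so this sketch assumes the model from the course textbook (Hennessy and Patterson): Die yield = Wafer yield / (1 + Defects per unit area * Die area)^N, and Dies per wafer ≈ π (d/2)^2 / Die area − π d / sqrt(2 * Die area). The wafer diameter, wafer cost, and function names are made-up illustration values.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Approximate die count, subtracting partial dies along the wafer edge."""
    radius = wafer_diameter_cm / 2.0
    return (math.pi * radius ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2.0 * die_area_cm2))


def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Assumed textbook model: fraction of good dies falls quickly with die area."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2, n):
    """Die cost = Wafer cost / (Dies per wafer * Die yield)."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * die_yield(defects_per_cm2, die_area_cm2, n))
    return wafer_cost / good_dies


def ic_cost(die, testing, packaging, final_test_yield):
    """IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield."""
    return (die + testing + packaging) / final_test_yield


# Assumed inputs: a 30 cm wafer costing 5000 (arbitrary units); defect density
# and N are taken from the ranges quoted above.
for area in (0.5, 1.0, 2.0, 4.0):
    cost = die_cost(5000.0, 30.0, area, defects_per_cm2=0.057, n=11.5)
    print(f"die area {area:4.1f} cm^2 -> die cost {cost:8.2f}")
```

Under this model die cost grows much faster than die area, because a larger die means both fewer dies per wafer and a lower yield.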

[Figure: die yield / wafer yield versus die area (cm^2), for defects per unit area = 0.016 and 0.057, with N = 11.5]

[Figure: die cost (arbitrary units) versus die area (cm^2)]

Answer to Q8


ASIC vs. uP

[Figure: total cost ($) versus volume, with $1 M NRE (non-recurring engineering cost), for IC cost = $1 and IC cost = $100]

Total cost = NRE/volume + IC cost

Q9:
‐ What is typically denser, ASIC or uP, for the same task? ASIC
‐ What is typically more energy efficient and faster? ASIC
‐ What costs less to produce, ASIC or uP? Depends on volume (see graph above)

Note: this is just an example of NRE cost; it may vary widely. In general, total cost for uP > that of ASIC.
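The total cost relation is easy to explore numerically. The following sketch is only an illustration: the two options compared (a custom chip with $1 M NRE and a $1 IC cost versus an off-the-shelf part with no NRE and a $100 IC cost) are hypothetical, as are the function names.

```python
def cost_per_unit(nre, volume, ic_cost):
    """Total cost (per unit) = NRE / volume + IC cost."""
    return nre / volume + ic_cost


def break_even_volume(nre_a, ic_a, nre_b, ic_b):
    """Volume at which options A and B have the same per-unit cost."""
    return (nre_a - nre_b) / (ic_b - ic_a)


# Hypothetical options: custom chip vs. off-the-shelf part.
for volume in (100, 10_000, 1_000_000):
    custom = cost_per_unit(1_000_000, volume, 1.0)
    shelf = cost_per_unit(0, volume, 100.0)
    print(f"volume {volume:>9,d}: custom ${custom:,.2f}  off-the-shelf ${shelf:,.2f}")

print("break-even volume:", round(break_even_volume(1_000_000, 1.0, 0, 100.0)))
```

Below the break-even volume the NRE term dominates and the part without NRE is cheaper; above it the lower IC cost wins, which is why the answer to the last part of Q9 depends on volume.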

Major Computing Platforms

[Figure: spectrum of platforms ordered from highest density and speed to highest flexibility: Application Specific Integrated Circuit, Field Programmable Gate Array, Microprocessor]

In this class, the focus is on the microprocessors only

The Twilight of Moore’s Law: Economics

Questions:
• Reasons behind performance improvement?
• Q10: What happened after 2002 on the performance figure?
• A10a: Power wall
• A10b: End of ILP
  – Limits to pipelining
  – Limits to superscalar

Power Consumption

• Intel 80386 consumed ~ 2 W

• 3.3 GHz Intel Core i7 consumes 130 W

Problem: Get power in, get power out

Thermal Design Power (TDP)
‐ Characterizes sustained power consumption
‐ Used as a target for the power supply and cooling system
‐ Lower than peak power, higher than average power consumption

Maximum power density for
‐ fan-based cooling: 200 W/cm^2
‐ water-based cooling: 1000 W/cm^2

Typical max temperatures: ~70 C

Water Cooling in a Google Data Center

Heating as a Function of Power

[Figure: heat flux (Q) flowing from the chip at temperature Thigh to the ambient at temperature Tlow through a thermal conductance K]

Fourier's law in 1D is similar to Ohm's law when replacing
‐ thermal conductance with electrical conductance
‐ heat source (total generated power) with current source
‐ temperature with voltage

Thigh = Tlow + Q/K
Vhigh = Vlow + IR

Temperature is roughly (in a 1D lumped model) linearly proportional to Q, the total dissipated power.
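A quick numeric check of the lumped model Thigh = Tlow + Q/K is sketched below. The ambient temperature and the thermal conductance K = 3 W/K are assumed values picked for illustration, not data for any real cooling system; the function names are hypothetical.

```python
def chip_temperature(t_ambient_c, power_w, thermal_conductance_w_per_k):
    """Thigh = Tlow + Q / K in the 1D lumped model."""
    return t_ambient_c + power_w / thermal_conductance_w_per_k


def max_power(t_max_c, t_ambient_c, thermal_conductance_w_per_k):
    """Largest dissipated power Q that keeps the chip at or below t_max_c."""
    return (t_max_c - t_ambient_c) * thermal_conductance_w_per_k


# Assumed values: 25 C ambient and a cooling path with K = 3 W/K.
print(f"Chip temperature at 130 W: {chip_temperature(25.0, 130.0, 3.0):.1f} C")
print(f"Power budget for a 70 C limit: {max_power(70.0, 25.0, 3.0):.0f} W")
```

Because the relation is linear, doubling the dissipated power doubles the temperature rise above ambient unless the cooling path (K) is improved.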

Scaling with Feature Size (Dennard scaling)

Let’s 1) scale all the dimensions of the transistors and wires down by a factor of s, and 2) scale the supply voltage V down by a factor of s (together with the threshold voltage Vth).

Then:
• Density: ~ s^2
• Logic gate capacitance Cgate (traditionally the dominating parasitic): ~ 1/s
• Saturation current ION: ~ 1/s
• Gate delay Tgate: ~ Cgate V / ION = 1/s
• Clock frequency f: ~ s, i.e. it is inversely proportional to gate delay. Clock cycle time is typically ten or more logic gate delays.
• Power (dynamic component only): ~ 1/2 Ctotal * V^2 * f ~ 1. If chip area remains the same, power scales the same way as power density, but (a) f scaled faster than s, and (b) Dennard scaling came to an end.
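The power bullet can be verified by multiplying out the scaling factors for a chip of fixed area. This is a minimal sketch under the slide's assumptions; the function name and the optional arguments used to model cases (a) and (b) are hypothetical.

```python
def dynamic_power_scaling(s, freq_factor=None, voltage_factor=None):
    """Relative dynamic power of a fixed-area chip, P ~ 1/2 * Ctotal * V^2 * f."""
    # Ideal Dennard scaling unless overridden: f ~ s and V ~ 1/s.
    freq_factor = s if freq_factor is None else freq_factor
    voltage_factor = 1.0 / s if voltage_factor is None else voltage_factor
    c_total = (1.0 / s) * s ** 2  # Cgate ~ 1/s per gate, times s^2 more gates
    return c_total * voltage_factor ** 2 * freq_factor


s = 2.0
print("ideal Dennard scaling:     ", dynamic_power_scaling(s))
print("(a) f scaled faster than s:", dynamic_power_scaling(s, freq_factor=3.0))
print("(b) voltage no longer scales:", dynamic_power_scaling(s, voltage_factor=1.0))
```

Under ideal scaling the factors cancel and power stays constant; raising f beyond s or holding V fixed breaks the cancellation, which is the origin of the power wall discussed in A10a.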


Techniques for Reducing Power Consumption

– Do nothing well
  • Low power state for DRAM, disks
  • Energy proportionality concept (don’t consume energy when no work is done): very important for data centers, for which power is a huge portion of running cost
  • Power gating to reduce static component
– Dynamic Voltage-Frequency Scaling
  • Q11: Any benefits for multiprocessors?
    – A11: If the task is easily parallelizable, then running it on p processors in parallel at lower V (say V/p) and slower f (say f/p) can lead to the same execution time but much lower dynamic power: C V^2 f ~ 1/p^3 for each processor, i.e. ~ 1/p^2 summed over all p processors (not accounting for static power)
– Overclocking, turning off cores
  • Race-to-halt
  • Thermal capacitance / turbo mode

Since saturation current ION ~ V^2, f ~ 1/Tgate ≈ ION / (Cgate V) ~ V. Lowering voltage reduces the dynamic power consumption and energy per operation but decreases performance because of increased CCT.
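The argument in A11 can be made concrete with a few lines of arithmetic. The sketch assumes a perfectly parallelizable task, the V/p and f/p scaling suggested on the slide, and no static power; the function names and normalized units are hypothetical.

```python
def dynamic_power(c, v, f):
    """Dynamic power ~ C * V^2 * f (constant factors dropped)."""
    return c * v ** 2 * f


def parallel_dvfs(p, c=1.0, v=1.0, f=1.0):
    """Run the task on p processors, each at voltage v/p and frequency f/p."""
    per_core = dynamic_power(c, v / p, f / p)
    return per_core, p * per_core


baseline = dynamic_power(1.0, 1.0, 1.0)
for p in (1, 2, 4):
    per_core, total = parallel_dvfs(p)
    print(f"p={p}: per-core power {per_core / baseline:.4f}, "
          f"total power {total / baseline:.4f}")
```

Each core does 1/p of the work at 1/p of the speed, so the execution time is unchanged, while per-core power drops as 1/p^3 and the total over all cores as 1/p^2.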

Static vs. dynamic power

Leakage (static power) increases exponentially when lowering V! Cannot be neglected anymore

Static power is permanent; dynamic power is consumed only when switching

Leakage power ~ V^2/Roff
Roff/Ron ~ Exp(V)

The End of Voltage (Dennard) Scaling

Other Problems with Scaling: Transistors and Wires

• Feature size
– Minimum size of transistor or wire in x or y dimension

– 10 microns in 1971 to 0.032 microns in 2011

– Transistor performance scales linearly

– Integration density scales quadratically

– Wire delay does not improve with feature size!

– There is always a need for long wires
• Problem related to Rent's rule (number of pins versus number of gates)

• A10a: Power wall

• A10b: End of ILP
  – Limits to pipelining
  – Limits to superscalar
    » Will discuss it in detail after covering advanced ILP topics

Summary of Trends in uP

What is Next: Current Trends in Architecture

• Cannot continue to leverage instruction-level parallelism (ILP)
  – Single processor performance improvement ended in 2003

• New ways of improving performance:
  – Data-level parallelism (DLP)
  – Thread-level parallelism (TLP)
  – Request-level parallelism (RLP)

• These require explicit restructuring of the application

Transition to Multicore

Dark Silicon

Only some parts of a chip are active at a time

Q12: Specialized cores make sense now in general purpose microprocessor

Qualcomm Zeroth chip

New Applications Appear: Classes of Computers Now

• Personal Mobile Device (PMD)
  – e.g. smart phones, tablet computers
  – Emphasis on energy efficiency and real-time
• Desktop Computing
  – Emphasis on price-performance
• Servers
  – Emphasis on availability, scalability, throughput
• Clusters / Warehouse Scale Computers
  – Used for “Software as a Service (SaaS)”
  – Emphasis on availability and price-performance
  – Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks
• Embedded Computers
  – Emphasis: price

Motivation for Neuromorphic Computing

• Biology beats computers at many emerging tasks
  – Image/audio/signal processing for
    • Robotics
    • Sensor networks

• Human brain simulations are very demanding

Artificial Neural Networks

Complexity
~ 10^11 neurons
~ 10^15 synapses

Connectivity
~ 1 : 10000

100-step rule: neurons fire at a few to several hundred hertz; face recognition in ~100 ms

2-3 mm thick, 2200 cm^2

Google’s Tensor Processing Unit

STATE-OF-THE-ART HARDWARE FOR DEEP LEARNING: CUSTOM DIGITAL CIRCUITS

Movidius’s Fathom

15 inferences /sec @ 16‐bit FP precision for ImageNet@ <2W

Nvidia’s Pascal
21 TFLOPS deep learning performance
