
Low-Power Processors & New Memory Technologies

Christina Delimitrou

http://cs316.stanford.edu

CS316 – Fall 2014 – Lecture 15

2

Announcements  

• Reading
  • Lecture notes + papers

• Reminders
  • HW2 (due today)
  • Project (progress report due on Wednesday)

• Exam: Thursday 11/20, Lathrop 299, 3pm-6pm
  • Covers all lectures and required reading until Monday 11/17
  • Let us know early about an alternate exam slot (+/- 1 day of the exam)

3

Low  Power  Superscalar  Cores  

• Intel Atom
• AMD Bobcat
• ARM Cortex-A9
• …

4

Intel  Atom  

• A 2-way issue, in-order x86 processor
• Allows for chips with 0.6W consumption @ 800MHz

5

Atom Design Decisions

• 2-way threaded for utilization/latency reasons
• In-order pipeline with 16 stages
  • Got rid of scheduling and reordering logic
  • Somewhat long pipeline to accommodate threads
• Simpler front-end
  • Avoid breaking up x86 ops into many micro-ops
• Few functional units to avoid waste
• Loop cache: avoid fetching/decoding small loops
• Large cache to avoid misses
  • Cache designed to reduce leakage

6

Are  Caches  Power  Efficient?  

• Evidence against
  • 40% of chip-level power goes to LLC + DRAM

• Evidence for?

[Sodani, 2011]

7

Are  Caches  Power  Efficient?  

• If there is locality, caches can save power
• Anything hidden from this figure?

[Sodani, 2011]

8

Discussion  

• How would you improve the power consumption of an LLC?

9

Atom  Processor  V2  

• 2-way OOO processor
• Larger predictors, improved loop buffer, late allocation/early resource reclamation, dataless ROB, shared L2 cache, wider SIMD, …

10

Ideas  for  Power  Efficient  OOO  

• Avoid copying data
  • Use pointers (e.g., mapping tables)
• Avoid associative structures
  • E.g., associative search in the ROB or instruction window
• Optimize for the common case
  • E.g., instructions with 1 register input + 1 constant
• Partitionable resources that can be turned off
  • Clustered architectures
  • E.g., Atom's scheduler
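To make the first idea concrete, here is a minimal Python sketch (my own illustration, not any particular core's implementation) of a register rename/mapping table: values live once in a physical register file, and the mapping table holds pointers to them, so renaming a destination or eliminating a move copies pointers rather than data.

    class RenameTable:
        """Toy rename/mapping table: architectural registers point at physical
        registers, so rename and move elimination copy pointers, not values."""
        def __init__(self, num_arch, num_phys):
            self.map = {f"r{i}": i for i in range(num_arch)}   # arch reg -> phys reg
            self.prf = [0] * num_phys                          # physical register file
            self.free = list(range(num_arch, num_phys))        # free physical registers

        def rename_dest(self, arch_reg):
            # Allocate a fresh physical register for the destination; older
            # in-flight readers keep their old mapping (no value is copied).
            # A real core would also free the old register at retirement.
            new_phys = self.free.pop()
            self.map[arch_reg] = new_phys
            return new_phys

        def read(self, arch_reg):
            return self.prf[self.map[arch_reg]]

        def move(self, dst, src):
            # "mov dst, src" eliminated by a pointer copy instead of a data copy
            self.map[dst] = self.map[src]

    rt = RenameTable(num_arch=16, num_phys=32)
    p = rt.rename_dest("r1")
    rt.prf[p] = 42        # writeback
    rt.move("r2", "r1")   # no data movement
    print(rt.read("r2"))  # 42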

11

Discussion

• How would you design an instruction window without associative search?

• How can you save power when the processor is running a low-ILP program?

12

How Do We Select a Design Point for Low-Power OOO?

• Multiple designs seem efficient
  • Which one should we use?

• They operate at different performance/energy points in a very large design space

• So it all depends on your performance/energy constraint!

13

Exploring  the  Design  Space  

[Figure: energy/performance design space comparing 1-issue in-order, 2-issue in-order, 2-issue OOO, and 4-issue OOO cores; annotations mark the optimal macro-architecture and note that 1-issue out-of-order is never efficient]

[Azizi, 2010]

14

Design  Space  +  Voltage  Scaling  

[Figure: design space with voltage scaling; 2-issue OOO and 2-issue in-order designs shown]

• With voltage scaling, two architectures dominate the efficiency frontier

15

What Changes with Multi-core?

• With multiple cores per chip and parallel programs, can we just use the simplest core and rely on parallelism?

• The problems
  • Sequential workloads (need a better core)
  • Amdahl's law for parallel workloads

• We still need a capable core for ILP
  • And it should be energy efficient

16

Why Do We Still Care About ILP?

• Mark Hill's argument based on Amdahl's Law
  • www.cs.wisc.edu/multifacet/papers/hpca08_keynote_amdahl.ppt

• Assume a resource-limited multi-core
  • N base core equivalents (BCEs) due to area or power constraints
  • A 1-BCE core gives performance 1
  • An R-BCE core gives performance perf(R)
  • Assume perf(R) = sqrt(R) in the following drawings

• How should we design the multi-core?
  • Select the type and number of cores
  • Assumption: caches, interconnect, etc. are roughly constant
  • Assumption: no application scaling (or equal scaling for sequential/parallel portions)

17

The  3  CMP  Design  Approaches  

                   Large cores (R BCEs/core)                 Simple cores (1 BCE/core)
                   Number   Performance                      Number   Performance
Symmetric CMP      N/R      Seq: Perf(R); Par: (N/R)*Perf(R)    -        -
Asymmetric CMP     1        Seq: Perf(R); Par: Perf(R)          N-R      Seq: 1; Par: (N-R)*1
Dynamic CMP        1        Seq: Perf(R); Par: -                N        Seq: -; Par: N

18

Amdahl's Law x3

• Symmetric CMP:
  Speedup_symmetric(F, N, R) = 1 / [ (1-F)/Perf(R) + F*R / (Perf(R)*N) ]

• Asymmetric CMP:
  Speedup_asymmetric(F, N, R) = 1 / [ (1-F)/Perf(R) + F / (Perf(R) + N - R) ]

• Dynamic CMP:
  Speedup_dynamic(F, N, R) = 1 / [ (1-F)/Perf(R) + F / N ]
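As a quick sanity check, a minimal Python sketch of these three formulas under the slides' assumptions (N = 256 BCEs, perf(R) = sqrt(R)) reproduces the speedup numbers quoted on the next three slides:

    import math

    def perf(r):
        # Core performance model assumed on the slides: perf(R) = sqrt(R)
        return math.sqrt(r)

    def symmetric_speedup(f, n, r):
        # N/R cores of R BCEs each
        return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

    def asymmetric_speedup(f, n, r):
        # One R-BCE core plus (N - R) 1-BCE cores
        return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

    def dynamic_speedup(f, n, r):
        # Sequential phase on an R-BCE core, parallel phase on all N BCEs
        return 1.0 / ((1 - f) / perf(r) + f / n)

    N = 256
    print(f"{symmetric_speedup(0.9, N, 28):.1f}")    # 26.7 (symmetric slide)
    print(f"{asymmetric_speedup(0.99, N, 41):.0f}")  # 166  (asymmetric slide)
    print(f"{dynamic_speedup(0.99, N, 256):.0f}")    # 223  (dynamic slide)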

19

Conclusions for Symmetric Multi-core Chip (N = 256 BCEs)

• As Moore's Law increases N, we often need enhanced core designs
• Some researchers should still target single-core performance

[Figure: symmetric speedup vs. R BCEs/core (1 to 256) for F = 0.5, 0.9, 0.975, 0.99, 0.999]
  • F = 0.9: R = 28 (vs. 2), Cores = 9 (vs. 8), Speedup = 26.7 (vs. 6.7) → core enhancements!
  • F → 1: R = 1 (vs. 1), Cores = 256 (vs. 16), Speedup = 204 (vs. 16) → more cores!
  • F = 0.99: R = 3 (vs. 1), Cores = 85 (vs. 16), Speedup = 80 (vs. 13.9) → core enhancements & more cores!

20

Asymmetric Multicore Chip (N = 256 BCEs)

• Asymmetric offers greater speedup potential than Symmetric
• Implication: we need some ILP core designs

[Figure: asymmetric speedup vs. R BCEs for F = 0.5, 0.9, 0.975, 0.99, 0.999]
  • F = 0.9: R = 118 (vs. 28), Cores = 139 (vs. 9), Speedup = 65.6 (vs. 26.7)
  • F = 0.99: R = 41 (vs. 3), Cores = 216 (vs. 85), Speedup = 166 (vs. 80)

21

Dynamic Multicore Chip (N = 256 BCEs)

• Dynamic offers greater speedup potential than Asymmetric (but it's not easy to be a jack of all trades)
• Implication: we need some ILP core designs

[Figure: dynamic speedup vs. R BCEs for F = 0.5, 0.9, 0.975, 0.99, 0.999; note: #Cores always N = 256]
  • F = 0.99: R = 256 (vs. 41), Cores = 256 (vs. 216), Speedup = 223 (vs. 166)

22

Discussion

• How do we reduce energy/power even further?
  • Remember, we are missing a factor of 2x per generation
  • Difficult to achieve it by tweaking OOO parameters

• Methodology?

• Known alternatives to general-purpose processors?
  • What are their pros and cons?

23

Custom  Chips  (ASICs)  

• Non-programmable chips for a specific task
  • 2-3 orders of magnitude more energy efficient
  • E.g., video encoding chips are 500x more energy efficient than multicores with high-end or low-end cores

• If we want similar efficiency, we have to use “ASIC techniques”

• Cons?

ASIC vs 4-core chip for H.264 encoding tasks

24

How  About  Memory?  

• Energy analysis for speech recognition before/after specialization
  • What other domains do you expect to be memory limited?
  • How do we reduce memory energy?

• Tradeoffs and issues?

[Figure: energy breakdown before/after specialization – Processor 18% / Memory 82% in one case; Processor 68% / Memory 32% in the other]

25

New  Memory  Technologies  

26

New Memory Technologies

• Density
  • How well are we using the area?
• Latency
  • How fast is each memory access?
• Bandwidth
  • How much data can we read at each point in time?
• Energy
  • How much energy do memory accesses require?
• Cost
  • How expensive is it to buy/maintain/manage?

27

Why Not Just DRAM?

• Advantages:
  • Prevalent – almost every system uses it
  • Fast(er) than NVM (~60ns reads)
  • High write bandwidth (1000MB/s)
  • Structural simplicity (1 transistor + capacitor per bit)
  • Moderately dense
  • Endurance (infinite)

• Disadvantages:
  • Expensive
  • Not that fast (latency is not improving much)
  • Retention (needs refresh – every ~64ms per row)
  • Volatile (loses data on power-down)
  • High energy overhead

28

Why Not Just DRAM?

• Capacity doubles every two years (Moore's Law) BUT latency changes little
• Can improve latency (and power) by building smaller blocks → hurts density & cost
• Can improve latency (and power) by being clever about access scheduling & mapping data to rows → increases hit rate, but also increases complexity
• Will soon hit a density wall → need alternative technologies
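One concrete instance of the "clever access scheduling" point above is an FR-FCFS-style policy that favors requests hitting the currently open row. A toy Python sketch (my illustration, not a specific controller's algorithm):

    def pick_next_request(queue, open_row_of_bank):
        """Serve a row-buffer hit first if one exists, else the oldest request.
        `queue` is ordered oldest -> newest; `open_row_of_bank` maps bank -> open row."""
        for req in queue:
            if open_row_of_bank.get(req["bank"]) == req["row"]:
                return req                       # row hit: no activate/precharge needed
        return queue[0] if queue else None       # fall back to FCFS

    q = [{"bank": 0, "row": 7}, {"bank": 1, "row": 3}, {"bank": 0, "row": 2}]
    print(pick_next_request(q, {0: 2, 1: 9}))    # {'bank': 0, 'row': 2} -- a row hit

The trade-off noted on the slide shows up directly: the hit-first policy saves row activations (latency and power) but needs extra comparison logic and can starve old requests without an aging rule.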

29

Alternative Technologies

• Flash
• PCM
• STT-RAM
• FRAM (or FeRAM) – Ferroelectric RAM
• MRAM (Magnetoresistive RAM)
• Memristors
• …

30

Flash

• Non-volatile memory
  • Does not lose data on power-down
  • Lower power

• Two main types: NAND and NOR flash
  • NAND: block-addressable; main memory, cards, USB flash drives, etc.
  • NOR: byte-addressable; replacement for EPROM

• Each flash cell stores one (SLC) or more (MLC) bits of information
  • Works by modulating (via the control gate) the electrons stored in the gate of the MOSFET (the floating gate)

31

Flash  

• Fairly dense, but near-disk write latency

                 DRAM            NAND Flash    NOR Flash
Density          1               4             0.25
Read Latency     60ns            25,000ns      300ns
Bandwidth        1000MB/s        2.4MB/s       0.5MB/s
Endurance        Eff. infinite   10^4          10^4
Retention        Refresh         10 years      10 years

32

Phase  Change  Memory  (PCM)  

• Bit recorded in a "phase change material"
  • SET to 1 by heating to the crystallization point
  • RESET to 0 by heating to the melting point
  • Resistance indicates state
  • State change is reversible

33

Phase Change Memory

• Density
  • 4x increase over DRAM
• Latency
  • 4x increase over DRAM
• Energy
  • No leakage
  • Reads are worse (2x), writes much worse (40x)
• Wear out
  • Limited number of writes (but better than Flash)
• Non-volatile
  • Data persists in memory
  • Does not require a separate erase step like Flash

34

Phase  Change  Memory  

                 DRAM            NAND Flash    NOR Flash    PCM
Density          1               4             0.25         2-4
Read Latency     60ns            25,000ns      300ns        200-300ns
Bandwidth        1000MB/s        2.4MB/s       0.5MB/s      100MB/s
Endurance        Eff. infinite   10^4          10^4         10^6 to 10^8
Retention        Refresh         10 years      10 years     10 years

35

Phase  Change  Memory  

• Main problems (compared to DRAM)?

36

Solutions to wearing & energy

• Two main techniques?

37

Solutions to wearing & energy

• Partial writes = write only the bits that have changed
  • Caches keep track of written bytes/words per cacheline (Lee et al.)
    • storage overhead vs. accuracy
  • When writing a row to memory, first read the old row and compare → write only the modified bits (Zhou et al.)

Writes cause thermal expansion/contraction that wears the material and requires a strong current. But contrary to DRAM, PCM does not leak energy.

Most written bits are redundant!
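A minimal sketch of the data-comparison idea (illustrative code only, not the hardware proposed by Lee et al. or Zhou et al.): read the old row, XOR it against the new data, and count/program only the bits that actually changed.

    def partial_write(old_line: bytes, new_line: bytes):
        """Return the number of changed bits (a proxy for write energy and wear)
        and a per-byte mask of the bits that would actually be programmed."""
        assert len(old_line) == len(new_line)
        flipped = 0
        masks = []
        for old_b, new_b in zip(old_line, new_line):
            diff = old_b ^ new_b              # bits that differ
            flipped += bin(diff).count("1")
            masks.append(diff)                # hardware would program only these bits
        return flipped, masks

    # Example: a 64B line where only a 4B word changed
    old = bytes(64)
    new = bytearray(old)
    new[8:12] = b"\x01\x02\x03\x04"
    bits, _ = partial_write(old, bytes(new))
    print(bits)   # far fewer than the 512 bit-writes of rewriting the full line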

38

Solutions to wearing & energy (cont.)

• Buffer organization (Lee et al.)
  • DRAM uses one row buffer (2048B)
  • Use multiple narrow buffers (up to 32 x 64B), each with its own association
  • Capture coalescing writes: spatial locality (temporal locality is captured by the LLC)
  • 4 x 512B found most effective
  • Same area as DRAM's buffers
  • Hides long PCM latency

• Small DRAM buffer for PCM (Qureshi et al.)
  • Combine the low latency of DRAM with the high capacity of PCM
  • Similar to using a Flash cache for disk
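A minimal sketch of the second idea (my own simplification, not the actual design from Qureshi et al.): a small DRAM buffer caches lines in front of a large PCM, serving hits at DRAM-like latency and writing dirty lines back to PCM only on eviction. The latency constants are the rough numbers from the earlier comparison table.

    from collections import OrderedDict

    class DramBufferedPcm:
        """Toy LRU DRAM buffer in front of PCM main memory."""
        DRAM_LAT, PCM_LAT = 60, 250            # ns, illustrative values

        def __init__(self, capacity_lines):
            self.capacity = capacity_lines
            self.buffer = OrderedDict()        # line address -> dirty bit (LRU order)
            self.pcm_writes = 0                # PCM writes = dirty evictions only

        def access(self, line_addr, is_write=False):
            if line_addr in self.buffer:       # DRAM hit
                self.buffer.move_to_end(line_addr)
                self.buffer[line_addr] |= is_write
                return self.DRAM_LAT
            # Miss: fetch from PCM, possibly evicting a dirty line back to PCM
            if len(self.buffer) >= self.capacity:
                _, dirty = self.buffer.popitem(last=False)
                self.pcm_writes += dirty
            self.buffer[line_addr] = is_write
            return self.PCM_LAT

The same structure models the Flash-cache-for-disk analogy on the slide: only the two latency constants and the write-back target change.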

39

PCM as On-chip Cache

• Hybrid on-chip cache architecture consisting of multiple memory technologies
  • PCM, SRAM, embedded DRAM (eDRAM), and Magnetic RAM (MRAM)

• PCM is slow compared to SRAM, etc.
  • But high density, non-volatility, etc. help

• Use as a complement to faster memory technologies
  • As a "slow" L2 cache, as an L3 cache, etc.

40

STT-RAM

• STT-RAM: Spin-Transfer Torque RAM
  • Non-volatile technology
  • Operation: change the orientation of a magnetic layer in a magnetic tunnel junction (or spin valve)
  • Essentially creates a spin-polarized current by passing an electric current through a thin magnetic material (fixed layer) → the current is directed to a second thin magnetic material (free layer) to change its orientation
  • Needs lower current than traditional MRAM → higher densities

41

STT-RAM

• Advantages:
  • Higher density than RAM (lower current needed)
  • Non-volatile (can replace SRAM for processor caches)
  • Low leakage → low static power consumption
  • High endurance
  • Good performance (reads)

• Disadvantages:
  • High dynamic energy
  • Slow write latencies
  • Lower endurance compared to RAM

42

Comparison of Alternative Technologies

43

STT-RAM Optimizations

• Problems with STT-RAM?

44

STT-RAM Optimizations

• Ideal STT-RAM
  • What did we sacrifice?

45

STT-RAM Optimizations

• Reduced retention time STT-RAM
  • Reduce the area of the free layer of the magnetic tunnel junction (MTJ), the storage element of the STT-RAM cell → reduces the energy needed to write to the cell

• Is sacrificing retention a good idea?
  • How much should we sacrifice?

• How can they scale from small structures to large structures?

• Do we need any new operations?

46

Memristors

• The fourth circuit element: inductor, resistor, capacitor + memristor (a non-linear passive two-terminal component)

• A memristor's resistance depends on how much current has passed through it in the past (it has memory) → it remembers its most recent resistance until it is turned on again
  • Much higher density than current NVM
  • Similar access times to DRAM
  • Could theoretically replace both

• March 2012 → first functioning memristor array on a CMOS chip

• Commercial availability → ~2018

47

Alternative Memory Systems

• Not necessarily changing the memory technology (we can still use DRAM), but changing the way the memory system is designed and managed
  • Reduce overfetch → reduce power by being clever about how much data is read (read fewer chips, or only parts of a row)
  • Build hybrid/heterogeneous memory systems
  • Near-Data Processing (NDP)
  • 3D-stacked RAM
  • …

48

Near  Data  Processing  

• Near Data Processing (NDP):
  • Also known as Processing in Memory (PIM)
  • Add some logic to the memory system → reduce data movement → reduce the energy and latency of memory accesses

• Early commercial solution: HMC (Hybrid Memory Cube): 3D-stacked memory with some logic

• Trade-offs:
  • How much logic? Only NDP? Problems with that?
  • How to partition the application?
  • How to communicate between cores?
  • Specialization or not?

49

Solutions to wearing & energy

• Wear leveling (Zhou et al.)
  • Row shifting: even out writes among cells in a row
    • needs extra hardware
  • Segment swapping: even out writes between pages
    • implemented in the memory controller

Spatial locality is now a problem!
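A toy sketch of the row-shifting idea (parameter names and the shift period here are illustrative, not Zhou et al.'s actual values): after a fixed number of writes to a row, its contents are rotated by one byte, so a hot logical offset is spread across all physical cells of the row over time. Segment swapping applies the same remapping idea at page granularity in the memory controller.

    def shifted_cell(row_offset, row_size, writes_to_row, shift_period=256):
        """Map a logical byte offset within a row to the physical cell it
        currently occupies, given how many writes the row has absorbed."""
        shift = (writes_to_row // shift_period) % row_size   # rotate 1 byte per period
        return (row_offset + shift) % row_size

    # The same hot logical byte visits every physical cell of a 2048B row:
    cells = {shifted_cell(0, 2048, w) for w in range(0, 2048 * 256, 256)}
    print(len(cells))   # 2048 distinct cells -> wear is spread across the row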