CSCE 6610: Advanced Computer Architecture
Project presentations on Tuesday, Dec 6. I also need your project reports on Dec 6.
Take-home final exam assigned on Dec 6; due on Dec 13.
Since it is a take-home, I expect comprehensive and detailed discussions.
Nov. 29, 2016
Last week we discussed Decoupled Software Pipelining to use multiple cores for a single thread.
Today, I want to provide details regarding non-volatile memories and DRAM optimizations.
Non-volatile memories
Flash memories: two types, NAND and NOR
Based on standard silicon gates. In floating-gate cells, the stored charge is isolated by an insulating layer, so the charge is not lost when power is removed; hence non-volatile.
The default state of a cell is "1". To store a zero, the charge must be reversed; a high voltage is applied for this purpose.
Erasing is done at block granularity, so data is not modified in place: a new block must be found for the new data, and the old block is marked for block erase.
Limited write endurance. Long latencies on reads, and even longer on writes.
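Because erases happen at block granularity, the controller always writes out of place and remaps. A minimal sketch of that remapping idea in C (all names and structures are hypothetical, not any real FTL's API):

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS  1024
#define BLOCK_BYTES 4096

static uint8_t flash[NUM_BLOCKS][BLOCK_BYTES]; /* simulated flash array  */
static int     l2p[NUM_BLOCKS];                /* logical->physical map  */
static int     dirty[NUM_BLOCKS];              /* marked for block erase */

static void ftl_init(void) {
    for (int i = 0; i < NUM_BLOCKS; i++) l2p[i] = -1;  /* nothing mapped */
}

/* Find a physical block that is neither mapped nor awaiting erase. */
static int find_free_block(void) {
    int used[NUM_BLOCKS] = {0};
    for (int l = 0; l < NUM_BLOCKS; l++)
        if (l2p[l] >= 0) used[l2p[l]] = 1;
    for (int p = 0; p < NUM_BLOCKS; p++)
        if (!used[p] && !dirty[p]) return p;
    return -1;  /* a real controller would trigger garbage collection */
}

/* Out-of-place update: write the new data to a fresh block, remap the
 * logical block, and mark the old physical block for erase. */
static int flash_write(int logical, const uint8_t *data) {
    int fresh = find_free_block();
    if (fresh < 0) return -1;
    memcpy(flash[fresh], data, BLOCK_BYTES);
    if (l2p[logical] >= 0) dirty[l2p[logical]] = 1;
    l2p[logical] = fresh;
    return 0;
}
```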
Techniques to extend the life of flash memories: internal cache memory, with SRAM as a buffer, so
flash cells are not modified on every write operation.
Use error-correcting codes to overcome some errors. Keep block wear uniform (wear leveling).
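Uniform block wear can be approximated by steering every new write to the least-erased free block. A hedged sketch, with counters that a real controller would keep in flash metadata:

```c
#include <stdint.h>

#define NUM_BLOCKS 1024

/* Illustrative per-block erase counters and free flags. */
static uint32_t erase_count[NUM_BLOCKS];
static int      is_free[NUM_BLOCKS];

/* Wear leveling: among free blocks, pick the one erased least often,
 * so wear spreads uniformly instead of hammering a few hot blocks. */
int pick_block_for_write(void) {
    int best = -1;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (is_free[b] && (best < 0 || erase_count[b] < erase_count[best]))
            best = b;
    }
    return best;  /* -1 if no free block is available */
}
```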
Other non-volatile memories
Phase change memory: a material that is in one of two states, crystalline or amorphous. The state is changed by heating the cell; the phase is detected via the difference in electrical resistance.
Single-level: one bit per cell. Multi-level: store multiple bits (four states between crystalline and amorphous encode 2 bits).
It takes longer to set a cell than to reset it, so one idea is to "preset" a block so that only a few bits need to be reset.
As soon as data is supplied from PCM to the cache, the block is preset; when the data is written back to PCM, only a few bits need to be reset.
Other ideas: store either the true or the complement value, depending on how many 1's are in the block.
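This true-or-complement trick (Flip-N-Write-style encoding) is easy to sketch: with a block preset to all ones, each stored 0 costs a reset, so if the data has more zeros than ones we store its complement plus a flag bit. An illustrative 64-bit version, not the exact published scheme:

```c
#include <stdint.h>
#include <stdbool.h>

/* Count one-bits in a 64-bit word (portable popcount). */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* A preset block holds all ones; each 0 we store costs one RESET.
 * If the data word has more zeros than ones, storing its complement
 * (plus a 1-bit flag) needs fewer resets. Illustrative only. */
typedef struct { uint64_t cell; bool complemented; } pcm_word_t;

void pcm_store(pcm_word_t *w, uint64_t data) {
    int zeros = 64 - popcount64(data);
    if (zeros > 32) {                 /* complement has fewer zeros */
        w->cell = ~data;
        w->complemented = true;
    } else {
        w->cell = data;
        w->complemented = false;
    }
}

uint64_t pcm_load(const pcm_word_t *w) {
    return w->complemented ? ~w->cell : w->cell;
}
```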
Similar to flash memory: long latencies on reads (and even worse on writes), though better than flash. Write energy is high.
Write latency can be 5-20× the read latency.
Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures
Zhongjie Chen
Introduction
• Use Phase-change Random Access Memory (PRAM) as a promising candidate to achieve a scalable, low-power and thermal-friendly memory system architecture in the upcoming 3D-stacking technology era.
• Use a hybrid PRAM/DRAM memory architecture and exploit an OS-level paging scheme to improve PRAM write performance and lifetime.
• Leverage the error-correcting capability of strong ECC codes to expand PRAM lifespan, and use wear-out aware OS page allocation to minimize ECC performance overhead.
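One plausible reading of wear-out aware page allocation: order the free list by observed wear (e.g. correctable-error counts from the ECC logic) so healthy pages are handed out first and heavily worn pages, which need the strongest ECC, are used last. A hedged sketch, not the paper's actual algorithm:

```c
#include <stdlib.h>
#include <stdint.h>

#define NUM_PAGES 4096

/* Illustrative per-page health metric: correctable-error count reported
 * by the ECC logic; higher means more worn out. */
typedef struct { int page; uint32_t ecc_errors; } page_info_t;

static int by_wear(const void *a, const void *b) {
    const page_info_t *pa = a, *pb = b;
    return (pa->ecc_errors > pb->ecc_errors) - (pa->ecc_errors < pb->ecc_errors);
}

/* Order the free list so the least-worn pages are allocated first,
 * deferring use of pages that already lean on strong ECC. */
void build_free_list(page_info_t *pages, int n) {
    qsort(pages, n, sizeof(page_info_t), by_wear);
}
```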
Introduction
• DRAM technologies are facing both scalability and power issues.
• 3D technology creates multiple active dies stacked vertically on a single chip and uses short Through-Silicon Vias (TSVs) to interconnect circuits across layers, reducing wire length and delay.
• To achieve low power, traditional DRAM power management techniques attempt to eliminate unnecessary refreshes or put idle banks into a power-saving mode. However, the temperature-dependent DRAM leakage cannot be overcome this way.
Introduction
• PRAM, Phase-change Random Access Memory, is a type of non-volatile memory that uses the unique behavior of chalcogenide glass, which can be switched between two states (crystalline and amorphous) with the application of heat.
• The desirable characteristics of PRAM include random access, fast read access, low standby power, superior scalability, compatibility with the CMOS process, etc.
• Low standby power is a common feature of all non-volatile storage, as data is retained even when not powered.
• High-temperature-friendly operation is a unique characteristic of PRAM: to store data in PRAM, the temperature must be elevated to switch the state of the cells.
Introduction
• Two major challenges are PRAM's high write latency and limited endurance.
• A hybrid main memory design composed of a large portion of PRAM used as the primary memory space and a small portion of DRAM that serves as a write buffer (sketched after this list)
• OS-level paging scheme
• Error Correction Code
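A minimal sketch of the hybrid write path under these design points: writes are absorbed in the small DRAM buffer, and PRAM sees only dirty evictions. Page granularity, round-robin eviction, and all names here are illustrative assumptions, not the paper's actual policy:

```c
#include <stdio.h>
#include <stdbool.h>

#define DRAM_PAGES 64            /* small DRAM write buffer */

typedef struct { long ppage; bool dirty; bool valid; } buf_entry_t;
static buf_entry_t dram_buf[DRAM_PAGES];
static unsigned clock_hand;      /* simple round-robin eviction */

/* Stub: stand-in for an actual PRAM write in a real system. */
static void pram_writeback(long ppage) {
    printf("PRAM write-back of page %ld\n", ppage);
}

/* Route a write through the DRAM buffer: hits absorb the write;
 * misses evict a (possibly dirty) page back to PRAM first. */
void hybrid_write(long ppage) {
    for (int i = 0; i < DRAM_PAGES; i++)
        if (dram_buf[i].valid && dram_buf[i].ppage == ppage) {
            dram_buf[i].dirty = true;   /* write absorbed in DRAM */
            return;
        }
    buf_entry_t *victim = &dram_buf[clock_hand++ % DRAM_PAGES];
    if (victim->valid && victim->dirty)
        pram_writeback(victim->ppage);  /* only evictions touch PRAM */
    victim->ppage = ppage;
    victim->dirty = true;
    victim->valid = true;
}
```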
Background
Background
• The basic structure of a PRAM cell is a standard NMOS access transistor plus a small volume of phase-change material, GST.
• A PRAM sub-array consists of a number of cells, decoders for row/column addresses, sense amplifiers (S/As) and write drivers (W/Ds).
PRAM Power Characterization under 3D Integration Technology
• PRAM has substantially reduced read and standby power consumption. Overall PRAM power is dominated by its programming power.
• One-dimensional heat conduction model
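In its standard steady-state form (generic symbols, not necessarily the paper's exact notation), one-dimensional conduction is Fourier's law:

$$ q = -kA\,\frac{dT}{dx} \quad\Longrightarrow\quad \Delta T \approx \frac{q\,t}{kA} $$

where $q$ is the heat flow through a stacked layer, $k$ its thermal conductivity, $A$ the cross-sectional area, and $t$ the layer thickness: the temperature rise across each die grows with the power pushed through it.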
The 3D Die-Stacked Hybrid PRAM/DRAM System
Life Span Optimization using Varying ECC Strength and OS-level Wear Leveling
Experimental Methodology
Power Saving of PRAM Technology
NC = normal cooling, AC = aggressive cooling
Endurance Enhancement
• The lifetime of PRAM-based memory is estimated as the number of cycles elapsed before the first memory access failure occurs.
• Applications in the high-miss category stress memory more intensively than others, leading to a shorter lifetime on the baseline design.
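Under ideal wear leveling, a back-of-the-envelope first-failure estimate (a standard approximation, not the paper's exact model) is:

$$ \text{lifetime} \approx \frac{E_{\text{cell}} \times C}{B_{\text{write}}} $$

where $E_{\text{cell}}$ is the per-cell write endurance, $C$ the memory capacity, and $B_{\text{write}}$ the sustained write traffic. Without wear leveling, the hottest cell's write count rather than the average determines the first failure, which is why high-miss workloads shorten the baseline lifetime so sharply.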
Thermal Relief and Performance Benefit
Memory-Intensive Workload Results
Conclusions
• The 3D TSVs provide significantly reduced wire length and resistance, which reduce both PRAM access delay and power consumption.
• The high-temperature-driven operation and low standby power make PRAM more thermal-friendly for a 3D die-stacked memory architecture.
• The power savings relax 3D thermal constraints and consequently achieve an average 1.07× speedup across all experimented workloads.
An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth
Dong Hyuk Woo, Nak Hee Song, Dean L. Lewis, & Hsien-Hsin S. Lee
Background & Introduction
• Motivation
– Memory bandwidth is a limitation
• A 3D memory-core stack may address this issue
– DRAM layer = 1 GB; 16 layers = 16 GB total
– Thermal considerations
Background & Introduction
• Prior studies
– Increase memory bus width
– Increase frequency
– More memory channels
– Max out the L2 to minimize memory traffic
• No need for TSVs, then?
– On-chip stacking actually calls for more memory traffic
– Use TSVs
Manufacturing Challenges
• TSVs
– 4 µm² vias
– Up to several million TSVs per cm²
• Coupled production
– Logic layer & DRAM manufactured together
Memory Bandwidth Challenges
• Goodman vs. Smith: smaller vs. larger cache lines
– Which locality to exploit?
• Larger cache lines
– False sharing
– GHB (global history buffer) stride prefetching
– Region prefetching (too complex to implement)
– Trailing-edge effect (TEE): when a cache line referenced earlier has not yet been completely brought into the cache when it is referenced again, the second reference pays additional penalty cycles. A high-density TSV bus eliminates TEE without unnecessary complications in hardware or software.
3D-DRAM-Aware Processor
Huge bandwidth opportunities & challenges
– Fully exploit single-threaded applications
• In prior work the worst case leaves the roughly 1 Tb/s TSV array underutilized
– "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed – you can't bribe God."
– Latency cured by bandwidth
• Bring in more data with every prefetch
• Exploit spatial locality with larger cache lines
– Larger cache lines
• 64B to 4KB
• Cache sizes of 32KB to 8MB
Line Size Effect Experiment Setup
• Benchmarks
– SPEC2006 (int/fp)
– Olden
– NU-MineBench
• Grouped benchmarks
– 8 groups
– 1 application shown per group
• Plots
– MPKI vs. line size, across cache sizes
MCF
[Figure: MPKI vs. line size (64B to 4KB) for 429.mcf at cache sizes from 32KB to 8MB. Annotations mark cache pollution and compare an 8MB cache with 64B lines against a 256KB cache with 4KB lines.]
This gives an idea of how cache misses depend on cache size and line size. Programmers can improve cache reuse by restructuring data and code, as sketched below.
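A classic example of such restructuring is loop tiling (blocking); this sketch uses an illustrative tile size that would be tuned to the cache and line size:

```c
#define N    1024
#define TILE 64          /* illustrative: tuned to cache/line size */

/* Tiled matrix transpose: touching the matrices in TILE x TILE blocks
 * keeps each fetched cache line live until it is fully used, cutting
 * MPKI relative to a naive row-by-column sweep. */
void transpose_tiled(const double A[N][N], double B[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    B[j][i] = A[i][j];
}
```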
Group 1
• Soplex
– Cache pollution at small cache sizes
– Misses decrease with larger cache sizes and larger cache line sizes
– Spatial locality dominates
• Good candidate for 3D-DRAM
Group 2 vs Group 3
Perlbench vs. Libquantum
• Perlbench
– MPKI increases monotonically with line size
– Loses sensitivity at larger cache sizes: the entire working set fits in cache
• Libquantum
– Bad temporal locality, good spatial locality
– A larger cache line always benefits
Line Size Summary
• Most applications benefit from a small L1 (64B) line size
• Larger cache lines for L2
– A max of 4KB works well for large caches
• TEE would eliminate any gains from larger cache lines
– A sufficiently wide TSV bus eliminates TEE
Wide Bus Implementation
• 4KB line size
– Reduced miss rate
– Increased access latency
For an 8-way 1MB cache:
– 64B lines: 2^12 bitlines × 2^11 wordlines
– 4KB lines: 2^18 bitlines × 2^5 wordlines (64× the bitlines, 1/64 the wordlines)
Conventional Subarray Technique for Caches
[Figure: cache controller feeding subbanks of mats and subarrays; 64B read/write requests from the L1$, 64B fill/write-back traffic over 64B of TSVs, with the datapath narrowing 64B/32B/16B down the H-tree.]
– Mats: all arrays in a single mat share predecoding logic
– Each mat is connected to the H-tree
– The white line in the figure is a line written into 4 mats of a subbank from the controller
– Cache access latency depends on the number of subarrays in a row AND in a column
– An 8-way 1MB cache with 4KB lines has 32 rows and 32,768 columns: LONG delays and a SUPER wide H-tree
– Filling 4KB of data consumes 64 cycles of L2 bandwidth (see below), too long a window in which the processor is locked out
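The 64-cycle figure is just the transfer count over a 64B-wide L2 interface:

$$ \frac{4\,\text{KB}}{64\,\text{B}} = 64 \ \text{transfers} \;\Rightarrow\; 64 \ \text{L2 cycles per fill} $$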
SMART-3D Cache Design
[Figure: 64 subbanks (4KB/64B = 64); 64B read/write requests from the L1$.]
SMART-3D Cache Design
[Figure: 4KB fill/write-back traffic, 64B per subbank.]
– One or two cycles higher latency
– 64 subbanks allow 64 simultaneous 64B operations between the L2 and 3D-DRAM
– Read/write operations use the conventional H-tree network; fills use the TSV bus shown here
– 1-2 cycles slower than an optimal cache, given a 3GHz clock
Cache-Memory Operation Challenges
• Fetch 64 cache lines simultaneously
– 64B or 4KB eviction granularity?
– Local LRU or global LRU?
• They opted for global LRU
– Simpler design of the MSHR and the write-back buffer
– Conventional associative lookup process
• Invalidates
– An L2 eviction could evict 64 L1 lines!
– An inclusion bit minimizes traffic to L1
• L2 bank & memory controller coupling
Option 1: DRAM Design
[Figure: 256Mb DRAM array per tile, 256 TSVs per tile, 2KB row buffer; 128 array (= one bank) lookups per row-buffer miss.]
– Challenge: spacing. 2 µm × 2 µm TSVs with a 4 µm pitch (pitch = width of the component + space between components)
– Based on a standard DRAM design, each tile has a row decoder and a column decoder
– TSV pitch must fit the width of a DRAM tile, so only 256 TSVs fit within one tile's width
– 32K TSVs therefore require 16 × 8 tiles
– This arrangement is power hungry: 128 DRAM tiles are activated simultaneously on an L2 miss!
Option 2: DRAM Design
[Figure: 256Mb DRAM array per tile, 256 TSVs per tile, 32K shared TSVs; two array (= one bank) lookups per row-buffer miss.]
– The 128 tiles share a common TSV bus, with all TSVs in the middle
– Only 2 tiles are activated on an L2 miss
Option 3: DRAM Design
[Figure: folded DRAM layers; 256Mb DRAM array per tile, 32K shared TSVs, 256 TSVs per tile.]
– A variation of design 2: split the single DRAM layer of design 2 into 4 layers
– Each layer uses only its own 8K-bit portion of the TSVs for data
– Requires larger overall space, but wire length between tiles decreases
Multi-Socket Challenges
• Potential to make false sharing huge in multi-socket systems
– No shared cache between sockets
• Multi-socket machines form NUMA systems
– Multiple memory controllers for multiple 3D-DRAM stacks
• Solution: full-page fetching is suppressed
– When the target page is shared with a remote socket
– When a line of the target page is cached in an L2 on a remote socket
• How?
– L2 looks at the requested physical address
– Use a Bloom filter at page granularity (sketched below)
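A page-granularity Bloom filter is a bit array probed with k hashes: inserts set k bits, and a lookup answering "maybe present" is a safe reason to suppress the full-page fetch. A generic sketch (sizes and hash mixing are illustrative, not the paper's parameters):

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 4096
#define NUM_HASHES  3

static uint8_t filter[FILTER_BITS / 8];

/* Two cheap mixes of the page number stand in for independent hashes. */
static uint32_t hash_i(uint64_t page, int i) {
    uint64_t h = page * 0x9E3779B97F4A7C15ULL + (uint64_t)i * 0xBF58476D1CE4E5B9ULL;
    h ^= h >> 31;
    return (uint32_t)(h % FILTER_BITS);
}

void bloom_insert(uint64_t page) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash_i(page, i);
        filter[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

/* False positives are possible, false negatives are not: "maybe shared"
 * is a safe reason to fall back to a 64B fetch. */
bool bloom_maybe_contains(uint64_t page) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash_i(page, i);
        if (!(filter[b / 8] & (1u << (b % 8)))) return false;
    }
    return true;
}
```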
Miss Handling Process
• 64B line
– Coherence at 64B-line granularity
– MESI protocol
• L2 miss
– L2 finds a victim page using global LRU
– Allocates sixty-four 64B lines
– Creates MSHR entries for the lines not already cached (see the sketch below)
• Processor fetch scenarios
– 1: The required page is mapped and not shared
– 2: Same scenario, but the Bloom filter bit is set (to count accesses to a page)
– 3: The line maps to remote memory
– 4: Another processor delivers the line
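The per-line MSHR bookkeeping on a page-granularity miss might look like this sketch (the structure, sizes and the l2_line_cached hook are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES_PER_PAGE 64     /* 4KB page / 64B line */

typedef struct { uint64_t line_addr; bool pending; } mshr_t;
static mshr_t mshr[LINES_PER_PAGE];

/* Stub: a real L2 would check its tag array here. */
static bool l2_line_cached(uint64_t line_addr) {
    (void)line_addr;
    return false;
}

/* On a page miss, allocate MSHR entries only for the lines that are
 * not already cached, so in-flight 64B data is never fetched twice. */
int allocate_page_mshrs(uint64_t page_addr) {
    int allocated = 0;
    for (int i = 0; i < LINES_PER_PAGE; i++) {
        uint64_t line = page_addr + (uint64_t)i * 64;
        if (!l2_line_cached(line)) {
            mshr[allocated].line_addr = line;
            mshr[allocated].pending   = true;
            allocated++;
        }
    }
    return allocated;          /* number of 64B fills outstanding */
}
```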
Miss Handling
• Miss with TEE
– A miss waiting on a coherence response from other processors generates a conventional coherence message
– A previous miss is already fetching the entire page: wait for the page to arrive
– A previous miss is fetching the 64B line alone: generate a second, separate miss-handling process
Miss Handling
• Processor receives a request
– For a line from a page still waiting on Bloom-filter data from other processors: the initial requesting processor settles for the 64B line instead of the page
– For a line from a page already being fetched from main memory: the initial requester responds to the second requester after caching completes
Evaluation
• Simulation framework
– SuperESCalar (SESC) simulator
• Single-core single-threaded & multi-core multi-application
– SPEC2006, NU-MineBench, & Olden benchmarks
• Multi-threaded applications
– SPLASH-2
• Processor
– 3GHz, 4-wide, 14-stage OoO processor
– L1-I: 2-way, 64B line, 32KB, 1-cycle latency
– L1-D: 2-way, 64B line, 32KB, 2-cycle latency
Configurations
– 2D-Base: L2$ 8-way, 64B line, 1MB, 6-cycle; DRAM: 8B-wide, 12.8 GB/s bus, 350-cycle
– 2D-GHB: 2D-Base + Global History Buffer prefetcher
– 2D-VLS: 2D-Base + Virtual Line Scheme w/o software control
– 3D-Base: same L2$ as 2D-Base; DRAM: 64B-wide, 250-cycle
– 3D-GHB: 3D-Base + GHB
– SMART-3D: L2$ 8-way, 4KB line, 1MB, 7-cycle; DRAM: 4KB-wide, 250-cycle
Single Core Results
[Figure: speedup over 2D-Base for 429.mcf, 462.libquantum, 471.omnetpp, 473.astar, 483.xalancbmk, 410.bwaves, 433.milc, 436.cactusADM, 437.leslie3d, 450.soplex, 459.GemsFDTD, 482.sphinx3, and the geomean (MI). Geomean speedups: 2D-GHB 1.40×, 2D-VLS 0.82×, 3D-Base 1.25×, 3D-GHB 1.69×, SMART-3D 2.14×, Perfect L2 2.83×. Benchmarks with bad spatial locality are called out.]
– SMART-3D looks outstanding despite its additional access latency vs. 3D-Base
– The 2D Virtual Line Scheme performs poorly because of TEE penalties, demonstrating that a conventional architecture cannot utilize a 4KB line size
– Drawbacks of SMART-3D: increased conflict misses; cactusADM and astar had increased MPKI (see the earlier charts). A 3D-DRAM-aware compiler fixes this.
Performance & Area
– SMART-3D requires more area: an MSHR holding eight 4KB lines as 64 per-subbank 64B entries (approx. 32KB) is 64× larger than the baseline's
– 1MB L2s: a bigger L2 is just not worth it
– The next slide suggests a 1MB L2 is probably the performance sweet spot for SMART-3D
Cache Size Sensitivity
– Speedup of SMART-3D with an 8-way L2 from 128KB to 2MB, using the latencies given in Table 3: 6 cycles for 128KB, 256KB and 512KB; 7 for 1MB; 8 for 2MB
Dual-Core Results
[Figure: speedup over the 2D-Base (2MB) baseline across SPEC2006 workload pairs. Geomean speedups: 3D-GHB (4MB) 1.96×, SMART-3D (1MB) 2.31×, SMART-3D (2MB) 2.40×, Perfect L2 3.12×.]
4-Core System – Multi-Program
– SMART-3D's performance dominance continues; cactusADM's poor spatial locality degrades performance
Multi-Threaded on Multi-Core
[Figure: speedup of 2D-Base (nMB), 3D-Base (nMB), SMART-3D (nMB) and Perfect L2 on 2-, 4- and 8-core processors for the SPLASH-2 benchmarks barnes, cholesky, fft, fmm, lu, ocean, radiosity, radix, raytrace, volrend, water-n2, water-sp, and their geomean.]
– SPLASH-2 benchmarks; L2 scaling across 2-, 4- and 8-core processors
– fmm and water-n2 don't show the same kind of improvement: they are compute intensive
Multi-Socket
[Figure: speedup of 2D-Base (1MB each), 3D-Base (1MB each), SMART-3D (1MB each) and Perfect L2.]
– Models a 2-socket, single-core-per-socket system connected by an off-chip bus
– Assumes a write-through L1; MESI protocol between L2s on different sockets
– No compiler optimizations; raytrace performed poorly because of L2 misses and few sharers per page
Energy Consumption
• Less DRAM lookup energy
• More L2 R/W energy
• More L2 fill/write-back energy
• More bus fill/write-back energy
Energy & Relative Traffic
Dynamic Energy Consumption
Energy Result
[Figure: normalized energy (0 to 1) for 462.libquantum on 3D-Base vs. SMART-3D, broken down into L2 read, L2 write, L2 fill, L2 write-back, TSVs for the DRAM bus, wires on a DRAM layer, and DRAM array lookup.]
Conclusions
• Stacking memory alone is not enough
• Larger fetches
• More energy efficient
• 2× speedup
Micron HMC
• Uses TSVs in an off-chip Hybrid Memory Cube; increased bandwidth is still needed
– 15× DDR3
• Lower energy
– 70% less energy
• Memory footprint
– 90% reduction vs. RDIMMs
References
• http://www.micron.com/products/hybrid-memory-cube
• Loh, Gabriel H., "3D-Stacked Memory Architectures for Multi-Core Processors", International Symposium on Computer Architecture, 2008.
• Loi, G. L., Agrawal, B., Srivastava, N., Lin, S.-C., Sherwood, T., and Banerjee, K., "A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy", DAC, 2006.
• http://www.dhwoo.net/research