CSCE 6610: Advanced Computer Architecture
Project presentations on Tuesday, Dec 6. I also need your project reports on Dec 6.
Take-home final exam assigned on Dec 6; due on Dec 13.
Since it is a take-home, I expect comprehensive and detailed discussions.
Nov. 29, 2016
Last week we discussed Decoupled Software Pipelining to use multiple cores for a single thread.
Today, I want to provide details regarding non-volatile memories and DRAM optimizations.
Non-volatile memories
Flash memories: two types, NAND and NOR
Based on standard silicon gates. In floating-gate cells, the stored charge is isolated by an insulating layer, so the charge is not lost when power is removed; hence non-volatile.
The default state of a cell is "1". To store a zero, the charge must be reversed; a high voltage is applied for this purpose.
Erasing is done at block granularity, so data is not modified in place: a new block must be found for the new data, and the old block is marked for block erase.
Limited write endurance. Long latencies on reads, and even longer on writes.
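Because erases happen at block granularity, the controller always writes out of place and remaps. A minimal sketch of that remapping idea in C (all names and structures are hypothetical, not any real FTL's API):

```c
#include <stdint.h>
#include <string.h>

#define NUM_BLOCKS  1024
#define BLOCK_BYTES 4096

static uint8_t flash[NUM_BLOCKS][BLOCK_BYTES]; /* simulated flash array  */
static int     l2p[NUM_BLOCKS];                /* logical->physical map  */
static int     dirty[NUM_BLOCKS];              /* marked for block erase */

static void ftl_init(void) {
    for (int i = 0; i < NUM_BLOCKS; i++) l2p[i] = -1;  /* nothing mapped */
}

/* Find a physical block that is neither mapped nor awaiting erase. */
static int find_free_block(void) {
    int used[NUM_BLOCKS] = {0};
    for (int l = 0; l < NUM_BLOCKS; l++)
        if (l2p[l] >= 0) used[l2p[l]] = 1;
    for (int p = 0; p < NUM_BLOCKS; p++)
        if (!used[p] && !dirty[p]) return p;
    return -1;  /* a real controller would trigger garbage collection */
}

/* Out-of-place update: write the new data to a fresh block, remap the
 * logical block, and mark the old physical block for erase. */
static int flash_write(int logical, const uint8_t *data) {
    int fresh = find_free_block();
    if (fresh < 0) return -1;
    memcpy(flash[fresh], data, BLOCK_BYTES);
    if (l2p[logical] >= 0) dirty[l2p[logical]] = 1;
    l2p[logical] = fresh;
    return 0;
}
```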
Techniques to extend the life of flash memories: internal cache memory, with SRAM as a buffer, so
flash cells are not modified on every write operation.
Use error-correcting codes to overcome some errors. Keep block wear uniform (wear leveling).
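Uniform block wear can be approximated by steering every new write to the least-erased free block. A hedged sketch, with counters that a real controller would keep in flash metadata:

```c
#include <stdint.h>

#define NUM_BLOCKS 1024

/* Illustrative per-block erase counters and free flags. */
static uint32_t erase_count[NUM_BLOCKS];
static int      is_free[NUM_BLOCKS];

/* Wear leveling: among free blocks, pick the one erased least often,
 * so wear spreads uniformly instead of hammering a few hot blocks. */
int pick_block_for_write(void) {
    int best = -1;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (is_free[b] && (best < 0 || erase_count[b] < erase_count[best]))
            best = b;
    }
    return best;  /* -1 if no free block is available */
}
```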
Other non-volatile memories
Phase change memory: a material that is in one of two states, crystalline or amorphous. The state is changed by heating the cell; the phase is detected via the difference in electrical resistance.
Single-level: one bit per cell. Multi-level: store multiple bits (four states between crystalline and amorphous encode 2 bits).
It takes longer to set a cell than to reset it, so one idea is to "preset" a block so that only a few bits need to be reset.
As soon as data is supplied from PCM to the cache, the block is preset; when the data is written back to PCM, only a few bits need to be reset.
Other ideas: store either the true or the complement value, depending on how many 1's are in the block.
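This true-or-complement trick (Flip-N-Write-style encoding) is easy to sketch: with a block preset to all ones, each stored 0 costs a reset, so if the data has more zeros than ones we store its complement plus a flag bit. An illustrative 64-bit version, not the exact published scheme:

```c
#include <stdint.h>
#include <stdbool.h>

/* Count one-bits in a 64-bit word (portable popcount). */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* A preset block holds all ones; each 0 we store costs one RESET.
 * If the data word has more zeros than ones, storing its complement
 * (plus a 1-bit flag) needs fewer resets. Illustrative only. */
typedef struct { uint64_t cell; bool complemented; } pcm_word_t;

void pcm_store(pcm_word_t *w, uint64_t data) {
    int zeros = 64 - popcount64(data);
    if (zeros > 32) {                 /* complement has fewer zeros */
        w->cell = ~data;
        w->complemented = true;
    } else {
        w->cell = data;
        w->complemented = false;
    }
}

uint64_t pcm_load(const pcm_word_t *w) {
    return w->complemented ? ~w->cell : w->cell;
}
```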
Similar to flash memory: long latencies on reads (and even worse on writes), though better than flash. Write energy is high.
Write latency can be 5-20× the read latency.
Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures
Zhongjie Chen
Introduction
• Use Phase-change Random Access Memory (PRAM) as a promising candidate to achieve a scalable, low-power and thermal-friendly memory system architecture in the upcoming 3D-stacking technology era.
• Use a hybrid PRAM/DRAM memory architecture and exploit an OS-level paging scheme to improve PRAM write performance and lifetime.
• Leverage the error-correcting capability of strong ECC codes to expand PRAM lifespan, and use wear-out aware OS page allocation to minimize ECC performance overhead.
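One plausible reading of wear-out aware page allocation: order the free list by observed wear (e.g. correctable-error counts from the ECC logic) so healthy pages are handed out first and heavily worn pages, which need the strongest ECC, are used last. A hedged sketch, not the paper's actual algorithm:

```c
#include <stdlib.h>
#include <stdint.h>

#define NUM_PAGES 4096

/* Illustrative per-page health metric: correctable-error count reported
 * by the ECC logic; higher means more worn out. */
typedef struct { int page; uint32_t ecc_errors; } page_info_t;

static int by_wear(const void *a, const void *b) {
    const page_info_t *pa = a, *pb = b;
    return (pa->ecc_errors > pb->ecc_errors) - (pa->ecc_errors < pb->ecc_errors);
}

/* Order the free list so the least-worn pages are allocated first,
 * deferring use of pages that already lean on strong ECC. */
void build_free_list(page_info_t *pages, int n) {
    qsort(pages, n, sizeof(page_info_t), by_wear);
}
```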
Introduction
• DRAM technologies are facing both scalability and power issues.
• 3D technology creates multiple active dies stacked vertically on a single chip and uses short Through-Silicon Vias (TSVs) to interconnect circuits across layers, reducing wire length and delay.
• To achieve low power, traditional DRAM power management techniques attempt to eliminate unnecessary refreshes or put idle banks into a power-saving mode. However, the temperature-dependent DRAM leakage cannot be overcome this way.
Introduction
• PRAM, Phase-change Random Access Memory, is a type of non-volatile memory that uses the unique behavior of chalcogenide glass, which can be switched between two states (crystalline and amorphous) with the application of heat.
• The desirable characteristics of PRAM include random access, fast read access, low standby power, superior scalability, compatibility with the CMOS process, etc.
• Low standby power is a common feature of all non-volatile storage, as data is retained even when not powered.
• High-temperature-friendly operation is a unique characteristic of PRAM: to store data in PRAM, the temperature must be elevated to switch the state of the cells.
Introduction
• Two major challenges are PRAM's high write latency and limited endurance.
• A hybrid main memory design composed of a large portion of PRAM used as the primary memory space and a small portion of DRAM that serves as a write buffer (sketched after this list)
• OS-level paging scheme
• Error Correction Code
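A minimal sketch of the hybrid write path under these design points: writes are absorbed in the small DRAM buffer, and PRAM sees only dirty evictions. Page granularity, round-robin eviction, and all names here are illustrative assumptions, not the paper's actual policy:

```c
#include <stdio.h>
#include <stdbool.h>

#define DRAM_PAGES 64            /* small DRAM write buffer */

typedef struct { long ppage; bool dirty; bool valid; } buf_entry_t;
static buf_entry_t dram_buf[DRAM_PAGES];
static unsigned clock_hand;      /* simple round-robin eviction */

/* Stub: stand-in for an actual PRAM write in a real system. */
static void pram_writeback(long ppage) {
    printf("PRAM write-back of page %ld\n", ppage);
}

/* Route a write through the DRAM buffer: hits absorb the write;
 * misses evict a (possibly dirty) page back to PRAM first. */
void hybrid_write(long ppage) {
    for (int i = 0; i < DRAM_PAGES; i++)
        if (dram_buf[i].valid && dram_buf[i].ppage == ppage) {
            dram_buf[i].dirty = true;   /* write absorbed in DRAM */
            return;
        }
    buf_entry_t *victim = &dram_buf[clock_hand++ % DRAM_PAGES];
    if (victim->valid && victim->dirty)
        pram_writeback(victim->ppage);  /* only evictions touch PRAM */
    victim->ppage = ppage;
    victim->dirty = true;
    victim->valid = true;
}
```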
Background
Background
• The basic structure of a PRAM cell is a standard NMOS access transistor plus a small volume of phase-change material, GST.
• A PRAM sub-array consists of a number of cells, decoders for row/column addresses, sense amplifiers (S/As) and write drivers (W/Ds).
PRAM Power Characterization under 3D Integration Technology
• PRAM has substantially reduced read and standby power consumption. Overall PRAM power is dominated by its programming power.
• One-dimensional heat conduction model
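In its standard steady-state form (generic symbols, not necessarily the paper's exact notation), one-dimensional conduction is Fourier's law:

$$ q = -kA\,\frac{dT}{dx} \quad\Longrightarrow\quad \Delta T \approx \frac{q\,t}{kA} $$

where $q$ is the heat flow through a stacked layer, $k$ its thermal conductivity, $A$ the cross-sectional area, and $t$ the layer thickness: the temperature rise across each die grows with the power pushed through it.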
The 3D Die-Stacked Hybrid PRAM/DRAM System
Life Span Optimization using Varying ECC Strength and OS-level Wear Leveling
Experimental Methodology
Power Saving of PRAM Technology
NC = normal cooling, AC = aggressive cooling
Endurance Enhancement
• The lifetime of PRAM-based memory is estimated as the number of cycles elapsed before the first memory access failure occurs.
• Applications in the high-miss category stress memory more intensively than others, leading to a shorter lifetime on the baseline design.
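Under ideal wear leveling, a back-of-the-envelope first-failure estimate (a standard approximation, not the paper's exact model) is:

$$ \text{lifetime} \approx \frac{E_{\text{cell}} \times C}{B_{\text{write}}} $$

where $E_{\text{cell}}$ is the per-cell write endurance, $C$ the memory capacity, and $B_{\text{write}}$ the sustained write traffic. Without wear leveling, the hottest cell's write count rather than the average determines the first failure, which is why high-miss workloads shorten the baseline lifetime so sharply.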
Thermal Relief and Performance Benefit
Memory-Intensive Workload Results
Conclusions
• The 3D TSVs provide significantly reduced wire length and resistance, which reduce both PRAM access delay and power consumption.
• The high-temperature-driven operation and low standby power make PRAM more thermal-friendly for a 3D die-stacked memory architecture.
• The power savings relax 3D thermal constraints and consequently achieve an average 1.07× speedup across all experimented workloads.
An Optimized 3D-Stacked Memory Architecture by Exploiting Excessive, High-Density TSV Bandwidth
Dong Hyuk Woo, Nak Hee Song, Dean L. Lewis, & Hsien-Hsin S. Lee
Background & Introduction
• Motivation
– Memory bandwidth is a limitation
• A 3D memory-core stack may address this issue
– DRAM layer = 1 GB; 16 layers = 16 GB total
– Thermal considerations
Background & Introduction
• Prior studies
– Increase memory bus width
– Increase frequency
– More memory channels
– Max out the L2 to minimize memory traffic
• No need for TSVs, then?
– On-chip stacking actually calls for more memory traffic
– Use TSVs
Manufacturing Challenges
• TSVs
– 4 µm² vias
– Up to several million TSVs per cm²
• Coupled production
– Logic layer & DRAM manufactured together
Memory Bandwidth Challenges
• Goodman vs. Smith: smaller vs. larger cache lines
– Which locality to exploit?
• Larger cache lines
– False sharing
– GHB (global history buffer) stride prefetching
– Region prefetching (too complex to implement)
– Trailing-edge effect (TEE): when a cache line referenced earlier has not yet been completely brought into the cache when it is referenced again, the second reference pays additional penalty cycles. A high-density TSV bus eliminates TEE without unnecessary complications in hardware or software.
3D-DRAM-Aware Processor
Huge bandwidth opportunities & challenges
– Fully exploit single-threaded applications
• In prior work the worst case leaves the roughly 1 Tb/s TSV array underutilized
– "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed – you can't bribe God."
– Latency cured by bandwidth
• Bring in more data with every prefetch
• Exploit spatial locality with larger cache lines
– Larger cache lines
• 64B to 4KB
• Cache sizes of 32KB to 8MB
Line Size Effect Experiment Setup
• Benchmarks
– SPEC2006 (int/fp)
– Olden
– NU-MineBench
• Grouped benchmarks
– 8 groups
– 1 application shown per group
• Plots
– MPKI vs. line size, across cache sizes
MCF
[Figure: MPKI vs. line size (64B to 4KB) for 429.mcf at cache sizes from 32KB to 8MB. Annotations mark cache pollution and compare an 8MB cache with 64B lines against a 256KB cache with 4KB lines.]
This gives an idea of how cache misses depend on cache size and line size. Programmers can improve cache reuse by restructuring data and code, as sketched below.
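A classic example of such restructuring is loop tiling (blocking); this sketch uses an illustrative tile size that would be tuned to the cache and line size:

```c
#define N    1024
#define TILE 64          /* illustrative: tuned to cache/line size */

/* Tiled matrix transpose: touching the matrices in TILE x TILE blocks
 * keeps each fetched cache line live until it is fully used, cutting
 * MPKI relative to a naive row-by-column sweep. */
void transpose_tiled(const double A[N][N], double B[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    B[j][i] = A[i][j];
}
```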
Group 1
• Soplex
– Cache pollution at small cache sizes
– Misses decrease with larger cache sizes and larger cache line sizes
– Spatial locality dominates
• Good candidate for 3D-DRAM
Group 2 vs Group 3
Perlbench vs. Libquantum
• Perlbench
– MPKI increases monotonically with line size
– Loses sensitivity at larger cache sizes: the entire working set fits in cache
• Libquantum
– Bad temporal locality, good spatial locality
– A larger cache line always benefits
Line Size Summary
• Most applications benefit from a small L1 (64B) line size
• Larger cache lines for L2
– A max of 4KB works well for large caches
• TEE would eliminate any gains from larger cache lines
– A sufficiently wide TSV bus eliminates TEE
Wide Bus Implementation
• 4KB line size
– Reduced miss rate
– Increased access latency
For an 8-way 1MB cache:
– 64B lines: 2^12 bitlines × 2^11 wordlines
– 4KB lines: 2^18 bitlines × 2^5 wordlines (64× the bitlines, 1/64 the wordlines)
Conventional Subarray Technique for Caches
[Figure: cache controller feeding subbanks of mats and subarrays; 64B read/write requests from the L1$, 64B fill/write-back traffic over 64B of TSVs, with the datapath narrowing 64B/32B/16B down the H-tree.]
– Mats: all arrays in a single mat share predecoding logic
– Each mat is connected to the H-tree
– The white line in the figure is a line written into 4 mats of a subbank from the controller
– Cache access latency depends on the number of subarrays in a row AND in a column
– An 8-way 1MB cache with 4KB lines has 32 rows and 32,768 columns: LONG delays and a SUPER wide H-tree
– Filling 4KB of data consumes 64 cycles of L2 bandwidth (see below), too long a window in which the processor is locked out
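The 64-cycle figure is just the transfer count over a 64B-wide L2 interface:

$$ \frac{4\,\text{KB}}{64\,\text{B}} = 64 \ \text{transfers} \;\Rightarrow\; 64 \ \text{L2 cycles per fill} $$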
SMART-3D Cache Design
[Figure: 64 subbanks (4KB/64B = 64); 64B read/write requests from the L1$.]
SMART-3D Cache Design
[Figure: 4KB fill/write-back traffic, 64B per subbank.]
– One or two cycles higher latency
– 64 subbanks allow 64 simultaneous 64B operations between the L2 and 3D-DRAM
– Read/write operations use the conventional H-tree network; fills use the TSV bus shown here
– 1-2 cycles slower than an optimal cache, given a 3GHz clock
Cache-Memory Operation Challenges
• Fetch 64 cache lines simultaneously
– 64B or 4KB eviction granularity?
– Local LRU or global LRU?
• They opted for global LRU
– Simpler design of the MSHR and the write-back buffer
– Conventional associative lookup process
• Invalidates
– An L2 eviction could evict 64 L1 lines!
– An inclusion bit minimizes traffic to L1
• L2 bank & memory controller coupling
Option 1: DRAM Design
[Figure: 256Mb DRAM array per tile, 256 TSVs per tile, 2KB row buffer; 128 array (= one bank) lookups per row-buffer miss.]
– Challenge: spacing. 2 µm × 2 µm TSVs with a 4 µm pitch (pitch = width of the component + space between components)
– Based on a standard DRAM design, each tile has a row decoder and a column decoder
– TSV pitch must fit the width of a DRAM tile, so only 256 TSVs fit within one tile's width
– 32K TSVs therefore require 16 × 8 tiles
– This arrangement is power hungry: 128 DRAM tiles are activated simultaneously on an L2 miss!
Option 2: DRAM Design
[Figure: 256Mb DRAM array per tile, 256 TSVs per tile, 32K shared TSVs; two array (= one bank) lookups per row-buffer miss.]
– The 128 tiles share a common TSV bus, with all TSVs in the middle
– Only 2 tiles are activated on an L2 miss
Option 3: DRAM Design
[Figure: folded DRAM layers; 256Mb DRAM array per tile, 32K shared TSVs, 256 TSVs per tile.]
– A variation of design 2: split the single DRAM layer of design 2 into 4 layers
– Each layer uses only its own 8K-bit portion of the TSVs for data
– Requires larger overall space, but wire length between tiles decreases
Multi-Socket Challenges
• Potential to make false sharing huge in multi-socket systems
– No shared cache between sockets
• Multi-socket machines form NUMA systems
– Multiple memory controllers for multiple 3D-DRAM stacks
• Solution: full-page fetching is suppressed
– When the target page is shared with a remote socket
– When a line of the target page is cached in an L2 on a remote socket
• How?
– L2 looks at the requested physical address
– Use a Bloom filter at page granularity (sketched below)
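A page-granularity Bloom filter is a bit array probed with k hashes: inserts set k bits, and a lookup answering "maybe present" is a safe reason to suppress the full-page fetch. A generic sketch (sizes and hash mixing are illustrative, not the paper's parameters):

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 4096
#define NUM_HASHES  3

static uint8_t filter[FILTER_BITS / 8];

/* Two cheap mixes of the page number stand in for independent hashes. */
static uint32_t hash_i(uint64_t page, int i) {
    uint64_t h = page * 0x9E3779B97F4A7C15ULL + (uint64_t)i * 0xBF58476D1CE4E5B9ULL;
    h ^= h >> 31;
    return (uint32_t)(h % FILTER_BITS);
}

void bloom_insert(uint64_t page) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash_i(page, i);
        filter[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

/* False positives are possible, false negatives are not: "maybe shared"
 * is a safe reason to fall back to a 64B fetch. */
bool bloom_maybe_contains(uint64_t page) {
    for (int i = 0; i < NUM_HASHES; i++) {
        uint32_t b = hash_i(page, i);
        if (!(filter[b / 8] & (1u << (b % 8)))) return false;
    }
    return true;
}
```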
Miss Handling Process
• 64B line
– Coherence at 64B-line granularity
– MESI protocol
• L2 miss
– L2 finds a victim page using global LRU
– Allocates sixty-four 64B lines
– Creates MSHR entries for the lines not already cached (see the sketch below)
• Processor fetch scenarios
– 1: The required page is mapped and not shared
– 2: Same scenario, but the Bloom filter bit is set (to count accesses to a page)
– 3: The line maps to remote memory
– 4: Another processor delivers the line
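The per-line MSHR bookkeeping on a page-granularity miss might look like this sketch (the structure, sizes and the l2_line_cached hook are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES_PER_PAGE 64     /* 4KB page / 64B line */

typedef struct { uint64_t line_addr; bool pending; } mshr_t;
static mshr_t mshr[LINES_PER_PAGE];

/* Stub: a real L2 would check its tag array here. */
static bool l2_line_cached(uint64_t line_addr) {
    (void)line_addr;
    return false;
}

/* On a page miss, allocate MSHR entries only for the lines that are
 * not already cached, so in-flight 64B data is never fetched twice. */
int allocate_page_mshrs(uint64_t page_addr) {
    int allocated = 0;
    for (int i = 0; i < LINES_PER_PAGE; i++) {
        uint64_t line = page_addr + (uint64_t)i * 64;
        if (!l2_line_cached(line)) {
            mshr[allocated].line_addr = line;
            mshr[allocated].pending   = true;
            allocated++;
        }
    }
    return allocated;          /* number of 64B fills outstanding */
}
```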
Miss Handling
• Miss with TEE
– A miss waiting on a coherence response from other processors generates a conventional coherence message
– A previous miss is already fetching the entire page: wait for the page to arrive
– A previous miss is fetching the 64B line alone: generate a second, separate miss-handling process
Miss Handling
• Processor receives a request
– For a line from a page still waiting on Bloom-filter data from other processors: the initial requesting processor settles for the 64B line instead of the page
– For a line from a page already being fetched from main memory: the initial requester responds to the second requester after caching completes
Evaluation
• Simulation framework
– SuperESCalar (SESC) simulator
• Single-core single-threaded & multi-core multi-application
– SPEC2006, NU-MineBench, & Olden benchmarks
• Multi-threaded applications
– SPLASH-2
• Processor
– 3GHz, 4-wide, 14-stage OoO processor
– L1-I: 2-way, 64B line, 32KB, 1-cycle latency
– L1-D: 2-way, 64B line, 32KB, 2-cycle latency
Configurations
– 2D-Base: L2$ 8-way, 64B line, 1MB, 6-cycle; DRAM: 8B-wide, 12.8 GB/s bus, 350-cycle
– 2D-GHB: 2D-Base + Global History Buffer prefetcher
– 2D-VLS: 2D-Base + Virtual Line Scheme w/o software control
– 3D-Base: same L2$ as 2D-Base; DRAM: 64B-wide, 250-cycle
– 3D-GHB: 3D-Base + GHB
– SMART-3D: L2$ 8-way, 4KB line, 1MB, 7-cycle; DRAM: 4KB-wide, 250-cycle
Single Core Results
[Figure: speedup over 2D-Base for 429.mcf, 462.libquantum, 471.omnetpp, 473.astar, 483.xalancbmk, 410.bwaves, 433.milc, 436.cactusADM, 437.leslie3d, 450.soplex, 459.GemsFDTD, 482.sphinx3, and the geomean (MI). Geomean speedups: 2D-GHB 1.40×, 2D-VLS 0.82×, 3D-Base 1.25×, 3D-GHB 1.69×, SMART-3D 2.14×, Perfect L2 2.83×. Benchmarks with bad spatial locality are called out.]
– SMART-3D looks outstanding despite its additional access latency vs. 3D-Base
– The 2D Virtual Line Scheme performs poorly because of TEE penalties, demonstrating that a conventional architecture cannot utilize a 4KB line size
– Drawbacks of SMART-3D: increased conflict misses; cactusADM and astar had increased MPKI (see the earlier charts). A 3D-DRAM-aware compiler fixes this.
Performance & Area
– SMART-3D requires more area: an MSHR holding eight 4KB lines as 64 per-subbank 64B entries (approx. 32KB) is 64× larger than the baseline's
– 1MB L2s: a bigger L2 is just not worth it
– The next slide suggests a 1MB L2 is probably the performance sweet spot for SMART-3D
Cache Size Sensitivity
– Speedup of SMART-3D with an 8-way L2 from 128KB to 2MB, using the latencies given in Table 3: 6 cycles for 128KB, 256KB and 512KB; 7 for 1MB; 8 for 2MB
Dual-Core Results
[Figure: speedup over the 2D-Base (2MB) baseline across SPEC2006 workload pairs. Geomean speedups: 3D-GHB (4MB) 1.96×, SMART-3D (1MB) 2.31×, SMART-3D (2MB) 2.40×, Perfect L2 3.12×.]
4-Core System – Multi-Program
– SMART-3D's performance dominance continues; cactusADM's poor spatial locality degrades performance
Multi-Threaded on Multi-Core
[Figure: speedup of 2D-Base (nMB), 3D-Base (nMB), SMART-3D (nMB) and Perfect L2 on 2-, 4- and 8-core processors for the SPLASH-2 benchmarks barnes, cholesky, fft, fmm, lu, ocean, radiosity, radix, raytrace, volrend, water-n2, water-sp, and their geomean.]
– SPLASH-2 benchmarks; L2 scaling across 2-, 4- and 8-core processors
– fmm and water-n2 don't show the same kind of improvement: they are compute intensive
Multi-Socket
[Figure: speedup of 2D-Base (1MB each), 3D-Base (1MB each), SMART-3D (1MB each) and Perfect L2.]
– Models a 2-socket, single-core-per-socket system connected by an off-chip bus
– Assumes a write-through L1; MESI protocol between L2s on different sockets
– No compiler optimizations; raytrace performed poorly because of L2 misses and few sharers per page
Energy Consumption
• Less DRAM lookup energy
• More L2 R/W energy
• More L2 fill/write-back energy
• More bus fill/write-back energy
Energy & Relative Traffic
Dynamic Energy Consumption
Energy Result
[Figure: normalized energy (0 to 1) for 462.libquantum on 3D-Base vs. SMART-3D, broken down into L2 read, L2 write, L2 fill, L2 write-back, TSVs for the DRAM bus, wires on a DRAM layer, and DRAM array lookup.]
Conclusions
• Stacking memory alone is not enough
• Larger fetches
• More energy efficient
• 2× speedup
Micron HMC
• Uses TSVs in an off-chip Hybrid Memory Cube; increased bandwidth is still needed
– 15× DDR3
• Lower energy
– 70% less energy
• Memory footprint
– 90% reduction vs. RDIMMs
References
• http://www.micron.com/products/hybrid-memory-cube
• Loh, Gabriel H., "3D-Stacked Memory Architectures for Multi-Core Processors", International Symposium on Computer Architecture, 2008.
• Loi, G. L., Agrawal, B., Srivastava, N., Lin, S.-C., Sherwood, T., and Banerjee, K., "A Thermally-Aware Performance Analysis of Vertically Integrated (3-D) Processor-Memory Hierarchy", DAC, 2006.
• http://www.dhwoo.net/research