AMRM: Project Technical Approach
Rajesh Gupta
Project Kickoff Meeting, November 5, 1998, Washington DC

A Technology and Architectural View of Adaptation

Outline
• Technology trends driving this project
  – Changing ground-rules in high-performance system design
  – Rethinking circuits and microelectronic system design
• Rethinking architectures
  – The opportunity of application-adaptive architectures
  – Adaptation challenges
• Adaptation for memory hierarchy
  – Why memory hierarchy?
  – Adaptation space and possible gains
• Summary
Technology Evolution
[Chart: wire delay (ns/cm) vs. year, 1989-2007]
Evolutionary growth, but its effects are subtle and powerful!
Industry continues to outpace NTRS projections on technology scaling and IC density.
[Chart: feature size (nm) vs. year of shipment (1997-2012) — NTRS-94 vs. NTRS-97 projections, 250 nm down to 70 nm]
Average interconnect delay is greater than the gate delays!
• Reduced marginal cost of logic and signal regeneration needs make it possible to include logic in inter-block interconnect.

Consider Interconnect
[Chart: interconnect length (µm) vs. feature size (nm) — average interconnect length (scales with pitch) vs. critical length; the cross-over region (regions I, II, III) separates the static-interconnect regime from the dynamic-interconnect regime]
Rethinking Circuits When Interconnect Dominates
• DEVICE: Choose better interconnect
  – Copper, low-temperature interconnect
• CAD: Choose better interconnect topology, sizes
  – Minimize path from driver gate to each receiver gate
    » e.g., A-tree algorithm yields about 12% reduction in delay
  – Select wire sizes to minimize net delays
    » e.g., up to 35% reduction in delay by optimal sizing algorithms
• CKT: Use more signal repeaters in block-level designs (see the delay sketch below)
  – longest interconnect = 2000 µm for a 350 nm process
• u-ARCH: A storage element no longer defines a clock boundary
  – Multiple storage elements in a single clock
  – Multiple state transitions in a clock period
  – Storage-controlled routing
  – Reduced marginal cost of logic
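Why repeaters help: the delay of an unrepeated RC wire grows quadratically with length, while a repeated wire grows roughly linearly. A minimal first-order sketch, with assumed (illustrative) per-cm parasitics and driver parameters rather than values from the presentation:

    #include <stdio.h>

    /* Illustrative parasitics (assumed values, not from the slides). */
    #define R_WIRE 1000.0   /* wire resistance, ohm/cm         */
    #define C_WIRE 2e-12    /* wire capacitance, F/cm          */
    #define R_DRV  1000.0   /* repeater output resistance, ohm */
    #define C_IN   10e-15   /* repeater input capacitance, F   */

    /* Elmore delay of a wire of length len_cm split into k repeated segments:
       k * (Rdrv*(Cwire*seg + Cin) + 0.5*Rwire*Cwire*seg^2), seg = len_cm/k. */
    static double repeated_delay(double len_cm, int k)
    {
        double seg = len_cm / k;
        return k * (R_DRV * (C_WIRE * seg + C_IN)
                    + 0.5 * R_WIRE * C_WIRE * seg * seg);
    }

    int main(void)
    {
        for (int k = 1; k <= 8; k *= 2)   /* a 2 cm global wire */
            printf("%d segment(s): %.0f ps\n", k, repeated_delay(2.0, k) * 1e12);
        return 0;
    }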
Implications: Circuit Blocks
• Frequent use of signal repeaters in block-level designs
  – longest interconnect = 2000 µm for a 0.3 µm process
• A storage element no longer (always) defines a clock boundary
  – storage delay (= 1.5x switching delay)
  – multiple storage elements in a single clock
  – multiple state transitions in a clock period
  – storage-controlled routing
• Circuit block designs that work independently of data latencies
  – asynchronous blocks
• Heterogeneous clocking interfaces
  – pausible clocking [Yun, ICCD96]
  – mixed synchronous, asynchronous circuit blocks.
Implications: Architectures
• Architectures to exploit interconnect delays
  – pipeline interconnect delays [recall Cray-2]
    » cycle time = max delay - min delay
  – use interconnect delay as the minimum delay
  – need P&R estimates early in the design
• Algorithms that use interconnect latencies
  – interconnect as functional units
  – functional unit schedules are based on a measure of spatial distances
• Increase local decision making
  – multiple state transitions in a clock period
  – storage-controlled routing
  – re-programmable blocks in “custom layouts”
Opportunity: Application-Adaptive Architectures
• Exploit architectural “low-hanging” fruit
  – performance variation across applications (10-100X)
  – performance variation across data sets (10X)
• Use interconnect and data-path reconfiguration to
  – increase performance
  – combat performance fragility, and
  – improve fault tolerance
• Configurable hardware is used to improve utilization of performance-critical resources
  – instead of using configurable hardware to build additional resources
  – design goal is to achieve peak performance across applications
  – configurable hardware leveraged in efficient utilization of performance-critical resources
Architectural Adaptation
• Each of the following elements can benefit from increased adaptability (above and beyond CPU programming)
  – CPU
  – Memory hierarchy: eliminate false sharing
  – Memory system: virtual memory layout based on cache miss data
  – IO: disk layout based on access pattern
  – Network interface: scheduling to reduce end-to-end latency
• Adaptability used to build
  – programmable engines in IO, memory controllers, cache controllers, network devices
  – configurable data-paths and logic in any part of the system
  – configurable queueing in scheduling for interconnect, devices, memory
  – smart interfaces for information flow from applications to hardware
  – performance monitoring and coordinated resource management...
Intelligent interfaces, information formats, mechanisms and policies.
Adaptation Challenges
• Is application-driven adaptation viable from a technology and cost point of view?
• How to structure adaptability
  – to maximize the performance benefits
  – provide protection, multitasking and a reasonable programming environment
  – enable easy exploitation of adaptability through automatic or semi-automatic means.
• We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation.
[Chart: Alpha 21164 execution time components — computation, L1/L2/L3 cache stalls and other stalls as a percentage of execution time for the TPC-C, SPECint95 and SPECfp95 benchmarks]
Why Cache Memory?
4-year technological scaling
• CPU performance increases by 47% per year
• DRAM performance increases by 7% per year
• Assume the Alpha is scaled using this scaling and
  – Organization remains 8KB/96KB/4MB/mem
  – Benchmark requirements stay the same
• Expect something similar if both L2/L3 cache size and benchmark size increase
[Chart: 21164 scaled to 0.15 micron — computation, L1/L2/L3 cache stalls and other stalls as a percentage of the original execution time for TPC-C, SPECint95 and SPECfp95]
Impact of Memory Stalls
• A statically scheduled processor with a blocking cache stalls, on average, for
  – 15% of the time in integer benchmarks
  – 43% of the time in f.p. benchmarks
  – 70% of the time in the transaction benchmark
• Possible performance improvements due to an improved memory hierarchy, without technology scaling (see the bound sketch below):
  – 1.17x,
  – 1.89x, and
  – 3.33x
• Possible improvements with technology scaling
  – 2.4x, 7.5x, and 20x
[Chart: memory stall impact for cg.S, fft, lu, spark, sparse-400, ocean-con and ocean-non under the optimized, unlimited, perfect and unlimited-64byte configurations]
Opportunities for Adaptivity in Caches
• Cache organization
• Cache performance “assist” mechanisms
• Hierarchy organization
• Memory organization (DRAM, etc.)
• Data layout and address mapping
• Virtual memory
• Compiler assist
Opportunities - Cont’d
• Cache organization: adapt what?
  – Size: NO
  – Associativity: NO
  – Line size: MAYBE
  – Write policy: YES (fetch, allocate, w-back/thru)
  – Mapping function: MAYBE
• Organization, clock rate optimized together
Opportunities - Cont’d
• Cache “assist”: prefetch, write buffer, victim cache, etc. between different levels
  – due to delay/size constraints, all of the above cannot be implemented together
  – improvement as f(size) may not be at max_size
• Adapt what?
  – which mechanism(s) to use, algorithms
  – mechanism “parameters”: size, lookahead, etc.
Opportunities - Cont’d
• Hierarchy organization:
  – Where are cache assist mechanisms applied?
    » Between L1 and L2
    » Between L1 and memory
    » Between L2 and memory...
  – What are the datapaths like?
    » Is prefetch, victim cache, write buffer data written into a next-level cache?
    » How much parallelism is possible in the hierarchy?
Opportunities - Cont’d
• Memory organization
  – Cached DRAM?
    » yes, but very limited configurations
  – Interleave change?
    » Hard to accomplish dynamically
  – Tagged memory
    » Keep state for adaptivity
Opportunities - Cont’d
• Data layout and address mapping
  – In theory, something can be done, but
    » it would require time-consuming data re-arrangement
  – MP case is even worse
  – Adaptive address mapping or hashing
    » based on what?
Opportunities - Cont’d
• Compiler assist can
  – Select initial hardware configuration
  – Pass hints on to hardware
  – Generate code to collect run-time info and adapt during execution
  – Adapt configuration after being “called” at certain intervals during execution
  – Re-optimize code at run-time
Opportunities - Cont’d
• Virtual memory can adapt
  – Page size?
  – Mapping?
  – Page prefetching/read ahead
  – Write buffer (file cache)
  – The above under multiprogramming?
Applying Adaptivity
• What drives adaptivity?
  – Performance impact, overall and/or relative
    » “Effectiveness”, e.g. miss rate
    » Processor stall introduced
    » Program characteristics
• When to perform adaptive action?
  – Run time: use feedback from hardware
  – Compile time: insert code, set up hardware
Where to Implement Adaptivity?
• In software: compiler and/or OS
  – (Static) knowledge of program behavior
  – Factored into optimization and scheduling
  – Extra code, overhead
  – Lack of dynamic run-time information
  – Rate of adaptivity
  – Requires recompilation, OS changes
Where to Implement? - Cont’d
• Hardware
  – dynamic information available
  – fast decision mechanism possible
  – transparent to software (thus safe)
  – delay, clock rate limit algorithm complexity
  – difficult to maintain long-term trends
  – little knowledge of program behavior
Where to Implement - Cont’d
• Hardware/software
  – Software can set coarse hardware parameters
  – Hardware can supply software dynamic info
  – Perhaps more complex algorithms can be used
  – Software modification required
  – Communication mechanism required
Current Investigation
• L1 cache assist
  – See wide variability in assist mechanisms’ effectiveness
    » between individual programs
    » within a program as a function of time
  – Propose a hardware mechanism to select between assist types and allocate buffer space
  – Give the compiler an opportunity to set parameters
Mechanisms Used (L1 to L2)
• Prefetching
  – Stream buffers
  – Stride-directed, based on address alone
  – Miss stride: prefetch the same address using the number of intervening misses as lookahead
  – Pointer stride
• Victim cache
• Write buffer
Mechanisms Used - Cont’d
• A mechanism can be used by itself
  – Which is most effective?
• All can be used at once
• Buffer space size and organization fixed
• No adaptivity involved in current results
• Observe time-domain behavior
Configurations
• 32KB L1 data cache, 32B lines, direct-mapped
• 0.5MB L2 cache, 64B lines, direct-mapped
• 8-line write buffer
• Latencies:
  – 1-cycle L1, 8-cycle L2, 60-cycle memory
  – 1-cycle prefetch, write buffer, victim cache
• All 3 mechanisms at once (parameters captured in the sketch below)
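For reference, the baseline parameters above can be written down as a plain parameter block; the struct and field names below are purely illustrative, not an actual simulator interface:

    /* Baseline memory-hierarchy parameters from this slide (names are illustrative). */
    struct cache_level { int size_bytes, line_bytes, assoc, latency_cycles; };

    struct sim_config {
        struct cache_level l1, l2;
        int write_buffer_lines;
        int memory_latency_cycles;
        int assist_latency_cycles;        /* prefetch, write buffer, victim cache */
    };

    static const struct sim_config baseline = {
        .l1 = { 32 * 1024,  32, 1, 1 },   /* 32KB, 32B lines, direct-mapped, 1 cycle  */
        .l2 = { 512 * 1024, 64, 1, 8 },   /* 0.5MB, 64B lines, direct-mapped, 8 cycles */
        .write_buffer_lines    = 8,
        .memory_latency_cycles = 60,
        .assist_latency_cycles = 1,
    };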
[Chart: Sparse-400 — SB, VC and MSB miss elimination ratio over time (10M cycles/point)]
[Chart: Espresso — SB, VC and MSB miss elimination ratio over time (1M cycles/point)]
[Chart: Water-n — SB, VC and MSB miss elimination ratio over time (1M cycles/point)]
[Chart: FMM — SB, VC and MSB miss elimination ratio over time (10M cycles/point)]
(SB = stream buffer, VC = victim cache, MSB = miss stride buffer)
Observed Behavior
• Programs exhibit a different effect from each mechanism
  – none is a consistent winner
• Within a program, the same holds in the time domain between mechanisms
• Both of the above facts indicate a likely improvement from adaptivity
  – Select a better one among the mechanisms
• Even more can be expected from adaptively re-allocating from the combined buffer pool
  – To reduce stall time
  – To reduce the number of misses
Possible Adaptive Mechanisms
• Hardware:
  – a common pool of (small) n-word buffers
  – a set of possible policies, a subset of:
    » Stride-directed prefetch
    » PC-based prefetch
    » History-based prefetch
    » Victim cache
    » Write buffer
Adaptive Hardware - Cont’d
• Performance monitors for each type/buffer
  – misses, stall time on hit, thresholds
• Dynamic buffer allocator among mechanisms
• Allocation and monitoring policy:
  – Predict future behavior from observed past
  – Observe in time interval T, set for next T
  – Save performance trends in next-level tags (<8 bits)
Adaptive Hardware - Cont’d
• Adapt the following (see the allocation sketch below)
  – Number of buffers per mechanism
    » May also include control, e.g. prediction tables
  – Prefetch lookahead (buffer depth)
    » Increase when buffers fill up and are still stalling
  – Adaptivity interval
    » Increase when every ...
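A minimal sketch of the interval-based policy above (observe for an interval, then hand out the common buffer pool for the next one). The mechanism names and the proportional-allocation heuristic are illustrative assumptions, not the project's actual algorithm:

    #include <stdio.h>

    enum { SB, VC, WB, N_MECH };      /* stream buffer, victim cache, write buffer */
    #define TOTAL_BUFFERS 16          /* common pool of small n-word buffers */

    /* Counters gathered by the performance monitors during interval T. */
    struct monitor { unsigned misses_eliminated[N_MECH]; };

    /* At the end of each interval, redistribute the pool in proportion to how
       many misses each mechanism eliminated in the interval just observed. */
    static void reallocate(const struct monitor *m, int alloc[N_MECH])
    {
        unsigned total = 0;
        for (int i = 0; i < N_MECH; i++) total += m->misses_eliminated[i];

        int given = 0;
        for (int i = 0; i < N_MECH; i++) {
            alloc[i] = total ? (int)((long)TOTAL_BUFFERS * m->misses_eliminated[i] / total)
                             : TOTAL_BUFFERS / N_MECH;   /* no data: split evenly */
            given += alloc[i];
        }
        alloc[SB] += TOTAL_BUFFERS - given;   /* rounding remainder to one mechanism */
    }

    int main(void)
    {
        struct monitor m = { { 120, 40, 20 } };          /* example interval counts */
        int alloc[N_MECH];
        reallocate(&m, alloc);
        printf("next interval: SB=%d VC=%d WB=%d buffers\n", alloc[SB], alloc[VC], alloc[WB]);
        return 0;
    }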
Adaptivity via compiler
• Give software control over configuration setting
• Provide feedback via the same parameters as used by hardware: stall time, miss rate, etc.
• Have the compiler (see the sketch below)
  – select program points to change configuration
  – set parameters based on hardware feedback
  – use compile-time knowledge as well
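A sketch of what compiler-inserted control at a selected program point might look like. The amrm_set_assist() and amrm_read_counters() calls are hypothetical interfaces, stubbed out here so the example is self-contained; they are not the project's actual API:

    #include <stdio.h>

    /* Hypothetical runtime interface to the adaptive cache-assist hardware. */
    struct amrm_stats { double miss_rate, stall_fraction; };

    static void amrm_set_assist(int stream_bufs, int victim_lines, int lookahead)
    {
        printf("config: %d stream buffers, %d victim lines, lookahead %d\n",
               stream_bufs, victim_lines, lookahead);   /* stand-in for a hardware write */
    }

    static struct amrm_stats amrm_read_counters(void)
    {
        return (struct amrm_stats){ .miss_rate = 0.08, .stall_fraction = 0.45 };
    }

    /* Compiler-selected program point: a stream-dominated loop nest. */
    static void compute_phase(double *a, const double *b, int n)
    {
        amrm_set_assist(8, 2, 2);               /* favor stream buffers before the loop */
        for (int i = 0; i < n; i++)
            a[i] += b[i];

        struct amrm_stats s = amrm_read_counters();
        if (s.stall_fraction > 0.4)             /* feedback: deepen lookahead next time */
            amrm_set_assist(8, 2, 4);
    }

    int main(void)
    {
        double a[4] = {0}, b[4] = {1, 2, 3, 4};
        compute_phase(a, b, 4);
        return 0;
    }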
Further opportunities to adapt
• L2 cache organization
  – variable-size line
• L2 non-sequential prefetch
• L3 organization and use (for deep sub-micron)
• In-memory adaptivity assist (DRAM tags)
• Multiple-processor scenarios
  – Even longer latency
  – Coherence, hardware or software
  – Synchronization
  – Prefetch under and beyond the above
    » Avoid coherence if possible
    » Prefetch past synchronization
  – Assist adaptive scheduling
The AMRM Project = Compiler, Architecture and VLSI Research for AA Architectures
[Diagram: project research thrusts —
• Application analysis; identification of AA mechanisms; semantic retention strategies
• Compiler instrumentation for runtime; memory hierarchy analysis; reference structure identification; compiler control
• Partitioning, synthesis, mapping algorithms for efficient runtime adaptation; machine definition
• Efficient reprogrammable circuit structures for rapid reconfiguration; prototype hardware platform
• Fault detection and containment; interface to mapping and synthesis hardware; continuous validation strategies; protection tests]
Summary
• Semiconductor advances are bringing powerful changes to how systems are architected and built:
  – challenges underlying assumptions on synchronous digital hardware designs
    » interconnect (local and global) dominates architectural choices, local decision making is free;
    » in particular, it can be made adaptable using CAD tools.
• The AMRM Project:
  – achieve peak performance by adapting machine capabilities to application and data characteristics.
  – Initial focus on the memory hierarchy promises to yield high performance gains due to the worsening gap between memory and CPU speeds and increasing data sets.
Appendix: Assists Being Explored

Victim Caching
• VC useful in case of conflict misses and long sequential reference streams. Prevents a sharp fall-off in performance when the WSS is slightly larger than L1.
• Estimate WSS from the structure of the RM, such as the size of the strongly connected components (SCCs)
• MORPH data-path structure supports addition of a parameterized victim/stream cache. The control logic is synthesized using CAD tools.
• Victim caches provide 50X the marginal improvement in hit rate over the primary cache.
[Diagram: a small (1-5 line) fully-associative cache beside the direct-mapped L1/L2 — victim line and new line paths between the L1 tags and the buffer, with MRU/tag/valid fields per line — configured as a victim/stream cache or stream buffer]
Victim Cache
• Mainly used to eliminate conflict misses
• Prediction: the memory address of a cache line that is replaced is likely to be accessed again in the near future
• Scenario for the prediction to be effective: false sharing, ugly address mapping
• Architecture implementation: use an on-chip buffer to store the contents of recently replaced cache lines (sketched below)
• Drawbacks
  – Ugly mapping can be rectified by a cache-aware compiler
  – Given the small size of the victim cache, the probability of memory address reuse within a short period is very low.
  – Experiments show the victim cache is not effective across the board for DI apps.
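A minimal software model of that implementation (sizes and replacement policy are illustrative; a real victim cache holds the evicted line's data as well as its tag):

    #define VC_LINES 4                    /* small fully-associative victim cache */

    struct victim_cache {
        unsigned long tag[VC_LINES];      /* line addresses of recently evicted lines */
        int           valid[VC_LINES];
        int           next;               /* FIFO replacement pointer */
    };

    /* On an L1 miss, probe the victim cache; a hit means the line was evicted
       recently and can be swapped back into L1 without going to the next level. */
    static int vc_lookup(struct victim_cache *vc, unsigned long line_addr)
    {
        for (int i = 0; i < VC_LINES; i++)
            if (vc->valid[i] && vc->tag[i] == line_addr)
                return 1;
        return 0;
    }

    /* On an L1 eviction, record the victim line in the victim cache. */
    static void vc_insert(struct victim_cache *vc, unsigned long victim_line_addr)
    {
        vc->tag[vc->next]   = victim_line_addr;
        vc->valid[vc->next] = 1;
        vc->next = (vc->next + 1) % VC_LINES;
    }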
[Diagram: CPU and L1 (tag + data) with the write buffer/victim cache (WB/VC) between L1 and the lower memory hierarchy]
Stream Buffer
• Mainly used to eliminate compulsory/capacity misses
• Prediction: if a memory address misses, the consecutive address is likely to miss in the near future
• Scenario for the prediction to be useful: stream access
• Architecture implementation (sketched below): on an address miss, prefetch the consecutive address into an on-chip buffer. When there is a hit in the stream buffer, prefetch the consecutive address of the hit address.
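A minimal model of that policy, tracking prefetched line addresses in a small buffer (depth and line size are illustrative):

    #define SB_DEPTH   4
    #define LINE_BYTES 32

    struct stream_buffer {
        unsigned long line[SB_DEPTH];     /* addresses of lines prefetched so far */
        int           count;
    };

    /* Probe the stream buffer with the line address of an access.  On a hit,
       prefetch the next sequential line into the slot, as described above.
       On a miss the caller would (re)start the buffer at line_addr + 1. */
    static int sb_access(struct stream_buffer *sb, unsigned long line_addr,
                         void (*prefetch)(unsigned long byte_addr))
    {
        for (int i = 0; i < sb->count; i++) {
            if (sb->line[i] == line_addr) {
                sb->line[i] = line_addr + 1;              /* next sequential line */
                prefetch((line_addr + 1) * LINE_BYTES);   /* fill the buffer ahead */
                return 1;                                 /* hit */
            }
        }
        return 0;                                         /* miss */
    }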
Stream Cache
• Modification of the stream buffer
• Use a separate cache to store stream data to prevent cache pollution
• When there is a hit in the stream buffer, the hit address is sent to the stream cache instead of the L1 cache
Stride Prefetch
• Mainly used to eliminate compulsory/capacity misses
• Prediction: if a memory address misses, an address offset by some distance from the missed address is likely to miss in the near future
• Scenario for the prediction to be useful: stride access
• Architecture implementation (sketched below): on an address miss, prefetch the address offset by a distance from the missed address. When there is a hit in the buffer, also prefetch the address offset by that distance from the hit address.
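A minimal sketch of one way to obtain that distance: infer the stride from successive miss addresses and prefetch ahead once it repeats. This single global detector is an illustrative simplification; practical designs track several streams:

    /* Stride detection from consecutive miss (line) addresses. */
    struct stride_state { unsigned long last_miss; long stride; int primed; };

    static void on_miss(struct stride_state *s, unsigned long line_addr,
                        void (*prefetch)(unsigned long line_addr_ahead))
    {
        if (s->primed) {
            long d = (long)(line_addr - s->last_miss);
            if (d != 0 && d == s->stride)
                prefetch(line_addr + d);   /* stride confirmed: fetch one ahead */
            s->stride = d;                 /* remember the latest observed stride */
        }
        s->last_miss = line_addr;
        s->primed = 1;
    }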
Miss Stride Buffer
• Mainly used to eliminate conflict misses
• Prediction: if a memory address misses again after N other misses, the memory address is likely to miss again after another N misses
• Scenario for the prediction to be useful
  – multiple loop nests
  – some variables or array elements are reused across iterations
Advantage over Victim Cache
• Eliminates conflict misses that even a cache-aware compiler cannot eliminate
  – Ugly mappings are fewer and can be rectified
  – Many more conflicts are random. From a probability perspective, a given memory address will conflict with other addresses after some time, but we cannot know at compile time which address it will conflict with.
• There can be a much longer period before the conflicting address is reused
  – Victim cache’s small size
Architecture Implementation
• Memory history buffer (sketched below)
  – FIFO buffer to record recently missed memory addresses
  – Predict only when there is a hit in the buffer
  – Miss stride can be calculated from the relative position of consecutive misses for the same address
  – The size of the buffer determines the number of predictions
• Prefetch buffer (on-chip)
  – Stores the contents of prefetched memory addresses
  – The size of the buffer determines how much variation in the miss stride can be tolerated
• Prefetch scheduler
  – Select the right time to prefetch
  – Avoid collisions
• Prefetcher
  – prefetch the contents of the miss address into the on-chip prefetch buffer
Prefetch Scheduler

Pointer Stream Buffer
Appendix: Prefetching Adaptation Results

Prefetching for Latency & BW Management
• Combat latency deterioration
  – optimal prefetching:
    » “memory side pointer chasing”
  – blocking mechanisms
  – fast barrier, broadcast support
  – synchronization support
• Bandwidth management
  – memory (re)organization to suit application characteristics
  – translate and gather hardware
    » “prefetching with compaction”
Adaptation for Latency Tolerance
• Operation
  1. Application sets prefetch parameters (compiler controlled)
  2. Prefetching event generation (runtime controlled)
    » when a new cache block is filled
[Diagram: prefetcher between CPU/L1 and the L2 cache — observes the virtual addr./data and the physical addr., and issues additional addresses for data fills]
if (start <= vAddr && vAddr <= end) {
    if (pAddr & 0x20)
        addr = pAddr - 0x20;
    else
        addr = pAddr + 0x20;
    /* initiate fetch of cache line at addr into L1 */
}
Prefetching Experiments
• Operation
  1. Application sets prefetch parameters (compiler controlled)
    » set lower/upper bounds on memory regions (for memory protection etc.)
    » download pointer extraction function
    » element size
  2. Prefetching event generation (runtime controlled)
    » when a new cache block is filled
• Application view
    generate_event(“pointer to pass matrix element structure”)
    ...
    generate_event(“signal to enable prefetch”)
    < code on which prefetching is applied >
    generate_event(“signal to disable prefetch”)
Adaptation for Bandwidth Reduction
• Prefetching entire row/column
• Pack cache with used data only
[Diagram: program view vs. physical layout — address translation and gather logic sit between the processor/L1 cache and memory; sparse-matrix elements stored as (val, RowPtr, ColPtr) records are translated and gathered so that only the used fields (e.g. val, col) are packed into cache lines]
• No change in program logical data structures
• Partition cache
• Translate data
• Synthesize pointer
(A software view of the gather step is sketched below.)
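In software terms, the gather step amounts to fetching only the fields a loop actually touches from each sparse-matrix element and packing them densely, so every cache line delivered is full of useful data. A sketch under assumed record layouts (not the actual MORPH hardware interface):

    #include <stdlib.h>

    /* Program view: full sparse-matrix element record. */
    struct elem   { double val; int row; int col; int rowPtr; int colPtr; };

    /* What the gather logic hands the cache for a loop that reads only val and col. */
    struct packed { double val; int col; };

    /* Software model of "translate + gather": walk the physical layout and emit a
       dense stream of just the used fields, instead of whole records. */
    static struct packed *gather_used_fields(const struct elem *a, int n)
    {
        struct packed *p = malloc((size_t)n * sizeof *p);
        for (int i = 0; i < n; i++) {
            p[i].val = a[i].val;
            p[i].col = a[i].col;
        }
        return p;    /* caller frees */
    }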
Simulation Results: Latency
• Sparse MM: blocking, prefetching, packing (all based on application data structures)
[Charts: L1 cache miss rate for reads and writes, and data traffic, for the Sparse-MM, SW-Blocked, HW-Pack/Unpack and HW-P/U-WB configurations]
10X reduction in latency using application data structure optimization

Simulation Results: Bandwidth
• Optimization designed to significantly reduce the volume of data traffic
  – efficient fast storage management, packing, prefetch
[Charts: L1 cache miss rate for reads and writes, and data traffic (MB), for the Sparse-MM, SW-Blocked, HW-Pack/Unpack and HW-P/U-WB configurations]
100X reduction in BW using application-specific packing and fetching.