AMRM: Project Technical Approach
Rajesh Gupta
Project Kickoff Meeting, November 5, 1998, Washington DC

A Technology and Architectural View of Adaptation

Outline
• Technology trends driving this project
  – Changing ground-rules in high-performance system design
  – Rethinking circuits and microelectronic system design
• Rethinking architectures
  – The opportunity of application-adaptive architectures
  – Adaptation challenges
• Adaptation for memory hierarchy
  – Why memory hierarchy?
  – Adaptation space and possible gains
• Summary
Technology Evolution
[Chart: wire delay (ns/cm) vs. year, 1989-2007]
Evolutionary growth, but its effects are subtle and powerful!
Industry continues to outpace NTRS projections on technology scaling and IC density.
[Chart: feature size (nm) vs. year of shipment (1997-2012) — NTRS-94 vs. NTRS-97 projections, 250 nm down to 70 nm]
Average interconnect delay is greater than the gate delays!
• Reduced marginal cost of logic and signal regeneration needs make it possible to include logic in inter-block interconnect.

Consider Interconnect
[Chart: interconnect length (µm) vs. feature size (nm) — average interconnect length (scales with pitch) vs. critical length; the cross-over region (regions I, II, III) separates the static-interconnect regime from the dynamic-interconnect regime]
Rethinking Circuits When Interconnect Dominates
• DEVICE: Choose better interconnect
  – Copper, low-temperature interconnect
• CAD: Choose better interconnect topology, sizes
  – Minimize path from driver gate to each receiver gate
    » e.g., A-tree algorithm yields about 12% reduction in delay
  – Select wire sizes to minimize net delays
    » e.g., up to 35% reduction in delay by optimal sizing algorithms
• CKT: Use more signal repeaters in block-level designs (see the delay sketch below)
  – longest interconnect = 2000 µm for a 350 nm process
• u-ARCH: A storage element no longer defines a clock boundary
  – Multiple storage elements in a single clock
  – Multiple state transitions in a clock period
  – Storage-controlled routing
  – Reduced marginal cost of logic
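Why repeaters help: the delay of an unrepeated RC wire grows quadratically with length, while a repeated wire grows roughly linearly. A minimal first-order sketch, with assumed (illustrative) per-cm parasitics and driver parameters rather than values from the presentation:

    #include <stdio.h>

    /* Illustrative parasitics (assumed values, not from the slides). */
    #define R_WIRE 1000.0   /* wire resistance, ohm/cm         */
    #define C_WIRE 2e-12    /* wire capacitance, F/cm          */
    #define R_DRV  1000.0   /* repeater output resistance, ohm */
    #define C_IN   10e-15   /* repeater input capacitance, F   */

    /* Elmore delay of a wire of length len_cm split into k repeated segments:
       k * (Rdrv*(Cwire*seg + Cin) + 0.5*Rwire*Cwire*seg^2), seg = len_cm/k. */
    static double repeated_delay(double len_cm, int k)
    {
        double seg = len_cm / k;
        return k * (R_DRV * (C_WIRE * seg + C_IN)
                    + 0.5 * R_WIRE * C_WIRE * seg * seg);
    }

    int main(void)
    {
        for (int k = 1; k <= 8; k *= 2)   /* a 2 cm global wire */
            printf("%d segment(s): %.0f ps\n", k, repeated_delay(2.0, k) * 1e12);
        return 0;
    }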
Implications: Circuit Blocks
• Frequent use of signal repeaters in block-level designs
  – longest interconnect = 2000 µm for a 0.3 µm process
• A storage element no longer (always) defines a clock boundary
  – storage delay (= 1.5x switching delay)
  – multiple storage elements in a single clock
  – multiple state transitions in a clock period
  – storage-controlled routing
• Circuit block designs that work independently of data latencies
  – asynchronous blocks
• Heterogeneous clocking interfaces
  – pausible clocking [Yun, ICCD96]
  – mixed synchronous, asynchronous circuit blocks.
Implications: Architectures
• Architectures to exploit interconnect delays
  – pipeline interconnect delays [recall Cray-2]
    » cycle time = max delay - min delay
  – use interconnect delay as the minimum delay
  – need P&R estimates early in the design
• Algorithms that use interconnect latencies
  – interconnect as functional units
  – functional unit schedules are based on a measure of spatial distances
• Increase local decision making
  – multiple state transitions in a clock period
  – storage-controlled routing
  – re-programmable blocks in “custom layouts”
Opportunity: Application-Adaptive Architectures
• Exploit architectural “low-hanging” fruit
  – performance variation across applications (10-100X)
  – performance variation across data sets (10X)
• Use interconnect and data-path reconfiguration to
  – increase performance
  – combat performance fragility, and
  – improve fault tolerance
• Configurable hardware is used to improve utilization of performance-critical resources
  – instead of using configurable hardware to build additional resources
  – design goal is to achieve peak performance across applications
  – configurable hardware leveraged in efficient utilization of performance-critical resources
Architectural Adaptation
• Each of the following elements can benefit from increased adaptability (above and beyond CPU programming)
  – CPU
  – Memory hierarchy: eliminate false sharing
  – Memory system: virtual memory layout based on cache miss data
  – IO: disk layout based on access pattern
  – Network interface: scheduling to reduce end-to-end latency
• Adaptability used to build
  – programmable engines in IO, memory controllers, cache controllers, network devices
  – configurable data-paths and logic in any part of the system
  – configurable queueing in scheduling for interconnect, devices, memory
  – smart interfaces for information flow from applications to hardware
  – performance monitoring and coordinated resource management...
Intelligent interfaces, information formats, mechanisms and policies.
Adaptation Challenges
• Is application-driven adaptation viable from a technology and cost point of view?
• How to structure adaptability
  – to maximize the performance benefits
  – provide protection, multitasking and a reasonable programming environment
  – enable easy exploitation of adaptability through automatic or semi-automatic means.
• We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation.
[Chart: Alpha 21164 execution time components — computation, L1/L2/L3 cache stalls and other stalls as a percentage of execution time for the TPC-C, SPECint95 and SPECfp95 benchmarks]
Why Cache Memory?
4-year technological scaling
• CPU performance increases by 47% per year
• DRAM performance increases by 7% per year
• Assume the Alpha is scaled using this scaling and
  – Organization remains 8KB/96KB/4MB/mem
  – Benchmark requirements stay the same
• Expect something similar if both L2/L3 cache size and benchmark size increase
[Chart: 21164 scaled to 0.15 micron — computation, L1/L2/L3 cache stalls and other stalls as a percentage of the original execution time for TPC-C, SPECint95 and SPECfp95]
Impact of Memory Stalls
• A statically scheduled processor with a blocking cache stalls, on average, for
  – 15% of the time in integer benchmarks
  – 43% of the time in f.p. benchmarks
  – 70% of the time in the transaction benchmark
• Possible performance improvements due to an improved memory hierarchy, without technology scaling (see the bound sketch below):
  – 1.17x,
  – 1.89x, and
  – 3.33x
• Possible improvements with technology scaling
  – 2.4x, 7.5x, and 20x
[Chart: memory stall impact for cg.S, fft, lu, spark, sparse-400, ocean-con and ocean-non under the optimized, unlimited, perfect and unlimited-64byte configurations]
Opportunities for Adaptivity in Caches
• Cache organization
• Cache performance “assist” mechanisms
• Hierarchy organization
• Memory organization (DRAM, etc.)
• Data layout and address mapping
• Virtual memory
• Compiler assist
Opportunities - Cont’d
• Cache organization: adapt what?
  – Size: NO
  – Associativity: NO
  – Line size: MAYBE
  – Write policy: YES (fetch, allocate, w-back/thru)
  – Mapping function: MAYBE
• Organization, clock rate optimized together
Opportunities - Cont’d
• Cache “assist”: prefetch, write buffer, victim cache, etc. between different levels
  – due to delay/size constraints, all of the above cannot be implemented together
  – improvement as f(size) may not be at max_size
• Adapt what?
  – which mechanism(s) to use, algorithms
  – mechanism “parameters”: size, lookahead, etc.
Opportunities - Cont’d
• Hierarchy organization:
  – Where are cache assist mechanisms applied?
    » Between L1 and L2
    » Between L1 and memory
    » Between L2 and memory...
  – What are the datapaths like?
    » Is prefetch, victim cache, write buffer data written into a next-level cache?
    » How much parallelism is possible in the hierarchy?
Opportunities - Cont’d
• Memory organization
  – Cached DRAM?
    » yes, but very limited configurations
  – Interleave change?
    » Hard to accomplish dynamically
  – Tagged memory
    » Keep state for adaptivity
Opportunities - Cont’d
• Data layout and address mapping
  – In theory, something can be done, but
    » it would require time-consuming data re-arrangement
  – MP case is even worse
  – Adaptive address mapping or hashing
    » based on what?
Opportunities - Cont’d
• Compiler assist can
  – Select initial hardware configuration
  – Pass hints on to hardware
  – Generate code to collect run-time info and adapt during execution
  – Adapt configuration after being “called” at certain intervals during execution
  – Re-optimize code at run-time
Opportunities - Cont’d
• Virtual memory can adapt
  – Page size?
  – Mapping?
  – Page prefetching/read ahead
  – Write buffer (file cache)
  – The above under multiprogramming?
Applying Adaptivity
• What drives adaptivity?
  – Performance impact, overall and/or relative
    » “Effectiveness”, e.g. miss rate
    » Processor stall introduced
    » Program characteristics
• When to perform adaptive action?
  – Run time: use feedback from hardware
  – Compile time: insert code, set up hardware
Where to Implement Adaptivity?
• In software: compiler and/or OS
  – (Static) knowledge of program behavior
  – Factored into optimization and scheduling
  – Extra code, overhead
  – Lack of dynamic run-time information
  – Rate of adaptivity
  – Requires recompilation, OS changes
Where to Implement? - Cont’d
• Hardware
  – dynamic information available
  – fast decision mechanism possible
  – transparent to software (thus safe)
  – delay, clock rate limit algorithm complexity
  – difficult to maintain long-term trends
  – little knowledge of program behavior
Where to Implement - Cont’d
• Hardware/software
  – Software can set coarse hardware parameters
  – Hardware can supply software dynamic info
  – Perhaps more complex algorithms can be used
  – Software modification required
  – Communication mechanism required
Current Investigation
• L1 cache assist
  – See wide variability in assist mechanisms’ effectiveness
    » between individual programs
    » within a program as a function of time
  – Propose a hardware mechanism to select between assist types and allocate buffer space
  – Give the compiler an opportunity to set parameters
Mechanisms Used (L1 to L2)
• Prefetching
  – Stream buffers
  – Stride-directed, based on address alone
  – Miss stride: prefetch the same address using the number of intervening misses as lookahead
  – Pointer stride
• Victim cache
• Write buffer
Mechanisms Used - Cont’d
• A mechanism can be used by itself
  – Which is most effective?
• All can be used at once
• Buffer space size and organization fixed
• No adaptivity involved in current results
• Observe time-domain behavior
Configurations
• 32KB L1 data cache, 32B lines, direct-mapped
• 0.5MB L2 cache, 64B lines, direct-mapped
• 8-line write buffer
• Latencies:
  – 1-cycle L1, 8-cycle L2, 60-cycle memory
  – 1-cycle prefetch, write buffer, victim cache
• All 3 mechanisms at once (parameters captured in the sketch below)
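For reference, the baseline parameters above can be written down as a plain parameter block; the struct and field names below are purely illustrative, not an actual simulator interface:

    /* Baseline memory-hierarchy parameters from this slide (names are illustrative). */
    struct cache_level { int size_bytes, line_bytes, assoc, latency_cycles; };

    struct sim_config {
        struct cache_level l1, l2;
        int write_buffer_lines;
        int memory_latency_cycles;
        int assist_latency_cycles;        /* prefetch, write buffer, victim cache */
    };

    static const struct sim_config baseline = {
        .l1 = { 32 * 1024,  32, 1, 1 },   /* 32KB, 32B lines, direct-mapped, 1 cycle  */
        .l2 = { 512 * 1024, 64, 1, 8 },   /* 0.5MB, 64B lines, direct-mapped, 8 cycles */
        .write_buffer_lines    = 8,
        .memory_latency_cycles = 60,
        .assist_latency_cycles = 1,
    };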
[Chart: Sparse-400 — SB, VC and MSB miss elimination ratio over time (10M cycles/point)]
[Chart: Espresso — SB, VC and MSB miss elimination ratio over time (1M cycles/point)]
[Chart: Water-n — SB, VC and MSB miss elimination ratio over time (1M cycles/point)]
[Chart: FMM — SB, VC and MSB miss elimination ratio over time (10M cycles/point)]
(SB = stream buffer, VC = victim cache, MSB = miss stride buffer)
Observed Behavior
• Programs exhibit a different effect from each mechanism
  – none is a consistent winner
• Within a program, the same holds in the time domain between mechanisms
• Both of the above facts indicate a likely improvement from adaptivity
  – Select a better one among the mechanisms
• Even more can be expected from adaptively re-allocating from the combined buffer pool
  – To reduce stall time
  – To reduce the number of misses
Possible Adaptive Mechanisms
• Hardware:
  – a common pool of (small) n-word buffers
  – a set of possible policies, a subset of:
    » Stride-directed prefetch
    » PC-based prefetch
    » History-based prefetch
    » Victim cache
    » Write buffer
Adaptive Hardware - Cont’d
• Performance monitors for each type/buffer
  – misses, stall time on hit, thresholds
• Dynamic buffer allocator among mechanisms
• Allocation and monitoring policy:
  – Predict future behavior from observed past
  – Observe in time interval T, set for next T
  – Save performance trends in next-level tags (<8 bits)
Adaptive Hardware - Cont’d
• Adapt the following (see the allocation sketch below)
  – Number of buffers per mechanism
    » May also include control, e.g. prediction tables
  – Prefetch lookahead (buffer depth)
    » Increase when buffers fill up and are still stalling
  – Adaptivity interval
    » Increase when every ...
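A minimal sketch of the interval-based policy above (observe for an interval, then hand out the common buffer pool for the next one). The mechanism names and the proportional-allocation heuristic are illustrative assumptions, not the project's actual algorithm:

    #include <stdio.h>

    enum { SB, VC, WB, N_MECH };      /* stream buffer, victim cache, write buffer */
    #define TOTAL_BUFFERS 16          /* common pool of small n-word buffers */

    /* Counters gathered by the performance monitors during interval T. */
    struct monitor { unsigned misses_eliminated[N_MECH]; };

    /* At the end of each interval, redistribute the pool in proportion to how
       many misses each mechanism eliminated in the interval just observed. */
    static void reallocate(const struct monitor *m, int alloc[N_MECH])
    {
        unsigned total = 0;
        for (int i = 0; i < N_MECH; i++) total += m->misses_eliminated[i];

        int given = 0;
        for (int i = 0; i < N_MECH; i++) {
            alloc[i] = total ? (int)((long)TOTAL_BUFFERS * m->misses_eliminated[i] / total)
                             : TOTAL_BUFFERS / N_MECH;   /* no data: split evenly */
            given += alloc[i];
        }
        alloc[SB] += TOTAL_BUFFERS - given;   /* rounding remainder to one mechanism */
    }

    int main(void)
    {
        struct monitor m = { { 120, 40, 20 } };          /* example interval counts */
        int alloc[N_MECH];
        reallocate(&m, alloc);
        printf("next interval: SB=%d VC=%d WB=%d buffers\n", alloc[SB], alloc[VC], alloc[WB]);
        return 0;
    }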
Adaptivity via compiler
• Give software control over configuration setting
• Provide feedback via the same parameters as used by hardware: stall time, miss rate, etc.
• Have the compiler (see the sketch below)
  – select program points to change configuration
  – set parameters based on hardware feedback
  – use compile-time knowledge as well
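A sketch of what compiler-inserted control at a selected program point might look like. The amrm_set_assist() and amrm_read_counters() calls are hypothetical interfaces, stubbed out here so the example is self-contained; they are not the project's actual API:

    #include <stdio.h>

    /* Hypothetical runtime interface to the adaptive cache-assist hardware. */
    struct amrm_stats { double miss_rate, stall_fraction; };

    static void amrm_set_assist(int stream_bufs, int victim_lines, int lookahead)
    {
        printf("config: %d stream buffers, %d victim lines, lookahead %d\n",
               stream_bufs, victim_lines, lookahead);   /* stand-in for a hardware write */
    }

    static struct amrm_stats amrm_read_counters(void)
    {
        return (struct amrm_stats){ .miss_rate = 0.08, .stall_fraction = 0.45 };
    }

    /* Compiler-selected program point: a stream-dominated loop nest. */
    static void compute_phase(double *a, const double *b, int n)
    {
        amrm_set_assist(8, 2, 2);               /* favor stream buffers before the loop */
        for (int i = 0; i < n; i++)
            a[i] += b[i];

        struct amrm_stats s = amrm_read_counters();
        if (s.stall_fraction > 0.4)             /* feedback: deepen lookahead next time */
            amrm_set_assist(8, 2, 4);
    }

    int main(void)
    {
        double a[4] = {0}, b[4] = {1, 2, 3, 4};
        compute_phase(a, b, 4);
        return 0;
    }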
Further opportunities to adapt
• L2 cache organization
  – variable-size line
• L2 non-sequential prefetch
• L3 organization and use (for deep sub-micron)
• In-memory adaptivity assist (DRAM tags)
• Multiple-processor scenarios
  – Even longer latency
  – Coherence, hardware or software
  – Synchronization
  – Prefetch under and beyond the above
    » Avoid coherence if possible
    » Prefetch past synchronization
  – Assist adaptive scheduling
The AMRM Project = Compiler, Architecture and VLSI Research for AA Architectures
[Diagram: project research thrusts —
• Application analysis; identification of AA mechanisms; semantic retention strategies
• Compiler instrumentation for runtime; memory hierarchy analysis; reference structure identification; compiler control
• Partitioning, synthesis, mapping algorithms for efficient runtime adaptation; machine definition
• Efficient reprogrammable circuit structures for rapid reconfiguration; prototype hardware platform
• Fault detection and containment; interface to mapping and synthesis hardware; continuous validation strategies; protection tests]
Summary
• Semiconductor advances are bringing powerful changes to how systems are architected and built:
  – challenges underlying assumptions on synchronous digital hardware designs
    » interconnect (local and global) dominates architectural choices, local decision making is free;
    » in particular, it can be made adaptable using CAD tools.
• The AMRM Project:
  – achieve peak performance by adapting machine capabilities to application and data characteristics.
  – Initial focus on the memory hierarchy promises to yield high performance gains due to the worsening gap between memory and CPU speeds and increasing data sets.
Appendix: Assists Being Explored

Victim Caching
• VC useful in case of conflict misses and long sequential reference streams. Prevents a sharp fall-off in performance when the WSS is slightly larger than L1.
• Estimate WSS from the structure of the RM, such as the size of the strongly connected components (SCCs)
• MORPH data-path structure supports addition of a parameterized victim/stream cache. The control logic is synthesized using CAD tools.
• Victim caches provide 50X the marginal improvement in hit rate over the primary cache.
[Diagram: a small (1-5 line) fully-associative cache beside the direct-mapped L1/L2 — victim line and new line paths between the L1 tags and the buffer, with MRU/tag/valid fields per line — configured as a victim/stream cache or stream buffer]
Victim Cache
• Mainly used to eliminate conflict misses
• Prediction: the memory address of a cache line that is replaced is likely to be accessed again in the near future
• Scenario for the prediction to be effective: false sharing, ugly address mapping
• Architecture implementation: use an on-chip buffer to store the contents of recently replaced cache lines (sketched below)
• Drawbacks
  – Ugly mapping can be rectified by a cache-aware compiler
  – Given the small size of the victim cache, the probability of memory address reuse within a short period is very low.
  – Experiments show the victim cache is not effective across the board for DI apps.
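A minimal software model of that implementation (sizes and replacement policy are illustrative; a real victim cache holds the evicted line's data as well as its tag):

    #define VC_LINES 4                    /* small fully-associative victim cache */

    struct victim_cache {
        unsigned long tag[VC_LINES];      /* line addresses of recently evicted lines */
        int           valid[VC_LINES];
        int           next;               /* FIFO replacement pointer */
    };

    /* On an L1 miss, probe the victim cache; a hit means the line was evicted
       recently and can be swapped back into L1 without going to the next level. */
    static int vc_lookup(struct victim_cache *vc, unsigned long line_addr)
    {
        for (int i = 0; i < VC_LINES; i++)
            if (vc->valid[i] && vc->tag[i] == line_addr)
                return 1;
        return 0;
    }

    /* On an L1 eviction, record the victim line in the victim cache. */
    static void vc_insert(struct victim_cache *vc, unsigned long victim_line_addr)
    {
        vc->tag[vc->next]   = victim_line_addr;
        vc->valid[vc->next] = 1;
        vc->next = (vc->next + 1) % VC_LINES;
    }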
[Diagram: CPU and L1 (tag + data) with the write buffer/victim cache (WB/VC) between L1 and the lower memory hierarchy]
Stream Buffer
• Mainly used to eliminate compulsory/capacity misses
• Prediction: if a memory address misses, the consecutive address is likely to miss in the near future
• Scenario for the prediction to be useful: stream access
• Architecture implementation (sketched below): on an address miss, prefetch the consecutive address into an on-chip buffer. When there is a hit in the stream buffer, prefetch the consecutive address of the hit address.
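A minimal model of that policy, tracking prefetched line addresses in a small buffer (depth and line size are illustrative):

    #define SB_DEPTH   4
    #define LINE_BYTES 32

    struct stream_buffer {
        unsigned long line[SB_DEPTH];     /* addresses of lines prefetched so far */
        int           count;
    };

    /* Probe the stream buffer with the line address of an access.  On a hit,
       prefetch the next sequential line into the slot, as described above.
       On a miss the caller would (re)start the buffer at line_addr + 1. */
    static int sb_access(struct stream_buffer *sb, unsigned long line_addr,
                         void (*prefetch)(unsigned long byte_addr))
    {
        for (int i = 0; i < sb->count; i++) {
            if (sb->line[i] == line_addr) {
                sb->line[i] = line_addr + 1;              /* next sequential line */
                prefetch((line_addr + 1) * LINE_BYTES);   /* fill the buffer ahead */
                return 1;                                 /* hit */
            }
        }
        return 0;                                         /* miss */
    }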
Stream Cache
• Modification of the stream buffer
• Use a separate cache to store stream data to prevent cache pollution
• When there is a hit in the stream buffer, the hit address is sent to the stream cache instead of the L1 cache
Stride Prefetch
• Mainly used to eliminate compulsory/capacity misses
• Prediction: if a memory address misses, an address offset by some distance from the missed address is likely to miss in the near future
• Scenario for the prediction to be useful: stride access
• Architecture implementation (sketched below): on an address miss, prefetch the address offset by a distance from the missed address. When there is a hit in the buffer, also prefetch the address offset by that distance from the hit address.
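A minimal sketch of one way to obtain that distance: infer the stride from successive miss addresses and prefetch ahead once it repeats. This single global detector is an illustrative simplification; practical designs track several streams:

    /* Stride detection from consecutive miss (line) addresses. */
    struct stride_state { unsigned long last_miss; long stride; int primed; };

    static void on_miss(struct stride_state *s, unsigned long line_addr,
                        void (*prefetch)(unsigned long line_addr_ahead))
    {
        if (s->primed) {
            long d = (long)(line_addr - s->last_miss);
            if (d != 0 && d == s->stride)
                prefetch(line_addr + d);   /* stride confirmed: fetch one ahead */
            s->stride = d;                 /* remember the latest observed stride */
        }
        s->last_miss = line_addr;
        s->primed = 1;
    }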
Miss Stride Buffer
• Mainly used to eliminate conflict misses
• Prediction: if a memory address misses again after N other misses, the memory address is likely to miss again after another N misses
• Scenario for the prediction to be useful
  – multiple loop nests
  – some variables or array elements are reused across iterations
Advantage over Victim Cache
• Eliminates conflict misses that even a cache-aware compiler cannot eliminate
  – Ugly mappings are fewer and can be rectified
  – Many more conflicts are random. From a probability perspective, a given memory address will conflict with other addresses after some time, but we cannot know at compile time which address it will conflict with.
• There can be a much longer period before the conflicting address is reused
  – Victim cache’s small size
Architecture Implementation
• Memory history buffer (sketched below)
  – FIFO buffer to record recently missed memory addresses
  – Predict only when there is a hit in the buffer
  – Miss stride can be calculated from the relative position of consecutive misses for the same address
  – The size of the buffer determines the number of predictions
• Prefetch buffer (on-chip)
  – Stores the contents of prefetched memory addresses
  – The size of the buffer determines how much variation in the miss stride can be tolerated
• Prefetch scheduler
  – Select the right time to prefetch
  – Avoid collisions
• Prefetcher
  – prefetch the contents of the miss address into the on-chip prefetch buffer
Prefetch Scheduler

Pointer Stream Buffer
Appendix: Prefetching Adaptation Results

Prefetching for Latency & BW Management
• Combat latency deterioration
  – optimal prefetching:
    » “memory side pointer chasing”
  – blocking mechanisms
  – fast barrier, broadcast support
  – synchronization support
• Bandwidth management
  – memory (re)organization to suit application characteristics
  – translate and gather hardware
    » “prefetching with compaction”
Adaptation for Latency Tolerance
• Operation
  1. Application sets prefetch parameters (compiler controlled)
  2. Prefetching event generation (runtime controlled)
    » when a new cache block is filled
[Diagram: prefetcher between CPU/L1 and the L2 cache — observes the virtual addr./data and the physical addr., and issues additional addresses for data fills]
if (start <= vAddr && vAddr <= end) {
    if (pAddr & 0x20)
        addr = pAddr - 0x20;
    else
        addr = pAddr + 0x20;
    /* initiate fetch of cache line at addr into L1 */
}
Prefetching Experiments
• Operation
  1. Application sets prefetch parameters (compiler controlled)
    » set lower/upper bounds on memory regions (for memory protection etc.)
    » download pointer extraction function
    » element size
  2. Prefetching event generation (runtime controlled)
    » when a new cache block is filled
• Application view
    generate_event(“pointer to pass matrix element structure”)
    ...
    generate_event(“signal to enable prefetch”)
    < code on which prefetching is applied >
    generate_event(“signal to disable prefetch”)
Adaptation for Bandwidth Reduction
• Prefetching entire row/column
• Pack cache with used data only
[Diagram: program view vs. physical layout — address translation and gather logic sit between the processor/L1 cache and memory; sparse-matrix elements stored as (val, RowPtr, ColPtr) records are translated and gathered so that only the used fields (e.g. val, col) are packed into cache lines]
• No change in program logical data structures
• Partition cache
• Translate data
• Synthesize pointer
(A software view of the gather step is sketched below.)
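In software terms, the gather step amounts to fetching only the fields a loop actually touches from each sparse-matrix element and packing them densely, so every cache line delivered is full of useful data. A sketch under assumed record layouts (not the actual MORPH hardware interface):

    #include <stdlib.h>

    /* Program view: full sparse-matrix element record. */
    struct elem   { double val; int row; int col; int rowPtr; int colPtr; };

    /* What the gather logic hands the cache for a loop that reads only val and col. */
    struct packed { double val; int col; };

    /* Software model of "translate + gather": walk the physical layout and emit a
       dense stream of just the used fields, instead of whole records. */
    static struct packed *gather_used_fields(const struct elem *a, int n)
    {
        struct packed *p = malloc((size_t)n * sizeof *p);
        for (int i = 0; i < n; i++) {
            p[i].val = a[i].val;
            p[i].col = a[i].col;
        }
        return p;    /* caller frees */
    }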
Simulation Results: Latency
• Sparse MM: blocking, prefetching, packing (all based on application data structures)
[Charts: L1 cache miss rate for reads and writes, and data traffic, for the Sparse-MM, SW-Blocked, HW-Pack/Unpack and HW-P/U-WB configurations]
10X reduction in latency using application data structure optimization

Simulation Results: Bandwidth
• Optimization designed to significantly reduce the volume of data traffic
  – efficient fast storage management, packing, prefetch
[Charts: L1 cache miss rate for reads and writes, and data traffic (MB), for the Sparse-MM, SW-Blocked, HW-Pack/Unpack and HW-P/U-WB configurations]
100X reduction in BW using application-specific packing and fetching.