HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc GaudiotDARPA DIS PI Meeting
Santa Fe, NM
March 26-27, 2002
HiDISC: Hierarchical Decoupled Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicit management of each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
[Figure: HiDISC system diagram. Sensor inputs feed the applications (FLIR, SAR, video, ATR/SLD, scientific); a decoupling compiler targets a dedicated processor for each level of the memory hierarchy (registers, cache, memory, dynamic database) to support situational awareness.]
Schedule

April 2001
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark results
• Update simulator

March 2002
• Developed and tested a complete decoupling compiler
• Generated performance statistics and evaluated the design
Accomplishments
• Design of the HiDISC model
• Compiler development (operational)
• Simulator design (operational)
• Performance evaluation
  – Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  – Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
• Results: mostly higher performance (some lower)
• Range of applications
• HiDISC of the future
Outline
• Original Technical Objective
• HiDISC Architecture Review
• Benchmark Results
• Conclusion and Future Work
HiDISC: Hierarchical Decoupled Instruction Set Computer
Technological Trend: Memory latency is getting longer relative to microprocessor speed (the gap grows by roughly 40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: Benchmarks with large data sets: symbolic, signal-processing, and scientific programs
Present Solutions: Larger caches, prefetching (software & hardware), simultaneous multithreading
Present Solutions

Larger Caches
— Slow
— Works well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior based on past and present execution-time behavior

Software Prefetching
— Must ensure the overheads of prefetching do not outweigh the benefits → conservative prefetching
— Adaptive software prefetching is required to change the prefetch distance at run-time
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system - useful for speculative prefetching
The HiDISC Approach

Approach:
• Add a processor to manage prefetching (hides the prefetch overhead)
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's run-time behavior
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
[Figure: HiDISC processor pipeline. The Computation Processor (CP) with its registers and L1 cache, the Access Processor (AP), and the Cache Management Processor (CMP) in front of the L2 cache and higher levels (2- and 3-issue units), linked by the Slip Control Queue, Store Data Queue, Store Address Queue, and Load Data Queue.]
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
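The adaptation rule above can be sketched in Python. The class name, size bounds, and step size here are illustrative assumptions; only the if/else policy comes from the slide.

```python
# Minimal sketch of the SCQ adaptation rule, assuming a unit step and
# fixed bounds on the slip distance (both invented for illustration).
class SlipControlQueue:
    def __init__(self, size=50, min_size=1, max_size=200):
        self.size = size              # current slip distance
        self.min_size = min_size
        self.max_size = max_size

    def adapt(self, late_prefetches, useful_prefetches, buffer_full):
        """Apply the adaptation rule once per sampling interval."""
        if buffer_full:
            return self.size          # don't change size of SCQ
        if 2 * late_prefetches > useful_prefetches:
            # too many late prefetches: let the AP run further ahead
            self.size = min(self.size + 1, self.max_size)
        else:
            self.size = max(self.size - 1, self.min_size)
        return self.size
```

Counting a prefetch as "late" when it arrives after its load has issued lets the rule grow the slip distance exactly when the AP is not far enough ahead of the CP.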
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner loop convolution:
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:
    for (j = 0; j < i; ++j) {
        load(x[j]);
        load(h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:
    for (j = 0; j < i; ++j) {
        prefetch(x[j]);
        prefetch(h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
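A toy Python model may make the queue traffic above concrete. The function name and token value are illustrative; only the LDQ/SDQ/EOD roles come from the slide, and the two "processors" run sequentially here rather than concurrently.

```python
from collections import deque

# Toy model of the decoupled convolution step: the access-stream role
# fills a load-data queue (LDQ) and terminates it with an EOD token;
# the computation-stream role drains operand pairs from the LDQ and
# pushes the result to the store-data queue (SDQ).
def convolution_step(x, h, i):
    ldq, sdq = deque(), deque()

    # Access-stream role: issue the loads for iteration i.
    for j in range(i):
        ldq.append(x[j])
        ldq.append(h[i - j - 1])
    ldq.append("EOD")

    # Computation-stream role: consume operand pairs until EOD.
    y = 0
    while True:
        a = ldq.popleft()
        if a == "EOD":
            break
        b = ldq.popleft()
        y += a * b

    sdq.append(y)          # the result travels to memory via the SDQ
    return sdq.popleft()
```

In the real architecture the two loops run on separate processors, so the AP can be many iterations ahead of the CP; the SCQ bounds that slip.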
General View of the Compiler

HiDISC compilation overview:
    Source Program → Gcc → Binary Code → Disassembler → Stream Separator
    → Computation Code / Access Code / Cache Mgmt. Code

HiDISC Stream Separator:
    Sequential Source → Data Dependency Graph → Defining Load/Store Instructions
    → Instruction Chasing for Backward Slice → Computation Stream / Access Stream
    → Insert Communication Instructions → Computation Code / Access Code
    → Insert Prefetch Instructions (on the access stream) → Cache Management Code
Stream Separation: Backward Load/Store Chasing

    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $f16, 88($9)
    l.d   $f18, 0($sp)
    mul.d $f4, $f16, $f18
    l.d   $f6, 8($sp)
    l.d   $f8, 80($9)
    mul.d $f10, $f6, $f8
    add.d $f16, $f4, $f10
    la    $10, y
    addu  $11, $25, $10
    l.d   $f18, 0($11)
    mul.d $f6, $f16, $f18
    l.d   $f8, 16($sp)
    add.d $f4, $f6, $f8
    la    $12, x
    addu  $13, $25, $12
    s.d   $f4, 0($13)

Source line (LLL1): x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] );

• Load/store instructions and their "backward slice" (found by backward chasing) form the Access stream; the remaining instructions are sent to the Computation stream
• Load results are communicated to the Computation stream via the LDQ; store data via the SDQ
• Communications between the two streams are also defined at this point
Stream Separation: Creating an Access Stream

Communication takes place via the various hardware queues (LDQ, SDQ):
• Insert communication instructions
• CMP instructions are copied from the Access-stream instructions, except that load instructions are replaced by prefetch instructions

Source line: x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] );

Access stream:
    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $LDQ, 88($9)
    l.d   $LDQ, 0($sp)
    l.d   $LDQ, 8($sp)
    l.d   $LDQ, 80($9)
    la    $10, y
    addu  $11, $25, $10
    l.d   $LDQ, 0($11)
    l.d   $LDQ, 16($sp)
    la    $12, x
    addu  $13, $25, $12
    s.d   $SDQ, 0($13)

Computation stream:
    mul.d $f4, $LDQ, $LDQ
    mul.d $f10, $LDQ, $LDQ
    add.d $f16, $f4, $f10
    mul.d $f6, $f16, $LDQ
    add.d $SDQ, $f6, $LDQ
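The backward load/store chasing described above can be sketched as a worklist pass over a def-use list. The instruction encoding here is invented for illustration and is far simpler than real MIPS; in particular, a store's listed sources are its address operands only, since the stored value reaches it through the SDQ.

```python
# Sketch of backward load/store chasing: seed the Access stream with all
# loads/stores, then chase each source register back to its most recent
# definition and pull that instruction into the Access stream too.
# Instructions are (op, dest, sources) triples -- an invented encoding.
def separate_streams(instrs):
    access = set()
    worklist = [i for i, (op, _, _) in enumerate(instrs)
                if op in ("load", "store")]
    while worklist:
        i = worklist.pop()
        if i in access:
            continue
        access.add(i)
        _, _, srcs = instrs[i]
        for s in srcs:
            # find the most recent earlier definition of register s
            for j in range(i - 1, -1, -1):
                if instrs[j][1] == s:
                    worklist.append(j)
                    break
    computation = [i for i in range(len(instrs)) if i not in access]
    return sorted(access), computation
```

Everything the slice does not reach stays in the Computation stream, which is exactly why the `mul.d`/`add.d` arithmetic above ends up reading its operands from the LDQ.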
DIS Benchmarks

Application-oriented benchmarks. Main characteristic: large data sets with non-contiguous memory access and no temporal locality.
Benchmark | Description | Status
Method of Moments (MoM) | Computes the electromagnetic scattering from complex objects; computationally complex and demands high memory speed | Done
Multidimensional Fourier Transform (FFT) | Wide range of application usage: image processing, synthesis, convolution/deconvolution, and digital signal filtering | Done
Data Management (DM) | Traditional DBMS processing, focusing on index algorithms and ad hoc query processing | Done
Image Understanding (IU) | Contains algorithms that perform spatial filtering and data reduction: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component | Compiled (input file problem)
SAR Ray Tracing (RAY) | Cost-effective alternative to real data collection; utilizes an image-domain approach to SAR image simulation | Done
Atlantic Aerospace Stressmark Suite
• Smaller and more specific procedures
• Seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
• Fewer computation operations than the DIS benchmarks
• Focus on irregular memory accesses
Stressmark Suite*

Stressmark | Problem | Memory Access | Status
Pointer | Pointer following | Small blocks at unpredictable locations; can be parallelized | Done
Update | Pointer following with memory update | Small blocks at unpredictable locations | Done
Matrix | Conjugate-gradient simultaneous equation solver | Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse | Done
Neighborhood | Calculate image texture measures by finding sum and difference histograms | Regular access to pairs of words at arbitrary distances | Done
Field | Collect statistics on a large field of words | Regular, with little reuse | Done
Corner-Turn | Matrix transposition | Block movement between processing nodes with practically nil computation | N/A
Transitive Closure | Find the all-pairs shortest-path solution for a directed graph | Dependent on matrix representation, but requires reads and writes to different matrices concurrently | Done

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
Pointer Stressmark
• Basic idea: repeatedly follow pointers to randomized locations in memory
• Memory access pattern is unpredictable (input-data dependent)
• Randomized memory access pattern: insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture should provide lower average memory access latency: the pointer chasing in the AP can run ahead without being blocked by the CP
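A toy pointer-chasing kernel in the spirit of the Pointer stressmark; the sizes, permutation scheme, and function names are illustrative, not the DIS reference code.

```python
import random

# Build a random cyclic chain: each slot holds the index of the next
# slot, so every access depends on the previous load -- the pattern a
# conventional cache cannot predict, but an AP running ahead can hide.
def make_chain(n, seed=0):
    order = list(range(1, n))
    random.Random(seed).shuffle(order)
    order.append(0)               # close the cycle back to slot 0
    next_idx = [0] * n
    cur = 0
    for nxt in order:
        next_idx[cur] = nxt
        cur = nxt
    return next_idx

def chase(next_idx, hops):
    cur = 0
    for _ in range(hops):
        cur = next_idx[cur]       # data-dependent, unpredictable access
    return cur
```

Because each index comes from the previous load, hardware prefetchers see no stride; a decoupled AP, by contrast, simply executes this loop ahead of the CP.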
Simulation Parameters (SimpleScalar 3.0, 2000)

Branch predictor: bimodal
Branch table size: 2048
Issue width: 4
Window size for superscalar: RUU = 16, LSQ = 8
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
Superscalar and HiDISC
• Ability to vary the memory latency
• The superscalar supports out-of-order issue with 16 RUU (Register Update Unit) entries and 8 LSQ (Load/Store Queue) entries
• Each of the AP and CP issues instructions in order
DIS Benchmark/Stressmark Results

Stressmarks:
    Pointer ↑   Tran. Cl. ↑   Field ↓   Matrix ↓   Neighborhood ↓   Update ↓   Corner-Turn N/A

DIS benchmarks:
    FFT ↑   MoM ↑   DM ↑   IU N/A   RAY ↑

↑: better with HiDISC   ↓: better with superscalar
DIS Benchmark Results
[Plot: RayTracing/DIS. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (25%), and HiDISC (37.50%).]

[Plot: DM/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (55.29%), and HiDISC (84.71%).]

[Plot: FFT/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (9.23%), and HiDISC (16.42%).]

[Plot: MoM/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (5.8%), and HiDISC (50.7%).]
DIS Benchmark Performance

The DIS benchmarks generally perform very well under our decoupled processing, for the following reasons:
• Many long-latency floating-point operations
• Robust under longer memory latency (e.g., Method of Moments is a more stream-like process)
• In FFT, the HiDISC architecture also suffers from longer memory latencies (due to the data dependencies between the two streams)
• DM is not affected by the longer memory latency in either case
Stressmark Results
[Plot: Pointer/Stress. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (1.75%), and HiDISC (74.08%).]

[Plot: Transitive Closure/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (17.50%), and HiDISC (99.97%).]
Stressmark Results – Good Performance
• Pointer chasing can be executed far ahead by using the decoupled access stream: it does not require computation results from the CP
• Transitive closure also produces good results: not much in the AP depends on the results of the CP, so the AP can run ahead
Some Weak Performance
[Plot: Matrix/Stress. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (37.7%), and HiDISC (57.9%).]

[Plot: Update/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (49.8%), and HiDISC (74.8%).]

[Plot: Neighborhood/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (50%), and HiDISC (75%).]

[Plot: Field/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (3.3%), and HiDISC (99.6%).]
Performance Bottleneck
• Too much synchronization causes loss of decoupling
• Unbalanced code size between the two streams: the stressmark suite is highly "access-heavy"
• Application domain for decoupled architectures: a balanced ratio between computation and memory access
Synthesis
• Synchronization degrades performance: when the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
• More stream-like applications will benefit from using HiDISC; multithreading support is needed
• Applications should contain enough computation (a 1-to-1 ratio) to hide the memory access latency
• The CMP should be simple: it executes redundant operations if the data is already in the cache, and cache pollution can occur
Flexi-DISC

Fundamental characteristics:
• Inherently highly dynamic at execution time
• Dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

[Figure: Flexi-DISC rings. The Computation Kernel surrounded by a Low-Level Cache Access Ring and a Memory Interface Ring.]
Flexi-DISC: Partitioning of the Computation Kernel
• The CK can be allocated to different portions of one application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings
• Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

[Figure: Flexi-DISC resource partitioning. The Computation Kernel, Low-Level Cache Access Ring, and Memory Interface Ring partitioned among Application 1, Application 2, and Application 3.]