HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc GaudiotDARPA DIS PI Meeting
Santa Fe, NM
March 26-27, 2002
HiDISC: Hierarchical Decoupled Instruction Set Computer
New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicit management of each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput
Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
[Figure: HiDISC system diagram. Sensor inputs feed the applications (FLIR, SAR, video, ATR/SLD, scientific); a decoupling compiler targets a dedicated processor for each level of the memory hierarchy (registers, cache, memory, dynamic database) to support situational awareness.]
Schedule

April 2001
• Defined benchmarks
• Completed simulator
• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)
• Define HiDISC architecture
• Benchmark results
• Update simulator

March 2002
• Developed and tested a complete decoupling compiler
• Generated performance statistics and evaluated the design
Accomplishments
• Design of the HiDISC model
• Compiler development (operational)
• Simulator design (operational)
• Performance evaluation
  – Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  – Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
• Results: mostly higher performance (some lower)
• Range of applications
• HiDISC of the future
Outline
• Original Technical Objective
• HiDISC Architecture Review
• Benchmark Results
• Conclusion and Future Work
HiDISC: Hierarchical Decoupled Instruction Set Computer
Technological Trend: Memory latency is getting longer relative to microprocessor speed (the gap grows by roughly 40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: Benchmarks with large data sets: symbolic, signal-processing, and scientific programs
Present Solutions: Larger caches, prefetching (software & hardware), simultaneous multithreading
Present Solutions

Larger Caches
— Slow
— Works well only if the working set fits in the cache and there is temporal locality

Hardware Prefetching
— Cannot be tailored for each application
— Behavior based on past and present execution-time behavior

Software Prefetching
— Must ensure the overheads of prefetching do not outweigh the benefits → conservative prefetching
— Adaptive software prefetching is required to change the prefetch distance at run-time
— Hard to insert prefetches for irregular access patterns

Multithreading
— Solves the throughput problem, not the memory latency problem
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system - useful for speculative prefetching
The HiDISC Approach

Approach:
• Add a processor to manage prefetching (hides the prefetch overhead)
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's run-time behavior
What is HiDISC?
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
[Figure: HiDISC processor pipeline. The Computation Processor (CP) with its registers and L1 cache, the Access Processor (AP), and the Cache Management Processor (CMP) in front of the L2 cache and higher levels (2- and 3-issue units), linked by the Slip Control Queue, Store Data Queue, Store Address Queue, and Load Data Queue.]
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
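The adaptation rule above can be sketched in Python. The class name, size bounds, and step size here are illustrative assumptions; only the if/else policy comes from the slide.

```python
# Minimal sketch of the SCQ adaptation rule, assuming a unit step and
# fixed bounds on the slip distance (both invented for illustration).
class SlipControlQueue:
    def __init__(self, size=50, min_size=1, max_size=200):
        self.size = size              # current slip distance
        self.min_size = min_size
        self.max_size = max_size

    def adapt(self, late_prefetches, useful_prefetches, buffer_full):
        """Apply the adaptation rule once per sampling interval."""
        if buffer_full:
            return self.size          # don't change size of SCQ
        if 2 * late_prefetches > useful_prefetches:
            # too many late prefetches: let the AP run further ahead
            self.size = min(self.size + 1, self.max_size)
        else:
            self.size = max(self.size - 1, self.min_size)
        return self.size
```

Counting a prefetch as "late" when it arrives after its load has issued lets the rule grow the slip distance exactly when the AP is not far enough ahead of the CP.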
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner loop convolution:
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor code:
    for (j = 0; j < i; ++j) {
        load(x[j]);
        load(h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management code:
    for (j = 0; j < i; ++j) {
        prefetch(x[j]);
        prefetch(h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
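A toy Python model may make the queue traffic above concrete. The function name and token value are illustrative; only the LDQ/SDQ/EOD roles come from the slide, and the two "processors" run sequentially here rather than concurrently.

```python
from collections import deque

# Toy model of the decoupled convolution step: the access-stream role
# fills a load-data queue (LDQ) and terminates it with an EOD token;
# the computation-stream role drains operand pairs from the LDQ and
# pushes the result to the store-data queue (SDQ).
def convolution_step(x, h, i):
    ldq, sdq = deque(), deque()

    # Access-stream role: issue the loads for iteration i.
    for j in range(i):
        ldq.append(x[j])
        ldq.append(h[i - j - 1])
    ldq.append("EOD")

    # Computation-stream role: consume operand pairs until EOD.
    y = 0
    while True:
        a = ldq.popleft()
        if a == "EOD":
            break
        b = ldq.popleft()
        y += a * b

    sdq.append(y)          # the result travels to memory via the SDQ
    return sdq.popleft()
```

In the real architecture the two loops run on separate processors, so the AP can be many iterations ahead of the CP; the SCQ bounds that slip.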
General View of the Compiler

HiDISC compilation overview:
    Source Program → Gcc → Binary Code → Disassembler → Stream Separator
    → Computation Code / Access Code / Cache Mgmt. Code

HiDISC Stream Separator:
    Sequential Source → Data Dependency Graph → Defining Load/Store Instructions
    → Instruction Chasing for Backward Slice → Computation Stream / Access Stream
    → Insert Communication Instructions → Computation Code / Access Code
    → Insert Prefetch Instructions (on the access stream) → Cache Management Code
Stream Separation: Backward Load/Store Chasing

    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $f16, 88($9)
    l.d   $f18, 0($sp)
    mul.d $f4, $f16, $f18
    l.d   $f6, 8($sp)
    l.d   $f8, 80($9)
    mul.d $f10, $f6, $f8
    add.d $f16, $f4, $f10
    la    $10, y
    addu  $11, $25, $10
    l.d   $f18, 0($11)
    mul.d $f6, $f16, $f18
    l.d   $f8, 16($sp)
    add.d $f4, $f6, $f8
    la    $12, x
    addu  $13, $25, $12
    s.d   $f4, 0($13)

Source line (LLL1): x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] );

• Load/store instructions and their "backward slice" (found by backward chasing) form the Access stream; the remaining instructions are sent to the Computation stream
• Load results are communicated to the Computation stream via the LDQ; store data via the SDQ
• Communications between the two streams are also defined at this point
Stream Separation: Creating an Access Stream

Communication takes place via the various hardware queues (LDQ, SDQ):
• Insert communication instructions
• CMP instructions are copied from the Access-stream instructions, except that load instructions are replaced by prefetch instructions

Source line: x[k] = q + y[k]*( r*z[k+10] + t*z[k+11] );

Access stream:
    lw    $24, 24($sp)
    mul   $25, $24, 8
    la    $8, z
    addu  $9, $25, $8
    l.d   $LDQ, 88($9)
    l.d   $LDQ, 0($sp)
    l.d   $LDQ, 8($sp)
    l.d   $LDQ, 80($9)
    la    $10, y
    addu  $11, $25, $10
    l.d   $LDQ, 0($11)
    l.d   $LDQ, 16($sp)
    la    $12, x
    addu  $13, $25, $12
    s.d   $SDQ, 0($13)

Computation stream:
    mul.d $f4, $LDQ, $LDQ
    mul.d $f10, $LDQ, $LDQ
    add.d $f16, $f4, $f10
    mul.d $f6, $f16, $LDQ
    add.d $SDQ, $f6, $LDQ
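The backward load/store chasing described above can be sketched as a worklist pass over a def-use list. The instruction encoding here is invented for illustration and is far simpler than real MIPS; in particular, a store's listed sources are its address operands only, since the stored value reaches it through the SDQ.

```python
# Sketch of backward load/store chasing: seed the Access stream with all
# loads/stores, then chase each source register back to its most recent
# definition and pull that instruction into the Access stream too.
# Instructions are (op, dest, sources) triples -- an invented encoding.
def separate_streams(instrs):
    access = set()
    worklist = [i for i, (op, _, _) in enumerate(instrs)
                if op in ("load", "store")]
    while worklist:
        i = worklist.pop()
        if i in access:
            continue
        access.add(i)
        _, _, srcs = instrs[i]
        for s in srcs:
            # find the most recent earlier definition of register s
            for j in range(i - 1, -1, -1):
                if instrs[j][1] == s:
                    worklist.append(j)
                    break
    computation = [i for i in range(len(instrs)) if i not in access]
    return sorted(access), computation
```

Everything the slice does not reach stays in the Computation stream, which is exactly why the `mul.d`/`add.d` arithmetic above ends up reading its operands from the LDQ.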
DIS Benchmarks

Application-oriented benchmarks. Main characteristic: large data sets with non-contiguous memory access and no temporal locality.
Benchmark | Description | Status
Method of Moments (MoM) | Computes the electromagnetic scattering from complex objects; computationally complex and demands high memory speed | Done
Multidimensional Fourier Transform (FFT) | Wide range of application usage: image processing, synthesis, convolution/deconvolution, and digital signal filtering | Done
Data Management (DM) | Traditional DBMS processing, focusing on index algorithms and ad hoc query processing | Done
Image Understanding (IU) | Contains algorithms that perform spatial filtering and data reduction: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component | Compiled (input file problem)
SAR Ray Tracing (RAY) | Cost-effective alternative to real data collection; utilizes an image-domain approach to SAR image simulation | Done
Atlantic Aerospace Stressmark Suite
• Smaller and more specific procedures
• Seven individual data-intensive benchmarks
• Directly illustrate particular elements of the DIS problem
• Fewer computation operations than the DIS benchmarks
• Focus on irregular memory accesses
Stressmark Suite*

Stressmark | Problem | Memory Access | Status
Pointer | Pointer following | Small blocks at unpredictable locations; can be parallelized | Done
Update | Pointer following with memory update | Small blocks at unpredictable locations | Done
Matrix | Conjugate-gradient simultaneous equation solver | Dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse | Done
Neighborhood | Calculate image texture measures by finding sum and difference histograms | Regular access to pairs of words at arbitrary distances | Done
Field | Collect statistics on a large field of words | Regular, with little reuse | Done
Corner-Turn | Matrix transposition | Block movement between processing nodes with practically nil computation | N/A
Transitive Closure | Find the all-pairs shortest-path solution for a directed graph | Dependent on matrix representation, but requires reads and writes to different matrices concurrently | Done

* DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
Pointer Stressmark
• Basic idea: repeatedly follow pointers to randomized locations in memory
• Memory access pattern is unpredictable (input-data dependent)
• Randomized memory access pattern: insufficient temporal and spatial locality for conventional cache architectures
• The HiDISC architecture should provide lower average memory access latency: the pointer chasing in the AP can run ahead without being blocked by the CP
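A toy pointer-chasing kernel in the spirit of the Pointer stressmark; the sizes, permutation scheme, and function names are illustrative, not the DIS reference code.

```python
import random

# Build a random cyclic chain: each slot holds the index of the next
# slot, so every access depends on the previous load -- the pattern a
# conventional cache cannot predict, but an AP running ahead can hide.
def make_chain(n, seed=0):
    order = list(range(1, n))
    random.Random(seed).shuffle(order)
    order.append(0)               # close the cycle back to slot 0
    next_idx = [0] * n
    cur = 0
    for nxt in order:
        next_idx[cur] = nxt
        cur = nxt
    return next_idx

def chase(next_idx, hops):
    cur = 0
    for _ in range(hops):
        cur = next_idx[cur]       # data-dependent, unpredictable access
    return cur
```

Because each index comes from the previous load, hardware prefetchers see no stride; a decoupled AP, by contrast, simply executes this loop ahead of the CP.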
Simulation Parameters (SimpleScalar 3.0, 2000)

Branch predictor: bimodal
Branch table size: 2048
Issue width: 4
Window size for superscalar: RUU = 16, LSQ = 8
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
Superscalar and HiDISC
• Ability to vary the memory latency
• The superscalar supports out-of-order issue with 16 RUU (Register Update Unit) entries and 8 LSQ (Load/Store Queue) entries
• Each of the AP and CP issues instructions in order
DIS Benchmark/Stressmark Results

Stressmarks:
    Pointer ↑   Tran. Cl. ↑   Field ↓   Matrix ↓   Neighborhood ↓   Update ↓   Corner-Turn N/A

DIS benchmarks:
    FFT ↑   MoM ↑   DM ↑   IU N/A   RAY ↑

↑: better with HiDISC   ↓: better with superscalar
DIS Benchmark Results
[Plot: RayTracing/DIS. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (25%), and HiDISC (37.50%).]

[Plot: DM/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (55.29%), and HiDISC (84.71%).]

[Plot: FFT/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (9.23%), and HiDISC (16.42%).]

[Plot: MoM/DIS. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (5.8%), and HiDISC (50.7%).]
DIS Benchmark Performance

The DIS benchmarks generally perform very well under our decoupled processing, for the following reasons:
• Many long-latency floating-point operations
• Robust under longer memory latency (e.g., Method of Moments is a more stream-like process)
• In FFT, the HiDISC architecture also suffers from longer memory latencies (due to the data dependencies between the two streams)
• DM is not affected by the longer memory latency in either case
Stressmark Results
[Plot: Pointer/Stress. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (1.75%), and HiDISC (74.08%).]

[Plot: Transitive Closure/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (17.50%), and HiDISC (99.97%).]
Stressmark Results – Good Performance
• Pointer chasing can be executed far ahead by using the decoupled access stream: it does not require computation results from the CP
• Transitive closure also produces good results: not much in the AP depends on the results of the CP, so the AP can run ahead
Some Weak Performance
[Plot: Matrix/Stress. IPC vs. L2 cache miss latency (20-120 cycles) for Superscalar, HiDISC without CMP, HiDISC (37.7%), and HiDISC (57.9%).]

[Plot: Update/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (49.8%), and HiDISC (74.8%).]

[Plot: Neighborhood/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (50%), and HiDISC (75%).]

[Plot: Field/Stress. IPC vs. L2 cache miss latency for Superscalar, HiDISC without CMP, HiDISC (3.3%), and HiDISC (99.6%).]
Performance Bottleneck
• Too much synchronization causes loss of decoupling
• Unbalanced code size between the two streams: the stressmark suite is highly "access-heavy"
• Application domain for decoupled architectures: a balanced ratio between computation and memory access
Synthesis
• Synchronization degrades performance: when the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
• More stream-like applications will benefit from using HiDISC; multithreading support is needed
• Applications should contain enough computation (a 1-to-1 ratio) to hide the memory access latency
• The CMP should be simple: it executes redundant operations if the data is already in the cache, and cache pollution can occur
Flexi-DISC

Fundamental characteristics:
• Inherently highly dynamic at execution time
• Dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

[Figure: Flexi-DISC rings. The Computation Kernel surrounded by a Low-Level Cache Access Ring and a Memory Interface Ring.]
Flexi-DISC: Partitioning of the Computation Kernel
• The CK can be allocated to different portions of one application or to different applications
• The CK requires separation of the next ring to feed it with data
• The variety of target applications makes the memory accesses unpredictable
• Identical processing units for the outer rings
• Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved

[Figure: Flexi-DISC resource partitioning. The Computation Kernel, Low-Level Cache Access Ring, and Memory Interface Ring partitioned among Application 1, Application 2, and Application 3.]