CacheMiner: Run-Time Cache Locality Exploitation on SMPs

[Figure: SMP architecture. Each CPU has its own on-chip cache and off-chip cache; the CPUs are connected through an interconnection network to shared memory.]

Example Program transformations for cache locality : Tiling

for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            A[i,j] = A[i,j] + B[i,k] * C[k,j]

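For a matrix multiplication of 1000 x 1000 :

[Figure: the three matrices before and after tiling with a tile size of 32.]

Data accessed, untiled : 1000 + 1000 + 1000 x 1000 = 1002000 elements
Data accessed, tiled : 32 x 32 + 1000 x 32 + 1000 x 32 = 65024 elements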
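The tiling transformation above can be sketched in C as follows. This is a minimal illustration rather than the code from the slides: the function name matmul_tiled, the row-major layout, and the choice to tile the j and k loops are assumptions. With n = 1000 and tile = 32, each (jj, kk) pass touches a 32 x 32 block of C plus two 1000 x 32 strips of A and B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tiled version of A = A + B * C for n x n row-major matrices.
 * `tile` is the tile edge length (the slide's example uses 32). */
void matmul_tiled(double *A, const double *B, const double *C,
                  size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t jj = 0; jj < n; jj += tile) {
        size_t j_end = (jj + tile < n) ? jj + tile : n;
        for (size_t kk = 0; kk < n; kk += tile) {
            size_t k_end = (kk + tile < n) ? kk + tile : n;
            /* The tile x tile block of C is reused for every row i. */
            for (size_t i = 0; i < n; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    double sum = A[i * n + j];
                    for (size_t k = kk; k < k_end; k++)
                        sum += B[i * n + k] * C[k * n + j];
                    A[i * n + j] = sum;
                }
            }
        }
    }
}
```

The loop bounds are clamped with j_end and k_end so that n does not have to be a multiple of the tile size.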

But it is hard for the compiler to analyse indirect accesses.

void myfunc(int source_arr[], int key_arr[], int result_arr[], int n)
{
    int i;
    for (i = 0; i < n; i++) {
        result_arr[i] += source_arr[key_arr[i]];  // Indirection !
    }
}

• The data access pattern of the function depends on the contents of key_arr[] .

• So the data access pattern cannot be determined at compile time, but only at run-time.

• Cacheminer is especially useful for such scenarios.

Targeted Model

for (i1 = lower_1; i1 < upper_1; i1++)
    for (i2 = lower_2; i2 < upper_2; i2++)
        for (i3 = lower_3; i3 < upper_3; i3++)
            ...
                for (ik = lower_k; ik < upper_k; ik++) {
                    Task B = block of statements;
                }

(k nested loops)

Let B(t1, t2, …, tk) denote task B, where t1, t2, …, tk are particular values of the loop variables i1, i2, …, ik respectively.

The tasks need to be data independent of each other, i.e. for any two distinct tasks B1 and B2 :

$\mathrm{Out}(B_1) \cap \mathrm{Out}(B_2) = \emptyset$
$\mathrm{Out}(B_1) \cap \mathrm{In}(B_2) = \emptyset$
$\mathrm{In}(B_1) \cap \mathrm{Out}(B_2) = \emptyset$
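As an illustration of the independence conditions, the sketch below checks them for two tasks whose In and Out sets are each a single contiguous address range. That representation, and the names range_t, task_sets_t and tasks_independent, are assumptions made for the example; the slides do not say how the sets are represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { size_t lo, hi; } range_t;        /* half-open range [lo, hi) */
typedef struct { range_t in, out; } task_sets_t;  /* In(B) and Out(B) of one task */

/* True if the two half-open ranges share at least one address. */
static bool ranges_overlap(range_t a, range_t b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    if (a.lo == a.hi || b.lo == b.hi)
        return false;                             /* an empty range overlaps nothing */
    return a.lo < b.hi && b.lo < a.hi;
}

/* True if none of Out/Out, Out/In and In/Out intersect. */
bool tasks_independent(task_sets_t b1, task_sets_t b2)
{
    return !ranges_overlap(b1.out, b2.out) &&
           !ranges_overlap(b1.out, b2.in)  &&
           !ranges_overlap(b1.in,  b2.out);
}
```

Two tasks that read the same input range but write disjoint output ranges pass the check, because In(B1) and In(B2) are allowed to intersect.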

System Overview

[Figure: pipeline. program → Hint Addition (compiler) → Access Pattern Estimation → Task Grouping → Task Partitioning → Task Scheduling (library)]

• Program : a C program.

• Hint Addition : add calls to library functions which provide hints to the run-time system.

• 1. Access Pattern Estimation : use the hints to estimate the pattern of accesses.

• 2. Task Grouping : group together tasks which access closely placed data into bins.

• 3. Task Partitioning : partition the bins among P processors to maximize data locality and load sharing.

• 4. Task Scheduling : schedule tasks on the processors, ensuring overall load balancing.

Step 1 : Estimating Memory Accesses

• Assumption : Task B accesses only chunks of elements in multiple arrays

• 4 Hints provided to the module :

a. Number of Arrays accessed : n (Compile Time)

b. Size in bytes of each array : Vector (s1,s2…sn) (Compile Time)

c. Number of processors : p (Compile Time).

d. Access footprint B(a1, a2, …, an) : the starting access address in each of the n arrays for Task B (Run Time).

• Each task can then be represented as a point B(a1, a2, …, an) in n-dimensional space.

Example : int P[100] and int Q[200], with sizeof(int) = 4.
Memory layout of P : size = 100 * sizeof(int) = 400 bytes; starting address &P[0] = 1000.
Memory layout of Q : size = 200 * sizeof(int) = 800 bytes; starting address &Q[0] = 100.

[Figure: 2-dimensional task grid. Horizontal axis: access dimension in P, from 1000 to 1400. Vertical axis: access dimension in Q, from 100 to 900.]

Each task B(x, y) is a point in the 2-dimensional grid :
x : starting access address of array 1 (P) for the task
y : starting access address of array 2 (Q) for the task

Example points : B1(1000, 900) and B2(1000, 100)
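A minimal sketch of how a task's footprint point could be computed from the hints, assuming the run-time knows each array's base address and the index of the first element the task touches. The names footprint_t and make_footprint are hypothetical, and the base addresses 1000 and 100 are the illustrative values from the example, not real pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_ARRAYS 2   /* n in the slides: number of arrays the task accesses */

/* A task as a point in n-dimensional address space. */
typedef struct { uintptr_t a[N_ARRAYS]; } footprint_t;

/* Build the footprint from each array's base address and the index of the
 * first element the task accesses in that array. */
footprint_t make_footprint(const uintptr_t base[N_ARRAYS],
                           const size_t first_index[N_ARRAYS],
                           size_t elem_size)
{
    assert(elem_size > 0);
    footprint_t f;
    for (int d = 0; d < N_ARRAYS; d++)
        f.a[d] = base[d] + first_index[d] * elem_size;
    return f;
}
```

With base addresses {1000, 100}, an element size of 4 and first indices {0, 0}, the function yields the point (1000, 100), which is B2 in the example.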

Step 2 : Grouping Tasks By Locality

A. Shift to Origin.

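[Figure: the task grid (access dimension in P from 1000 to 1400, access dimension in Q from 100 to 900, with B1(1000, 900) and B2(1000, 100)) shifted so that it starts at the origin; the shifted axes are labelled 4000 and 800.]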

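B. Shrink the dimensions by C/n, where C is the cache size and n is the number of arrays :

In the example : n = 2 and cache size = 200, so shrink each dimension by 200/2 = 100.

[Figure: the shrunken grid of bins, with axes labelled 40 and 8 bins.]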
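The two grouping operations, shift to origin and shrink by C/n, can be sketched as a single mapping from a task's address point to its bin coordinates. The function name task_to_bin and its parameter layout are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a task's address point to bin coordinates.
 * point[d]   : starting access address of the task in array d
 * base[d]    : starting address of array d (used for the shift to origin)
 * cache_size : C, in the same units as the addresses
 * bin[d]     : output, bin index along dimension d */
void task_to_bin(const uintptr_t *point, const uintptr_t *base,
                 size_t n, size_t cache_size, size_t *bin)
{
    assert(n > 0 && cache_size >= n);
    size_t shrink = cache_size / n;               /* C/n */
    for (size_t d = 0; d < n; d++) {
        assert(point[d] >= base[d]);
        bin[d] = (size_t)(point[d] - base[d]) / shrink;
    }
}
```

With n = 2 and a cache size of 200, the point B1(1000, 900) shifts to (0, 800) and lands in bin (0, 8), while B2(1000, 100) lands in bin (0, 0). Tasks that fall in the same bin start within C/n of each other in every array, so they are candidates for running back to back.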

Step 3 : Partitioning Bins among ‘P’ Processors

• Need to form ‘P’ groups of bins such that the sharing between them is minimized.

• The problem is NP-complete, so use a heuristic method to divide up the bin space :

i. Form the prime factors of ‘P’ and, for each prime factor Rj, divide a dimension of the bin space into Rj chunks.

Example : Suppose we have 6 processors :

6 = 2 x 3
So the ‘x’ dimension is divided into 2 parts and the ‘y’ dimension is divided into 3 parts.
Thus, there are a total of 2 x 3 = 6 distinct regions (all bins in one region are processed by one processor).

[Figure: the 40 x 8 bin space divided into 6 distinct regions.]
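A minimal sketch of the prime-factor heuristic. Two details are assumptions that the slides do not specify: the prime factors are handed to the dimensions in round-robin order, and each dimension is cut into equal-width chunks. The names split_dimensions and region_of are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Distribute the prime factors of p over n dimensions, round-robin.
 * parts[d] receives the number of chunks along dimension d; the product
 * of all parts[d] equals p. */
void split_dimensions(unsigned p, size_t n, unsigned parts[])
{
    assert(p > 0 && n > 0);
    for (size_t d = 0; d < n; d++)
        parts[d] = 1;
    size_t d = 0;
    unsigned f = 2;
    while (p > 1) {
        if (p % f == 0) {          /* f is the smallest remaining prime factor */
            parts[d] *= f;
            p /= f;
            d = (d + 1) % n;
        } else {
            f++;
        }
    }
}

/* Map a bin coordinate to a region (processor) id in [0, p).
 * extent[d] is the number of bins along dimension d. */
unsigned region_of(const size_t bin[], const size_t extent[],
                   const unsigned parts[], size_t n)
{
    unsigned id = 0;
    for (size_t d = 0; d < n; d++) {
        assert(extent[d] > 0 && bin[d] < extent[d]);
        unsigned chunk = (unsigned)(bin[d] * parts[d] / extent[d]);
        id = id * parts[d] + chunk;
    }
    return id;
}
```

For P = 6 and two dimensions, split_dimensions gives 2 chunks along the first dimension and 3 along the second, matching the example. Equal-width chunks balance the bin space but not necessarily the number of tasks per region, which is one reason the later scheduling step still needs load balancing.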

Step 4 : Adaptive Scheduling of Task Groups

[Figure: a processor's task list is a sequence of bins; the processor takes ‘K’ bins at a time.]

Local Scheduling : Each processor processes bins from its own task list.

Global Scheduling : When a processor finishes its own task list, it starts processing the task list of the most heavily loaded processor.
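The local and global rules can be sketched as one selection function that tells a processor which task list to draw from next. The name next_source and the use of a per-processor count of remaining bins are assumptions; a real run-time would also need synchronization around the shared task lists, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Decide which task list processor `self` should take bins from.
 * remaining[q] is the number of bins left in processor q's task list.
 * Returns `self` while its own list is non-empty (local scheduling),
 * otherwise the most heavily loaded processor (global scheduling),
 * or -1 if every list is empty. */
int next_source(int self, const size_t remaining[], int p)
{
    assert(p > 0 && self >= 0 && self < p);
    if (remaining[self] > 0)
        return self;
    int victim = -1;
    size_t most = 0;
    for (int q = 0; q < p; q++) {
        if (remaining[q] > most) {
            most = remaining[q];
            victim = q;
        }
    }
    return victim;
}
```

Preferring the local list keeps a processor on bins whose data is likely to be in its cache already; it only moves to another processor's list once it has no local work left.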

Adaptive Control : A processor takes ‘K’ bins at a time to process. K changes depending on the number of remaining bins :

$$K_i = \begin{cases} \max(p/2,\; K_{i-1} - 1) & \text{if few bins remain in the task list (light load)} \\ \min(2p,\; K_{i-1} + 1) & \text{if lots of bins remain in the task list (heavy load)} \end{cases}$$
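A minimal sketch of the adaptive update of K, under a few stated assumptions: the previous value of K is passed in explicitly, p/2 uses integer division, and a third "normal" load state that leaves K unchanged is added because the slide only lists the light and heavy cases. How load is classified as light or heavy is not specified in the slides, so it is passed in as a parameter. The names load_t and next_k are hypothetical.

```c
#include <assert.h>

typedef enum { LOAD_LIGHT, LOAD_NORMAL, LOAD_HEAVY } load_t;

/* Return the next chunk size K given the previous one.
 * Light load : K shrinks by one, but never below p/2.
 * Heavy load : K grows by one, but never above 2p.
 * Normal load: K is unchanged (an assumption, see the lead-in). */
unsigned next_k(unsigned k_prev, unsigned p, load_t load)
{
    assert(p > 0);
    unsigned lo = p / 2;
    unsigned hi = 2 * p;
    switch (load) {
    case LOAD_LIGHT: {
        unsigned dec = (k_prev > 0) ? k_prev - 1 : 0;
        return (dec > lo) ? dec : lo;       /* max(p/2, K - 1) */
    }
    case LOAD_HEAVY: {
        unsigned inc = k_prev + 1;
        return (inc < hi) ? inc : hi;       /* min(2p, K + 1) */
    }
    default:
        return k_prev;
    }
}
```

The bounds p/2 and 2p keep K within a fixed window, so a processor never takes bins in chunks that are very small or very large relative to the number of processors.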

Results

[Figures: performance results. Panel titles: Static Access Pattern (two panels) and Dynamic Access Pattern. Legend: Manually optimized, With Cacheminer.]

Summary

• Framework to exploit run-time cache locality on SMPs.

• Targeted at nested-loop structures that access a number of arrays.

• Especially useful for indirect accesses, where the data access pattern cannot be determined until run time.

• Overall phases : program → Hint Addition (compiler) → Access Pattern Estimation → Task Grouping → Task Partitioning → Task Scheduling (library)
