
Page 1

“Nahalal: Cache Organization for Chip Multiprocessors”

New LSU Policy

By: Ido Shayevitz and Yoav Shargil

Supervisor: Zvika Guz

Software Systems Lab

Department of Electrical Engineering, Technion

Page 2

NAHALAL PROJECT

This project is based on the article "Nahalal: Cache Organization for Chip Multiprocessors" by Zvika Guz, Idit Keidar, Avinoam Kolodny, and Uri C. Weiser.

The NAHALAL project deals with a chip-multiprocessor environment.

Project goal: to provide an appropriate organization of the cache memory for a multiprocessor environment.

Page 3

Project Steps

1. Read the article and look for a suitable idea.

2. Learn the Simics simulator.

3. Install Simics 3 on the project host.

4. Learn the g-cache module C code and the NAHALAL code.

5. Write the implementation code of our project.

6. Define and write the benchmarks.

7. Test and debug our model.

8. Run simulations, collect statistics, and draw conclusions.

9. Write the project book, presentation, and website.

Page 4

NAHALAL ARCHITECTURE

The NAHALAL architecture defines the memory banks of the L2 cache.

The basic distinction is between private banks and a public (shared) bank.

[Figure: CIM layout vs. NAHALAL layout]

Page 5

NAHALAL ARCHITECTURE

Placement policy

Replacement policy from the private bank: LRU

Replacement policy from the public bank: LRU

(See the LRU sketch below.)

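For illustration, here is a minimal sketch of LRU victim selection within one bank set. The associativity, field names, and timestamp-based aging are assumptions made for this example; they are not the actual g-cache data structures.

#include <stdint.h>

#define WAYS 16                      /* assumed associativity of one bank set */

typedef struct {
    uint64_t tag;                    /* line tag */
    uint64_t last_access;            /* time of the most recent access */
    int      valid;
} line_t;

/* Return the index of the LRU victim in one set:
 * prefer an invalid way, otherwise the valid way touched longest ago. */
static int lru_victim(const line_t set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_access < set[victim].last_access)
            victim = w;
    }
    return victim;
}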

Page 6

LSU Improvement

Placement policy

Replacement policy from the private bank: LRU

Replacement policy from the public bank: LSU


Page 7

LSU Implementation: Way 1

A shift register with N cells for each CPU.

Each cell in the shift register holds a line number.

On an eviction triggered by CPUi: for each line in CPUi's shift register, check which line appears least in the other CPUs' shift registers, and evict that one (see the sketch at the end of this slide).

With a 64-byte line size and a 1 MB public bank, 14 bits are needed per cell to identify a line. If N = 8, the memory overhead is 8 CPUs * 8 cells * 14 bits = 896 bits.

Only the 8 lines recorded in the shift register are candidates for replacement!

A line that no one accesses will be stuck in the public bank forever.

Complicated, slow algorithm in hardware.
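A rough sketch of the Way 1 selection rule follows, assuming 8 CPUs and hypothetical array and function names; it only illustrates the bookkeeping described above, not the actual hardware or g-cache code.

#define NUM_CPUS 8
#define N_CELLS  8                          /* N recently used lines per CPU */

/* Per-CPU shift register: the last N public-bank lines this CPU touched. */
static unsigned recent_lines[NUM_CPUS][N_CELLS];

/* Count how often 'line' appears in the other CPUs' shift registers. */
static int uses_by_others(unsigned line, int cpu)
{
    int count = 0;
    for (int c = 0; c < NUM_CPUS; c++) {
        if (c == cpu)
            continue;
        for (int i = 0; i < N_CELLS; i++)
            if (recent_lines[c][i] == line)
                count++;
    }
    return count;
}

/* On an eviction by 'cpu': among the N lines in its own shift register,
 * pick the one least used by the other CPUs. */
static unsigned way1_victim(int cpu)
{
    unsigned victim = recent_lines[cpu][0];
    int best = uses_by_others(victim, cpu);
    for (int i = 1; i < N_CELLS; i++) {
        int uses = uses_by_others(recent_lines[cpu][i], cpu);
        if (uses < best) {
            best = uses;
            victim = recent_lines[cpu][i];
        }
    }
    return victim;
}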

Page 8

LSU Implementation: Way 2

A shift register with N cells for each line.

Each cell in the shift register holds a CPU number.

On an eviction triggered by CPUi: search all the shift registers for the one that contains CPUi the most times, and evict that line (see the sketch at the end of this slide).

With 8 CPUs, 3 bits are needed per cell. With a 64-byte line size and a 1 MB public bank there are 2^14 lines in the public bank. If N = 8, the overhead is 2^14 * 8 * 3 = 393,216 bits, i.e. a 37.5% memory overhead.

A 37.5% memory overhead can cause a large distance between the lines in the public bank and the CPUs.

A 37.5% memory overhead also increases power consumption.

Complicated, slow algorithm in hardware.

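A corresponding sketch for the Way 2 selection rule, again with hypothetical names and 8 CPUs assumed; it shows only the selection over all per-line shift registers, ignoring how the registers would be organized in hardware.

#define NUM_LINES (1 << 14)                 /* 1 MB public bank / 64 B lines */
#define N_CELLS   8                         /* last N accessors per line */

/* Per-line shift register: the IDs of the last N CPUs that accessed it. */
static unsigned char last_cpus[NUM_LINES][N_CELLS];

/* On an eviction by 'cpu': evict the line whose shift register contains
 * 'cpu' the most times, i.e. the line used mostly by the evicting CPU
 * and shared the least with the others. */
static unsigned way2_victim(unsigned cpu)
{
    unsigned victim = 0;
    int best = -1;
    for (unsigned line = 0; line < NUM_LINES; line++) {
        int count = 0;
        for (int i = 0; i < N_CELLS; i++)
            if (last_cpus[line][i] == cpu)
                count++;
        if (count > best) {
            best = count;
            victim = line;
        }
    }
    return victim;
}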

Page 9

LSU Implementation: Way 3

A shift register with N cells for each line.

Each cell in the shift register holds a CPU number.

On an eviction triggered by CPUi: for each shift register, XOR every cell with the ID of CPUi. A shift register whose XOR results are all 0 (i.e., a line whose last N accessors were all CPUi) identifies the chosen victim. If none produces 0, do regular LRU (see the sketch at the end of this slide).

To reduce the memory overhead, define N = 4. The overhead is then 2^14 * 4 * 3 = 196,608 bits, i.e. an 18.75% memory overhead.

Simple, fast algorithm in hardware.

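A sketch of the Way 3 selection under the same assumptions (per-line shift registers of CPU IDs, N = 4, hypothetical names); the fall-back to the regular LRU choice is represented by a placeholder function.

#define NUM_LINES (1 << 14)                 /* 1 MB public bank / 64 B lines */
#define N_CELLS   4                         /* N = 4 to reduce the overhead */

/* Per-line shift register: the IDs of the last N CPUs that accessed it. */
static unsigned char last_cpus[NUM_LINES][N_CELLS];

extern unsigned lru_fallback(void);         /* regular LRU choice (placeholder) */

/* On an eviction by 'cpu': XOR every cell with the CPU ID.  A register
 * whose cells all XOR to 0 belongs to a line whose last N accesses were
 * all by this CPU -- an unshared line, so it is the chosen victim.
 * If no register XORs to 0, fall back to regular LRU. */
static unsigned way3_victim(unsigned cpu)
{
    for (unsigned line = 0; line < NUM_LINES; line++) {
        unsigned diff = 0;
        for (int i = 0; i < N_CELLS; i++)
            diff |= last_cpus[line][i] ^ cpu;   /* 0 only if cell == cpu */
        if (diff == 0)
            return line;
    }
    return lru_fallback();
}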

Page 10

Simics Simulator

An in-order instruction-set simulator; it simulates code running on a virtual machine.

Each instruction runs in one time step.

Collecting statistics during simulations.

Models can be built for all types of HW units and devices.

Memory Transactions, memory spaces, timing-model interface.

The g-cache module serves as the basis for defining the memory hierarchy and cache modules.

Stall time is the key metric in cache-model simulations.

Multiprocessor simulation uses round-robin scheduling with the default quantum.

Page 11

The Implementation

In the project book…

Page 12

Building Our Project

The project is built using the Simics g-cache Makefile:

shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake clean
Removing: g-cache
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake g-cache
Generating: modules.cache
=== Building module "g-cache" ===
Creating dependencies: module_id.c
Creating dependencies: sorter.c
Creating dependencies: splitter.c
Creating dependencies: gc.c
Creating dependencies: gc-interface.c
Creating dependencies: gc-attributes.c
Creating dependencies: gc-cyclic-repl.c
Creating dependencies: gc-lru-repl.c
Creating dependencies: gc-random-repl.c
Creating dependencies: gc-common-attributes.c
Creating dependencies: gc-common.c
Creating exportmap.elf
Compiling gc-common.c
Compiling gc-common-attributes.c
Compiling gc-random-repl.c
Compiling gc-lru-repl.c
Compiling gc-cyclic-repl.c
Compiling gc-attributes.c
Compiling gc-interface.c
Compiling gc.c
Compiling module_id.c
Linking g-cache.so
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib>

Page 13

Simulation Script

The simulation configuration is defined using a Python script:

Page 14

Writing Benchmarks

The benchmarks are written in the simulated target's console:

Page 15

Writing Benchmarks

Threads are created with the pthread library.

Each thread is pinned to a CPU using the sched library (see the sketch at the end of this slide).

Parallel code is written in the benchmark.

OS code and pthread library code also run in parallel.

Each benchmark is run first without LSU and then with LSU.

Our goals:

1. To show that LSU reduces cycles in parallel benchmarks.

2. To show that the improvement is larger in "LSU-dedicated" benchmarks.
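As an illustration of the thread setup, here is a minimal, hypothetical sketch of creating two threads and pinning each to its own CPU with sched_setaffinity; the function names, CPU numbers, and the empty benchmark body are placeholders, not the project's actual benchmark code.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling thread to the given CPU, then run the benchmark body. */
static void *worker(void *arg)
{
    int cpu = (int)(long)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this thread */
        perror("sched_setaffinity");
        exit(1);
    }

    /* ... benchmark body for this thread goes here ... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);   /* thread 1 -> CPU 1 */
    pthread_create(&t2, NULL, worker, (void *)2L);   /* thread 2 -> CPU 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}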

Page 16

Benchmark A

Each thread is associated with a different CPU; it reads from the shared (public) array and then writes to its own designated (private) array:

// read the shared array
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY[j];

// write to the designated array
accum = 0;
if (thread_num == 1) {
    des_arrray_ptr = DESGINATED_ARRAY1;
} else if (thread_num == 2) {
    des_arrray_ptr = DESGINATED_ARRAY2;
} else {
    printf("Error 0x100: thread_num has unexpected value\n");
    exit(1);
}

for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < DESIGNATED_ARRAY_SIZE; j++)
        *(des_arrray_ptr + j) = 1;

return NULL;
}

Page 17

Benchmark B

Each thread is associated with a different CPU; it reads the shared array (public), then only its own half of it (the private part), and then a second shared array (public again):

// read the first shared array - moves it into the public bank
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY1[j];

// now each thread reads only its own half of the shared array
if (thread_num == 1) {
    for (i = 0; i < NUM_OF_ITERATIONS; i++)
        for (j = 0; j < (SHARED_ARRAY_SIZE/2); j++)
            tmp = SHARED_ARRAY1[j];
} else if (thread_num == 2) {
    for (i = 0; i < NUM_OF_ITERATIONS; i++)
        for (j = (SHARED_ARRAY_SIZE/2); j < SHARED_ARRAY_SIZE; j++)
            tmp = SHARED_ARRAY1[j];
}

// read the second shared array
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY2[j];

return NULL;
}

Page 18

Collecting Statistics Example

Cache statistics: l2c
-----------------
Total number of transactions: 610349
Total memory stall time: 31402835
Total memory hit stall time: 28251635
Device data reads (DMA): 0
Device data writes (DMA): 0
Uncacheable data reads: 17
Uncacheable data writes: 30738
Uncacheable instruction fetches: 0
Data read transactions: 403488
Total read stall time: 17488735
Total read hit stall time: 14383135
Data read remote hits: 0
Data read misses: 10352
Data read hit ratio: 97.43%
Instruction fetch transactions: 0
Instruction fetch misses: 0
Data write transactions: 176106
Total write stall time: 4687600
Total write hit stall time: 4687600
Data write remote hits: 0
Data write misses: 0
Data write hit ratio: 100.00%
Copy back transactions: 0
Number of replacements in the middle (NAHALAL): 557

Page 19

Results

Average stall time per transaction improves by 54% in Benchmark A and by 61% in Benchmark B.

In Benchmark A, 8.375% of the transactions cause a replacement in the middle (the public bank) without LSU, versus only 0.09% with LSU (Δ = 8.28%).

In Benchmark B, 8.75% of the transactions cause a replacement in the middle without LSU, versus only 0.02% with LSU (Δ = 8.73%).

Page 20

Conclusions

The LSU policy significantly improves the average stall time per transaction. Therefore:

The LSU policy implemented in the NAHALAL architecture significantly reduces the number of cycles a benchmark needs.

The LSU policy significantly reduces the number of replacements in the middle. Therefore:

The LSU policy implemented in the NAHALAL architecture keeps the hot shared lines in the public bank better.

In our implementation, LRU is activated whenever LSU does not find a line. Therefore:

The LSU policy as we implemented it is always preferable to LRU.