
Page 1

“Nahalal: Cache Organization for Chip Multiprocessors”

New LSU Policy

By: Ido Shayevitz and Yoav Shargil

Supervisor: Zvika Guz

Software Systems Lab

Department of Electrical Engineering, Technion

Page 2

NAHALAL PROJECT

This project is based on the article "Nahalal: Cache Organization for Chip Multiprocessors" by Zvika Guz, Idit Keidar, Avinoam Kolodny, and Uri C. Weiser.

The NAHALAL project deals with a chip-multiprocessor environment.

Project goal: to provide an appropriate organization of the cache memory for a multiprocessor environment.

Page 3

Project Steps

1. Read the article and look for a suitable idea.

2. Learn the Simics simulator.

3. Install Simics 3 on the project host.

4. Learn the g-cache module C code and the NAHALAL code.

5. Write the implementation code of our project.

6. Define and write the benchmarks.

7. Test and debug our model.

8. Run simulations, collect statistics, and draw conclusions.

9. Write the project book, presentation, and website.

Page 4

NAHALAL ARCHITECTURE

The NAHALAL architecture defines the memory banks of the L2 cache.

The basic distinction is between private banks and a public (shared) bank.

[Figure: CIM layout vs. NAHALAL layout]

Page 5

NAHALAL ARCHITECTURE

Placement policy

Replacement policy from the private bank: LRU

Replacement policy from the public bank: LRU

(See the LRU sketch below.)

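For illustration, here is a minimal sketch of LRU victim selection within one bank set. The associativity, field names, and timestamp-based aging are assumptions made for this example; they are not the actual g-cache data structures.

#include <stdint.h>

#define WAYS 16                      /* assumed associativity of one bank set */

typedef struct {
    uint64_t tag;                    /* line tag */
    uint64_t last_access;            /* time of the most recent access */
    int      valid;
} line_t;

/* Return the index of the LRU victim in one set:
 * prefer an invalid way, otherwise the valid way touched longest ago. */
static int lru_victim(const line_t set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_access < set[victim].last_access)
            victim = w;
    }
    return victim;
}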

Page 6

LSU Improvement

Placement policy

Replacement policy from the private bank: LRU

Replacement policy from the public bank: LSU


Page 7

LSU Implementation: Way 1

A shift register with N cells for each CPU.

Each cell in the shift register holds a line number.

On an eviction triggered by CPUi: for each line in CPUi's shift register, check which line appears least in the other CPUs' shift registers, and evict that one (see the sketch at the end of this slide).

With a 64-byte line size and a 1 MB public bank, 14 bits are needed per cell to identify a line. If N = 8, the memory overhead is 8 CPUs * 8 cells * 14 bits = 896 bits.

Only the 8 lines recorded in the shift register are candidates for replacement!

A line that no one accesses will be stuck in the public bank forever.

Complicated, slow algorithm in hardware.
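A rough sketch of the Way 1 selection rule follows, assuming 8 CPUs and hypothetical array and function names; it only illustrates the bookkeeping described above, not the actual hardware or g-cache code.

#define NUM_CPUS 8
#define N_CELLS  8                          /* N recently used lines per CPU */

/* Per-CPU shift register: the last N public-bank lines this CPU touched. */
static unsigned recent_lines[NUM_CPUS][N_CELLS];

/* Count how often 'line' appears in the other CPUs' shift registers. */
static int uses_by_others(unsigned line, int cpu)
{
    int count = 0;
    for (int c = 0; c < NUM_CPUS; c++) {
        if (c == cpu)
            continue;
        for (int i = 0; i < N_CELLS; i++)
            if (recent_lines[c][i] == line)
                count++;
    }
    return count;
}

/* On an eviction by 'cpu': among the N lines in its own shift register,
 * pick the one least used by the other CPUs. */
static unsigned way1_victim(int cpu)
{
    unsigned victim = recent_lines[cpu][0];
    int best = uses_by_others(victim, cpu);
    for (int i = 1; i < N_CELLS; i++) {
        int uses = uses_by_others(recent_lines[cpu][i], cpu);
        if (uses < best) {
            best = uses;
            victim = recent_lines[cpu][i];
        }
    }
    return victim;
}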

Page 8

LSU Implementation: Way 2

A shift register with N cells for each line.

Each cell in the shift register holds a CPU number.

On an eviction triggered by CPUi: search all the shift registers for the one that contains CPUi the most times, and evict that line (see the sketch at the end of this slide).

With 8 CPUs, 3 bits are needed per cell. With a 64-byte line size and a 1 MB public bank there are 2^14 lines in the public bank. If N = 8, the overhead is 2^14 * 8 * 3 = 393,216 bits, i.e. a 37.5% memory overhead.

A 37.5% memory overhead can cause a large distance between the lines in the public bank and the CPUs.

A 37.5% memory overhead also increases power consumption.

Complicated, slow algorithm in hardware.

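A corresponding sketch for the Way 2 selection rule, again with hypothetical names and 8 CPUs assumed; it shows only the selection over all per-line shift registers, ignoring how the registers would be organized in hardware.

#define NUM_LINES (1 << 14)                 /* 1 MB public bank / 64 B lines */
#define N_CELLS   8                         /* last N accessors per line */

/* Per-line shift register: the IDs of the last N CPUs that accessed it. */
static unsigned char last_cpus[NUM_LINES][N_CELLS];

/* On an eviction by 'cpu': evict the line whose shift register contains
 * 'cpu' the most times, i.e. the line used mostly by the evicting CPU
 * and shared the least with the others. */
static unsigned way2_victim(unsigned cpu)
{
    unsigned victim = 0;
    int best = -1;
    for (unsigned line = 0; line < NUM_LINES; line++) {
        int count = 0;
        for (int i = 0; i < N_CELLS; i++)
            if (last_cpus[line][i] == cpu)
                count++;
        if (count > best) {
            best = count;
            victim = line;
        }
    }
    return victim;
}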

Page 9

LSU Implementation: Way 3

A shift register with N cells for each line.

Each cell in the shift register holds a CPU number.

On an eviction triggered by CPUi: for each shift register, XOR every cell with the ID of CPUi. A shift register whose XOR results are all 0 (i.e., a line whose last N accessors were all CPUi) identifies the chosen victim. If none produces 0, do regular LRU (see the sketch at the end of this slide).

To reduce the memory overhead, define N = 4. The overhead is then 2^14 * 4 * 3 = 196,608 bits, i.e. an 18.75% memory overhead.

Simple, fast algorithm in hardware.

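A sketch of the Way 3 selection under the same assumptions (per-line shift registers of CPU IDs, N = 4, hypothetical names); the fall-back to the regular LRU choice is represented by a placeholder function.

#define NUM_LINES (1 << 14)                 /* 1 MB public bank / 64 B lines */
#define N_CELLS   4                         /* N = 4 to reduce the overhead */

/* Per-line shift register: the IDs of the last N CPUs that accessed it. */
static unsigned char last_cpus[NUM_LINES][N_CELLS];

extern unsigned lru_fallback(void);         /* regular LRU choice (placeholder) */

/* On an eviction by 'cpu': XOR every cell with the CPU ID.  A register
 * whose cells all XOR to 0 belongs to a line whose last N accesses were
 * all by this CPU -- an unshared line, so it is the chosen victim.
 * If no register XORs to 0, fall back to regular LRU. */
static unsigned way3_victim(unsigned cpu)
{
    for (unsigned line = 0; line < NUM_LINES; line++) {
        unsigned diff = 0;
        for (int i = 0; i < N_CELLS; i++)
            diff |= last_cpus[line][i] ^ cpu;   /* 0 only if cell == cpu */
        if (diff == 0)
            return line;
    }
    return lru_fallback();
}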

Page 10

Simics Simulator

An in-order instruction-set simulator; it simulates code running on a virtual machine.

Each instruction runs in one time step.

Collecting statistics during simulations.

Models can be built for all types of HW units and devices.

Memory Transactions, memory spaces, timing-model interface.

The g-cache module serves as the basis for defining the memory hierarchy and cache modules.

Stall time is the key metric in cache-model simulations.

Multiprocessor simulation uses round-robin scheduling with the default quantum.

Page 11

The Implementation

In the project book…

Page 12

Building Our Project

The project is built using the Simics g-cache Makefile:

shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake clean
Removing: g-cache
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib> gmake g-cache
Generating: modules.cache
=== Building module "g-cache" ===
Creating dependencies: module_id.c
Creating dependencies: sorter.c
Creating dependencies: splitter.c
Creating dependencies: gc.c
Creating dependencies: gc-interface.c
Creating dependencies: gc-attributes.c
Creating dependencies: gc-cyclic-repl.c
Creating dependencies: gc-lru-repl.c
Creating dependencies: gc-random-repl.c
Creating dependencies: gc-common-attributes.c
Creating dependencies: gc-common.c
Creating exportmap.elf
Compiling gc-common.c
Compiling gc-common-attributes.c
Compiling gc-random-repl.c
Compiling gc-lru-repl.c
Compiling gc-cyclic-repl.c
Compiling gc-attributes.c
Compiling gc-interface.c
Compiling gc.c
Compiling module_id.c
Linking g-cache.so
shargil@slab-i05:/home/users/shargil/Simics/simics-3.0.30/x86-linux/lib>

Page 13

Simulation Script

The simulation configuration is defined using a Python script:

Page 14

Writing Benchmarks

The benchmarks are written in the simulated target's console:

Page 15

Writing Benchmarks

Threads are created with the pthread library.

Each thread is pinned to a CPU using the sched library (see the sketch at the end of this slide).

Parallel code is written in the benchmark.

OS code and pthread library code also run in parallel.

Each benchmark is run first without LSU and then with LSU.

Our goals:

1. To show that LSU reduces cycles in parallel benchmarks.

2. To show that the improvement is larger in "LSU-dedicated" benchmarks.
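As an illustration of the thread setup, here is a minimal, hypothetical sketch of creating two threads and pinning each to its own CPU with sched_setaffinity; the function names, CPU numbers, and the empty benchmark body are placeholders, not the project's actual benchmark code.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling thread to the given CPU, then run the benchmark body. */
static void *worker(void *arg)
{
    int cpu = (int)(long)arg;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this thread */
        perror("sched_setaffinity");
        exit(1);
    }

    /* ... benchmark body for this thread goes here ... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);   /* thread 1 -> CPU 1 */
    pthread_create(&t2, NULL, worker, (void *)2L);   /* thread 2 -> CPU 2 */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}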

Page 16

Benchmark A

Each thread is associated with a different CPU; it reads from the shared (public) array and then writes to its own designated (private) array:

// read the shared array
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY[j];

// write to the designated array
accum = 0;
if (thread_num == 1) {
    des_arrray_ptr = DESGINATED_ARRAY1;
} else if (thread_num == 2) {
    des_arrray_ptr = DESGINATED_ARRAY2;
} else {
    printf("Error 0x100: thread_num has unexpected value\n");
    exit(1);
}

for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < DESIGNATED_ARRAY_SIZE; j++)
        *(des_arrray_ptr + j) = 1;

return NULL;
}

Page 17

Benchmark B

Each thread is associated with a different CPU; it reads the shared array (public), then only its own half of it (the private part), and then a second shared array (public again):

// read the first shared array - moves it into the public bank
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY1[j];

// now each thread reads only its own half of the shared array
if (thread_num == 1) {
    for (i = 0; i < NUM_OF_ITERATIONS; i++)
        for (j = 0; j < (SHARED_ARRAY_SIZE/2); j++)
            tmp = SHARED_ARRAY1[j];
} else if (thread_num == 2) {
    for (i = 0; i < NUM_OF_ITERATIONS; i++)
        for (j = (SHARED_ARRAY_SIZE/2); j < SHARED_ARRAY_SIZE; j++)
            tmp = SHARED_ARRAY1[j];
}

// read the second shared array
for (i = 0; i < NUM_OF_ITERATIONS; i++)
    for (j = 0; j < SHARED_ARRAY_SIZE; j++)
        tmp = SHARED_ARRAY2[j];

return NULL;
}

Page 18

Collecting Statistics Example

Cache statistics: l2c
-----------------
Total number of transactions: 610349
Total memory stall time: 31402835
Total memory hit stall time: 28251635
Device data reads (DMA): 0
Device data writes (DMA): 0
Uncacheable data reads: 17
Uncacheable data writes: 30738
Uncacheable instruction fetches: 0
Data read transactions: 403488
Total read stall time: 17488735
Total read hit stall time: 14383135
Data read remote hits: 0
Data read misses: 10352
Data read hit ratio: 97.43%
Instruction fetch transactions: 0
Instruction fetch misses: 0
Data write transactions: 176106
Total write stall time: 4687600
Total write hit stall time: 4687600
Data write remote hits: 0
Data write misses: 0
Data write hit ratio: 100.00%
Copy back transactions: 0
Number of replacements in the middle (NAHALAL): 557

Page 19

Results

Average stall time per transaction improves by 54% in Benchmark A and by 61% in Benchmark B.

In Benchmark A, 8.375% of the transactions cause a replacement in the middle (the public bank) without LSU, versus only 0.09% with LSU (Δ = 8.28%).

In Benchmark B, 8.75% of the transactions cause a replacement in the middle without LSU, versus only 0.02% with LSU (Δ = 8.73%).

Page 20

Conclusions

The LSU policy significantly improves the average stall time per transaction. Therefore:

The LSU policy implemented in the NAHALAL architecture significantly reduces the number of cycles a benchmark needs.

The LSU policy significantly reduces the number of replacements in the middle. Therefore:

The LSU policy implemented in the NAHALAL architecture keeps the hot shared lines in the public bank better.

In our implementation, LRU is activated whenever LSU does not find a line. Therefore:

The LSU policy as we implemented it is always preferable to LRU.