Analysis tools for Evaluation and Performance
Mourad Bouache, PhD, Computer Architecture
Oracle - Nov. 14, 2011
Introduction
Processors are increasingly complex
• Microarchitecture design is becoming more difficult.
Simulator: a very important tool
• Understand the behavior of an instruction during its execution in the processor.
Complex simulator:
• Takes time to prepare and to modify.
Introduction
Simulation
Simulator
Tool
• Simulator : very important tool
• test new concepts
Three characteristics
Complexity of microarchitectures
Modular Simulation
Speed decreases as complexity increases
Contribution : vectorization methodology
Monolithic Simulation
SimpleScalar is the most widely used simulator (in 70% of articles). Like most other simulators, it has a serious drawback: it is monolithic.
Advantage
• Simulation speed.
Disadvantages
• Difficult to update.
• Difficult to extract and compare the simulator components.
Monolithic vs Modular
Modular simulation
Advantages
• Reuse, exchange, and compare simulator modules,
• Better confidence in the simulation (closer to HW),
• Easier to read.
Main drawback:
• Simulation speed slowdown.
Outline
1 Modular simulation environment
2 Acceleration techniques
3 Vectorization of Simulator Modules
4 Experimental framework
5 Results
6 Scheduling process in SystemC
7 Conclusion & future work
Modular simulation environments
• A modular simulation environment describes the system to simulate hierarchically and structurally. To simulate the entire system, the environment includes a scheduler that controls the execution of the different components.
• Key benefits: simulator modules can be reused, compared, and shared.
Simulation models
Acceleration techniques
• Reduction of inputs and simulated programs: MinneSPEC,
• Simulation engine optimization: FastSysC¹ (speedup ×2),
• Distribution of the simulation: DisT,
• Sampling techniques: representative, periodic, and random sampling,
• Transition to TTLM modeling: Timed Transaction Level Modeling.
1. Daniel Gracia Perez et al. FastSysC: a fast SystemC engine.
Acceleration techniques
• Each technique is a compromise between accuracy and simulation speed.
• Vectorization² is a methodology that can be combined with any one of these acceleration techniques.
2. David Parello, Mourad Bouache, and Bernard Goossens. Improving cycle-level modular simulation by vectorization. In Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO'09).
Modular simulation environment
UNISIM³: a modular simulation framework
• UNISIM is a modular simulation framework: each simulator is divided into several modules, and each module corresponds to a hardware block.
• A module is composed of two parts: state and processes.
3. http://www.unisim.org/
Modular simulation environment
UNISIM : A modular simulation framework
• A process is defined in a .sim file as a C++ class
UNISIM: Communication protocol
Communication protocol
• Ports: inports and outports
• Signals
Three signals:
• Processes can be sensitive to the data, accept, and enable signals.
Communication protocol
UNISIM: signals
• The simulation engine (SystemC) wakes up the modules' processes.
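The three-signal exchange can be illustrated with a minimal sketch. It assumes plain C++ structs in place of the real UNISIM port classes: `Link`, `Producer`, `Consumer`, and `run_cycle` are hypothetical names, and the ordering (data, then accept, then enable, then the state update) is inferred from the Functional Unit example in these slides, not taken from the UNISIM sources.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical stand-in for one connection: the three signals of the protocol.
struct Link {
    std::optional<int> data;  // sender -> receiver: value offered this cycle (empty = nothing)
    bool accept = false;      // receiver -> sender: "I can take the value"
    bool enable = false;      // sender -> receiver: "the transfer really happens"
};

// Sender: offers the head of its queue and confirms with enable once accepted.
struct Producer {
    std::deque<int> pending;
    void start_of_cycle(Link& l) {
        if (pending.empty()) l.data.reset();
        else l.data = pending.front();
    }
    void on_accept(Link& l) { l.enable = l.data.has_value() && l.accept; }
    void end_of_cycle(Link& l) { if (l.enable) pending.pop_front(); }
};

// Receiver: accepts while its bounded buffer has room.
struct Consumer {
    std::deque<int> buffer;
    std::size_t capacity = 2;
    void on_data(Link& l) { l.accept = l.data.has_value() && buffer.size() < capacity; }
    void end_of_cycle(Link& l) {
        if (l.enable) {
            assert(l.data.has_value());  // enable is only raised for offered data
            buffer.push_back(*l.data);
        }
    }
};

// One simulated cycle: data, then accept, then enable, then the state update.
inline void run_cycle(Producer& p, Consumer& c, Link& l) {
    p.start_of_cycle(l);
    c.on_data(l);
    p.on_accept(l);
    c.end_of_cycle(l);
    p.end_of_cycle(l);
}
```

With three pending values and a consumer capacity of two, the first two cycles transfer one value each; in the third cycle accept stays low, so enable stays low and the value remains in the producer.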
UNISIM : Communication protocol
Communication between modules
Communication protocol
Communication between modules
Scalability is difficult with modular simulation, for two reasons:
• The communication costs between the simulator modules.
• A process wake-up for each communicating module.
Communication costs
Monolithic Simulator
• Write/read a variable.
Modular Simulator
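The cost difference can be sketched as follows. This is an illustration under stated assumptions: `MonolithicStages` and `Signal` are hypothetical classes written for this example, not the UNISIM or SystemC types. The point is only that a monolithic write is a single store, while a modular write also wakes every process sensitive to the signal.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Monolithic style: two pipeline stages share a plain variable.
struct MonolithicStages {
    int latch = 0;
    void produce(int v) { latch = v; }     // one store
    int consume() const { return latch; }  // one load
};

// Modular style (hypothetical minimal signal): a write stores the value and
// then wakes every process registered as sensitive to the signal.
template <typename T>
class Signal {
public:
    void sensitive(std::function<void()> process) {
        listeners_.push_back(std::move(process));
    }
    void write(const T& v) {
        value_ = v;
        for (auto& wake : listeners_) {
            ++wakeups_;  // each wake-up is an indirect call through the scheduler
            wake();
        }
    }
    const T& read() const { return value_; }
    int wakeups() const { return wakeups_; }

private:
    T value_{};
    std::vector<std::function<void()>> listeners_;
    int wakeups_ = 0;
};
```

Two writes on a signal with one sensitive process cost two wake-ups on top of the two stores; the monolithic version pays only for the stores.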
A New Communication Protocol
Signals array
• Reduces the number of signals.
• Several values of data, accept, and enable are stored temporarily in a signals array.
• Extending the communication protocol between modules in this way is a solution for accelerating the simulation.
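A minimal sketch of the signals-array idea, assuming a hypothetical `SignalArray` class (the real UNISIM vectorized ports are not shown in these slides): the N values are buffered in one array, and the sensitive processes are woken once per `send()` instead of once per value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical vectorized signal: N values (one per simulated configuration)
// travel together, and the sensitive processes are woken once per send().
template <typename T, std::size_t N>
class SignalArray {
public:
    void sensitive(std::function<void()> process) {
        listeners_.push_back(std::move(process));
    }
    // Buffered access: writing an element does not wake anything.
    T& operator[](std::size_t cfg) { return values_[cfg]; }
    const T& operator[](std::size_t cfg) const { return values_[cfg]; }
    // One wake-up round for all N values.
    void send() {
        for (auto& wake : listeners_) {
            ++wakeups_;
            wake();
        }
    }
    int wakeups() const { return wakeups_; }

private:
    std::array<T, N> values_{};
    std::vector<std::function<void()>> listeners_;
    int wakeups_ = 0;
};

// Helper for the usage example: sums all values currently on the signal.
template <std::size_t N>
int sum(const SignalArray<int, N>& s) {
    int total = 0;
    for (std::size_t cfg = 0; cfg < N; ++cfg) total += s[cfg];
    return total;
}
```

Filling four entries and calling `send()` once wakes the listener a single time, where four scalar signals would have required four wake-ups.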
Module Vectorization
A simple and systematic procedure
1 vectorize the module state and ports,
2 add a loop around each process,
3 add calls to send() after the added for loops.
Example: Functional Unit
    class FunctionalUnit : public module
    {
    public:
        inclock clock;
        inport<instr> in;
        outport<instr> out;

        FunctionalUnit(const char* name) : module(name)
        {
            // Sensitivity list: which signals wake up which process
            sensitive_pos_method(start_of_cycle) << clock;
            sensitive_neg_method(end_of_cycle) << clock;
            sensitive_method(on_data_accept) << in.data << out.accept;
        }

        // Positive clock edge: offer the next instruction, if any
        void start_of_cycle()
        {
            if (pipeline.is_ready())
                out.data = pipeline.get();
            else
                out.data.nothing();
        }

        // Woken when in.data or out.accept changes
        void on_data_accept()
        {
            if (in.data.know() && out.accept.know())
            {
                if (!pipeline.is_full() || out.accept)
                    in.accept = true;
                else
                    in.accept = false;
                out.enable = out.accept;
            }
        }

        // Negative clock edge: update the module state
        void end_of_cycle()
        {
            if (out.accept) pipeline.pop();
            if (in.enable) pipeline.push(in.data);
            pipeline.run();
        }

    private:
        Fifo<instr> pipeline;
    };
Module Vectorization
Vectorization procedure
1. vectorize module state and ports.
Before:
    class FunctionalUnit : public module
    {
    public:
        inclock clock;
        inport<instr> in;
        outport<instr> out;
        ...
    private:
        Fifo<instr> pipeline;
    };
After:
    class FunctionalUnit : public module
    {
    public:
        inclock clock;
        inport<instr, NBCFG> in;    // one entry per configuration
        outport<instr, NBCFG> out;
        ...
    private:
        Fifo<instr> pipeline[NBCFG];
    };
Module Vectorization
Vectorization procedure
2. add a loop around the process.
Before:
    ...
    void start_of_cycle()
    {
        if (pipeline.is_ready())
            out.data = pipeline.get();
        else
            out.data.nothing();
    }
    void on_data_accept()
    {
        if (in.data.know() && out.accept.know())
        {
            if (!pipeline.is_full() || out.accept)
                in.accept = true;
            else
                in.accept = false;
            out.enable = out.accept;
        }
    }
    ...
After:
    ...
    void start_of_cycle()
    {
        for (int cfg = 0; cfg < NBCFG; cfg++)
        {
            if (pipeline[cfg].is_ready())
                out.data[cfg] = pipeline[cfg].get();
            else
                out.data[cfg].nothing();
            ...
        }
    }
    void on_data_accept()
    {
        if (in.data.know() && out.accept.know())
        {
            for (int cfg = 0; cfg < NBCFG; cfg++)
            {
                if (!pipeline[cfg].is_full() || out.accept[cfg])
                    in.accept[cfg] = true;
                else
                    in.accept[cfg] = false;
                out.enable[cfg] = out.accept[cfg];
                ...
            }
        }
    }
    ...
Module Vectorization
Vectorization procedure
3. add method calls to send() following the addition of for loops.
Before:
    ...
    void start_of_cycle()
    {
        if (pipeline.is_ready())
            out.data = pipeline.get();
        else
            out.data.nothing();
    }
    void on_data_accept()
    {
        if (in.data.know() && out.accept.know())
        {
            if (!pipeline.is_full() || out.accept)
                in.accept = true;
            else
                in.accept = false;
            out.enable = out.accept;
        }
    }
    ...
After:
    ...
    void start_of_cycle()
    {
        for (int cfg = 0; cfg < NBCFG; cfg++)
        {
            if (pipeline[cfg].is_ready())
                out.data[cfg] = pipeline[cfg].get();
            else
                out.data[cfg].nothing();
        }
        out.data.send();    // sent once, after the loop
    }
    void on_data_accept()
    {
        if (in.data.know() && out.accept.know())
        {
            for (int cfg = 0; cfg < NBCFG; cfg++)
            {
                if (!pipeline[cfg].is_full() || out.accept[cfg])
                    in.accept[cfg] = true;
                else
                    in.accept[cfg] = false;
                out.enable[cfg] = out.accept[cfg];
            }
            in.accept.send();
            out.enable.send();
        }
    }
    ...
Example: Vectorized Functional Unit
    class FunctionalUnit : public module
    {
    public:
        inclock clock;
        inport<instr, NBCFG> in;
        outport<instr, NBCFG> out;

        FunctionalUnit(const char* name) : module(name)
        {
            // Sensitivity list
            sensitive_pos_method(start_of_cycle) << clock;
            sensitive_neg_method(end_of_cycle) << clock;
            sensitive_method(on_data_accept) << in.data << out.accept;
        }

        void start_of_cycle()
        {
            for (int cfg = 0; cfg < NBCFG; cfg++)
            {
                if (pipeline[cfg].is_ready())
                    out.data[cfg] = pipeline[cfg].get();
                else
                    out.data[cfg].nothing();
            }
            out.data.send();
        }

        void on_data_accept()
        {
            if (in.data.know() && out.accept.know())
            {
                for (int cfg = 0; cfg < NBCFG; cfg++)
                {
                    if (!pipeline[cfg].is_full() || out.accept[cfg])
                        in.accept[cfg] = true;
                    else
                        in.accept[cfg] = false;
                    out.enable[cfg] = out.accept[cfg];
                }
                in.accept.send();
                out.enable.send();
            }
        }

        void end_of_cycle()
        {
            for (int cfg = 0; cfg < NBCFG; cfg++)
            {
                if (out.accept[cfg]) pipeline[cfg].pop();
                // Each configuration pushes its own input entry
                if (in.enable[cfg]) pipeline[cfg].push(in.data[cfg]);
                pipeline[cfg].run();
            }
        }

    private:
        Fifo<instr> pipeline[NBCFG];
    };
Simulator Vectorization
Multi-core simulation
• In our study, we simulated multi-core configurations with 2, 4, 8, 16, 32, and 64 cores.
OoOSim: Out of Order Simulator
OoOSim⁴ models a generic superscalar out-of-order processor. The baseline simulator includes a 4-way superscalar core with an L1 instruction cache, an L1 write-back data cache, a bus, and a DRAM.
4. Mourad Bouache, David Parello, Bernard Goossens. Acceleration of Modular simulation. In International Supercomputing Conference (ISC09), Hamburg, Germany, June 2009.
OoOSim : Out of Order Simulator
OoOSim : 12 modules
1 Fetcher,
2 AllocatorRenamer,
3 Dispatcher,
4 Scheduler,
5 RegisterFile,
6 Ret-Broadcast and CDBA: Common Data Bus Arbiter,
7 IntegerUnit, FloatingPointUnit and AddressGenerationUnit,
8 LoadStoreQueue,
9 Data caches L1 and L2,
10 Instruction cache L1,
11 Memory DRAM,
12 Reorder Buffer.
OoOSim : Out of Order Simulator
More than 15,000 lines of code; 12 modules connected through 187 signals.
Benchmarks
Benchmarks: MiBench
• Simulations were run on MiBench, which is divided into six suites, each targeting a specific embedded-application market: Automotive, Network, Security, Consumer Devices, Office Automation, and Telecommunications.

Auto./Industrial     Consumer   Office         Network    Security   Telecomm.
susan (edges)        jpeg       stringsearch   dijkstra   sha        FFT
susan (corners)      -          -              -          rijndael   -
susan (smoothing)    -          -              -          -          -
Performance evaluation
Simulation machine
• The performance evaluation was carried out on a cluster of 30 dual-core Intel Xeon 5148 processors clocked at 2.33 GHz with a 4 MB L2 cache.
Results : simulation speed (without vectorization)
simulation speed (with vectorization)
Results : speedup
Why ... ?
Instrumentation of the FastSysC code
• Cycle counters (RDTSC: Read Time Stamp Counter) measure:
1 The time spent in the FastSysC scheduler (transit time).
2 The time spent in the processes.
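This kind of instrumentation can be sketched as follows. The names `read_ticks`, `Profile`, and `run_instrumented` are hypothetical and do not come from FastSysC; the sketch only shows how time-stamp-counter readings can split one scheduling round into process time and the remaining scheduler time. It uses the `__rdtsc()` intrinsic on x86 with GCC or Clang and falls back to `std::chrono::steady_clock` elsewhere, which is an assumption made for portability.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

// Reads a tick counter: the time stamp counter on x86, a steady clock otherwise.
inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Accumulated measurements (hypothetical structure for this sketch).
struct Profile {
    std::uint64_t scheduler_ticks = 0;  // loop overhead around the processes
    std::uint64_t process_ticks = 0;    // time spent inside the processes
    std::uint64_t process_calls = 0;
};

// Runs each process once; its ticks are charged to process_ticks and the
// rest of the round to scheduler_ticks.
template <typename Processes>
void run_instrumented(Processes& processes, Profile& prof) {
    const std::uint64_t round_start = read_ticks();
    std::uint64_t in_processes = 0;
    for (auto& process : processes) {
        const std::uint64_t t0 = read_ticks();
        process();
        in_processes += read_ticks() - t0;
        ++prof.process_calls;
    }
    const std::uint64_t total = read_ticks() - round_start;
    prof.process_ticks += in_processes;
    prof.scheduler_ticks += total - in_processes;
}
```

The tick values depend on the machine and are only meaningful as ratios between the two counters accumulated over many rounds.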
FastSysC transit time (without/with vectorization)
[Figure: two plots of the FastSysC transit time, without and with vectorization]
Conclusion
Results
• To address the need for faster simulation, we proposed a methodology for developing modules in a modular simulator.
• This methodology is based on a new signal communication protocol.
Vectorized simulation improves scalability.
Results Discussion
Vectorization ...
• improves simulation speed.
• allows resources to be duplicated while limiting the scheduler's overhead in simulation time.
• can be used in conjunction with other speed-up techniques, such as sampling or reduction of the test programs.
Conclusion
Our contribution aims to improve simulation speed in modular simulators by offering a simple and systematic development method based on the vectorization of the simulator modules.
Conclusion
Simplescalar is not a multi-core simulator
In focus
Other idea ...
• Vectorization
  We wish to compare the results of this methodology with TTLM (Timed Transaction Level Modeling).
Merci, Thank you, Tack
QUESTIONS ?
Back-up slides
Post-doc research work
• Instruction Level Parallelism (ILP)
  Goal: understand the general structure of an execution and the parallelism it offers.
• PerPi: A Tool to Measure Instruction Level Parallelism
  • http://kenny.univ-perp.fr/PerPi/
  • A Pin tool (Pin is a free, programmable tool from Intel),
  • computes the instruction dependency graph,
  • computes, for each instruction in the run, its cycle on the ideal machine,
  • analyzes the structure of instruction-level parallelism:
    • parallelism in loops,
    • local and global parallelism,
    • parallelism across function calls ("CALL").
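The "cycle on the ideal machine" computation can be sketched as follows. This is not the PerPi implementation: the sketch assumes unlimited resources, unit latency, and only true register dependencies (read after write), and `Instr`, `IlpResult`, and `measure_ilp` are hypothetical names. Each instruction executes one cycle after the latest producer of its sources, and the ILP is the instruction count divided by the length of the critical path.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>
#include <vector>

struct Instr {
    int dest;                  // destination register, -1 if none
    std::vector<int> sources;  // source registers
};

struct IlpResult {
    std::vector<int> cycle;  // cycle[i]: earliest cycle of instruction i
    int depth = 0;           // length of the critical path, in cycles
    double ilp = 0.0;        // instructions per cycle on the ideal machine
};

// Walks the trace in program order and places each instruction one cycle
// after the latest instruction that produces one of its sources.
inline IlpResult measure_ilp(const std::vector<Instr>& trace) {
    IlpResult r;
    std::unordered_map<int, int> ready;  // register -> cycle producing its latest value
    for (const Instr& in : trace) {
        int c = 1;  // no dependency: the instruction can run in the first cycle
        for (int src : in.sources) {
            auto it = ready.find(src);
            if (it != ready.end()) c = std::max(c, it->second + 1);
        }
        if (in.dest >= 0) ready[in.dest] = c;
        r.cycle.push_back(c);
        r.depth = std::max(r.depth, c);
    }
    if (r.depth > 0) r.ilp = static_cast<double>(trace.size()) / r.depth;
    return r;
}
```

For a six-instruction trace whose longest dependency chain has three instructions, the sketch reports a depth of 3 and an ILP of 2.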
Back-up slides
Pin Tool
Back-up slides
TTLM
Back-up slides
SystemC and FastSysC
SystemC contains a scheduler that manages signals and decides which processes to start. It has sequential processes (sensitive to the clock) and combinational processes (sensitive to input ports).
FastSysC uses a mixture of static and dynamic scheduling to avoid waking processes unnecessarily, which optimizes the simulation engine.
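The dynamic part of such scheduling can be illustrated with a deliberately simplified sketch. It is neither the SystemC kernel nor FastSysC: real SystemC separates evaluate and update phases, and the static scheduling of FastSysC is not modeled here. `Wire` and `Scheduler` are hypothetical classes. Sequential processes run on every clock cycle; combinational processes run only when one of their input wires actually changes value.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

class Scheduler;

// Hypothetical integer signal: a write that changes the value schedules the
// combinational processes sensitive to it.
class Wire {
public:
    explicit Wire(Scheduler& s) : sched_(s) {}
    void sensitive(std::size_t process_id) { readers_.push_back(process_id); }
    int read() const { return value_; }
    void write(int v);

private:
    Scheduler& sched_;
    int value_ = 0;
    std::vector<std::size_t> readers_;
};

class Scheduler {
public:
    void add_sequential(std::function<void()> p) { sequential_.push_back(std::move(p)); }
    std::size_t add_combinational(std::function<void()> p) {
        combinational_.push_back(std::move(p));
        return combinational_.size() - 1;
    }
    void wake(std::size_t id) { pending_.push_back(id); }
    // One clock cycle: run the clocked processes, then drain the wake-up queue.
    void cycle() {
        for (auto& p : sequential_) p();
        while (!pending_.empty()) {
            const std::size_t id = pending_.front();
            pending_.pop_front();
            ++activations_;
            combinational_[id]();
        }
    }
    int activations() const { return activations_; }

private:
    std::vector<std::function<void()>> sequential_;
    std::vector<std::function<void()>> combinational_;
    std::deque<std::size_t> pending_;
    int activations_ = 0;
};

inline void Wire::write(int v) {
    if (v == value_) return;  // unchanged value: no unnecessary wake-up
    value_ = v;
    for (std::size_t id : readers_) sched_.wake(id);
}
```

In a chain of three combinational processes, a wire that keeps the same value from one cycle to the next does not wake its reader, so the second cycle needs fewer activations than the first.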
Back-up slides
Monolithic
Modular
Parallel Simulation
Sampling I
Sampling II
MiBench
Use of OoOSim
Stringsearch
flight-trace simulation
execution-driven simulation
trace-driven simulation
UNISIM Example
UNISIM History