CS 838: Pervasive Parallelism
Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Jake Adriaens [email protected]
Dan Gibson [email protected]
Instructor: Mark D. Hill
Problem
• Simulation is (really) slow!
– Simics alone runs at ~5 MIPS (fast!)
– Add Ruby ~ 50 KIPS
– Add Opal ~ 20 KIPS
• Fast simulations lead to faster evaluation of new ideas.
– Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, less useful for development.
• Fast simulations useful for educational purposes
– Remember how long it took to simulate HW 5, HW 6?
» Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware
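As a back-of-the-envelope illustration (the 10-billion-instruction workload size is an assumed, illustrative figure, not from the slides), the MIPS/KIPS numbers above translate to:

\text{Ruby attached } (\sim 50\ \text{KIPS}): \quad \frac{10^{10}\ \text{instructions}}{5\times10^{4}\ \text{instr./s}} = 2\times10^{5}\ \text{s} \approx 2.3\ \text{days}

\text{Simics alone } (\sim 5\ \text{MIPS}): \quad \frac{10^{10}\ \text{instructions}}{5\times10^{6}\ \text{instr./s}} = 2\times10^{3}\ \text{s} \approx 33\ \text{minutes}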
More Motivation – Why Parallelize?
Chips currently look like this:
[Die photo: Dual-Core AMD Opteron, showing a couple of cores, on-chip cache, and memory & I/O control. From: Microprocessor Report, Best Servers of 2004]
More Motivation – Why Parallelize?
Soon, chips may look like this:
[Diagram: a chip with eight cores, an on-chip interconnect, and multiple cache ($) banks]

More cores! Many more threads

The free lunch is over: to get speedup out of multithreaded processors, programmers must (for now) implement parallel programs.
Summary
• Good News: Found parallelism in GEMS
– Ruby's event queue often contains independent events
– Opal has some implicit parallelism, as it simulates many logically independent processors
• Bad News: Speedup potential is limited
– In most cases, execution within Simics dominates execution time
– Amdahl's Law suggests parallelization of GEMS will yield only small performance gains
• Good News: Discovered inefficiencies
– The way GEMS uses Simics greatly affects Simics performance
– Isolated troublesome API calls and stalled-processor effects
• Bad News: Simics isn't very thread-friendly
– No thread-safe functionality
– Calling the Simics API requires a (costly) thread switch!
Summary
• More Bad News: Parallelization of Ruby was not (entirely) successful
– Demonstrated little/no performance gain
– Suffers from deadlock
» We have a good excuse for this…
– Nondeterministic
» Fixable, minor effect
– Assumptions of non-concurrent execution
» Ready()/Operate() pairs
What Next?
• Overview of Simics/Ruby/Opal
– Lengthy example
• Profiling Experiments
– Description of profiling experiments
– Results
• Effects Ruby / Opal have on Simics
– "Null" module experiments
• Parallel Ruby
– …and its catastrophic failure
• Observations
• Conclusions
Simics / Ruby / Opal Overview - 1
[Figure: GEMS overview. Ruby (with its event queue, E1-E5) and Opal (the detailed processor model) are Simics loadable modules; drivers include Simics, Opal, a random tester, microbenchmarks (deterministic contended locks), and a trace file. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]
Opal + Ruby + Simics Operation

[Animated walkthrough spanning six slides; graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems. The figure shows Simics, Ruby's event queue (E1-E5), and Opal's detailed processor model stepping through the example program below.]

Example program:

loop:      add  R2 R2 R3
           beqz R2 loop
           add  R1 R2 R3
           sub  R7 R8 R9
           ld   R2 A
           ld   R8 B
           beq  R2 R8 eq
           call my_func1
eq:        ld   R2 A
           beq  R2 R4 eq
           call my_func2

my_func1:  ld   R8 C
           ret

Walkthrough (annotations recovered from the slides):
1. Install Module (Ruby and Opal); Start Sim; Opal issues Instruction Fetches; I-Fetch Complete.
2. The fetched instructions are decoded (API Calls for Decoding) while further Instruction Fetches proceed.
3. Step 1 Instr.: Opal asks Simics to step the leading instruction.
4. Step 3 Instrs.: the loads ld A and ld B enter Ruby's memory system.
5. The loads complete (A=1, B=1); ld C follows.
6. Step 4 Instrs.; another I-Fetch call. Simple, Right?
Finding Parallelism
• Lots of parallelism opportunities in the example!
– Ruby/Opal (as described) could be run by separate threads!
– Ruby is a discrete event simulator…
» Can we apply Fujimoto's PDES strategies directly?
• Places we found parallelism:
– Ruby's Event Queue (Experiment 1)
– Opal in general, on a per-processor basis (Experiment 2)
– Modular structure (not explored)
• But how much speedup can we gain through parallelism?
Experiment 1: Ruby’s Event Queue
• Ruby is already a discrete event simulator (DES)
– Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up! (a minimal sketch of the idea follows this list)
– Already has an implicit lookahead of 1, due to existing event scheduling constraints
• How many events are available for processing in a given cycle of the event queue?
– Too few could limit lookahead properties
• How long does a typical event execute?
– Short events could make the queue itself a bottleneck
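A minimal sketch of the PDES-style idea (illustrative code only; EventQueue and its API are hypothetical stand-ins for Ruby's queue, not GEMS code): every event scheduled for the current cycle is executed concurrently, and the lookahead of 1 means an event may only schedule work for a later cycle, so same-cycle events never create each other.

// Discrete event queue that drains one cycle at a time and runs that
// cycle's events in parallel. Events may schedule new events, but only
// for a later cycle (lookahead of 1).
#include <cstdio>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class EventQueue {
public:
    using Event = std::function<void(EventQueue&)>;

    void schedule(long cycle, Event e) {
        std::lock_guard<std::mutex> g(mutex_);   // workers may schedule concurrently
        queue_.emplace(cycle, std::move(e));
    }

    long now() const { return now_; }

    void run() {
        while (!queue_.empty()) {
            now_ = queue_.begin()->first;
            std::vector<Event> batch;            // drain everything in the current cycle
            for (auto it = queue_.begin();
                 it != queue_.end() && it->first == now_;
                 it = queue_.erase(it))
                batch.push_back(std::move(it->second));
            std::vector<std::thread> workers;    // run same-cycle events concurrently
            for (auto& e : batch)
                workers.emplace_back([&e, this] { e(*this); });
            for (auto& t : workers) t.join();
        }
    }

private:
    std::mutex mutex_;
    std::multimap<long, Event> queue_;
    long now_ = 0;
};

int main() {
    EventQueue q;
    for (int i = 0; i < 4; ++i)                  // four independent events in cycle 0
        q.schedule(0, [i](EventQueue& q) {
            std::printf("event %d ran in cycle %ld\n", i, q.now());
            if (i == 0)                          // lookahead of 1: next cycle at the earliest
                q.schedule(q.now() + 1, [](EventQueue& q) {
                    std::printf("follow-up in cycle %ld\n", q.now());
                });
        });
    q.run();
}

Whether this pays off depends on exactly the two questions above: how many events share a cycle, and how long each one runs relative to the queue overhead.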
Results 1 – Ruby’s Event Queue
[Charts: distributions of per-cycle event counts and of event durations, each as a percentage of all events]

Simulation Time: 74.0 s
Time in Event Queue: 15.4 s
Perf. Improvement w/ 8x Speedup: 1.22
Perf. Improvement w/ Infinite Speedup: 1.26

Simics Time = SimTime - RubyTime ≈ 80% of the total
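A worked Amdahl's Law check of the two rightmost figures above, treating the event-queue time as the parallelizable fraction:

p = \frac{15.4\ \text{s}}{74.0\ \text{s}} \approx 0.21, \qquad
S(8) = \frac{1}{(1-p) + p/8} \approx 1.22, \qquad
S(\infty) = \frac{1}{1-p} \approx 1.26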
Experiment 2 – Opal’s Per-Processor Parallelism
• Opal simulates multiple logically independent processors
– Simulated processor independence => Parallelism
• Use one thread per simulated processor? (a minimal sketch follows this list)
– Raises work imbalance issues
» In practice, the work imbalance is tolerable
– Processors are only logically independent
» A common sequential bottleneck, Simics, is shared between all Opal processors
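A minimal sketch of the one-thread-per-simulated-processor idea (illustrative only, not Opal's code; ProcModel is a hypothetical stand-in, and synchronization with Simics is deliberately left out since Simics is the shared sequential bottleneck):

// One worker thread per simulated processor; each advances its own model
// independently. Anything that must go through Simics stays on the main
// thread, which is where the sequential bottleneck would reappear.
#include <cstdio>
#include <thread>
#include <vector>

struct ProcModel {
    int id = 0;
    long retired = 0;
    void simulate(long instructions) {
        // Per-processor work: independent of every other processor.
        for (long i = 0; i < instructions; ++i) ++retired;
    }
};

int main() {
    constexpr int kProcs = 4;
    std::vector<ProcModel> procs(kProcs);
    std::vector<std::thread> workers;

    for (int i = 0; i < kProcs; ++i) {
        procs[i].id = i;
        // Work imbalance shows up naturally: give each processor a
        // different amount of work, as real workloads would.
        workers.emplace_back([&p = procs[i], i] { p.simulate(1000 * (i + 1)); });
    }
    for (auto& t : workers) t.join();   // main thread: where Simics calls would serialize

    for (const auto& p : procs)
        std::printf("processor %d retired %ld instructions\n", p.id, p.retired);
}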
Experiment 2 – Opal’s Per-Processor Parallelism
[Pie chart: execution time in an Opal+Simics simulation, broken down into Opal itself, the SIM_continue API call, the SIM_read_phys_memory API call, the SIM_break_simulation API call, and all other API calls]

Best Parallel Opal Speedup <= 40%!
Experiment 2 – Opal’s Per-Processor Parallelism
• Why is SIM_continue so slow?
– Opal uses SIM_continue to logically progress the simulation by a small number (1-4) of instructions at a time
– SIM_continue performs extensive start-up and tear-down optimization, expecting large (10,000+) step sizes
– Increasing Opal's stepping size decreases total SIM_continue time significantly, but makes fine-grained simulation difficult
• Why is SIM_read_phys_memory so slow?
– One call to SIM_read_phys_memory costs ~1 us of execution time
» Reads from a proprietary-format compressed file
– Used by Opal once for every load instruction
» Loads are quite frequent!
Experiment 3 – Simics API Calls
• Can there be more bad news?
– Yes.
• How does Simics react to alien threads using its API?
Thread 5 returned from Simics.
patch PC: 0x1034e68 0x1034e64
*** ASSERTION ERROR:
in line 7530
of file 'v9_service_routines_1.c'
with RCSID '@(#) $Id: v9.sg,v 33.0.2.31 2004/10/08 12:23:07 am Exp $'
Please report this.
Simics will now self-signal an abort.
patch NPC: 0x1034e6c 0x1034e68
*** Simics getting shaky, switching to 'safe' mode.
*** Simics (thread 31) received an abort signal, probably an assertion.
*** thread 31 exiting.
(The first line above is our thread's output, having just returned from an API call. Something our thread does crashes one of the Simics threads!)
Experiment 3 – Simics API Calls
• Simics forbids calling the Simics API from alien threads
– SIM_thread_safe_callback is the only mechanism for using the interface from other threads (a sketch of the resulting pattern follows the table)
» Slow (see table)
» Non-blocking
» Must have released the "Main Simics Thread" (MST)

Call Type                       First Call Latency (cycles)   Median Call Latency (cycles)
From MST                        1765                          100
From Spawned Thread             845565725                     3212
Spawned Thread, MST Inf. Loop   ∞                             ∞
Normal Func. Call               122                           105
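A sketch of the pattern this forces on alien threads (illustrative only; queue_on_mst() is a hypothetical stand-in for SIM_thread_safe_callback, whose exact signature is not reproduced here, and the blocking wait is ours to add since the real call is non-blocking):

// The alien thread may not call the API itself: it hands a callback to the
// main Simics thread (MST) and blocks until the MST has executed it. The
// round trip through the MST is what makes the spawned-thread latencies in
// the table so large.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mst_mu;
std::queue<std::function<void()>> mst_work;   // work the MST will drain

// Stand-in for SIM_thread_safe_callback: enqueue work for the main thread.
void queue_on_mst(std::function<void()> f) {
    std::lock_guard<std::mutex> g(mst_mu);
    mst_work.push(std::move(f));
}

int main() {
    std::mutex mu;
    std::condition_variable cv;
    bool done = false;
    int result = 0;

    // Alien thread: wants the result of an "API call".
    std::thread alien([&] {
        queue_on_mst([&] {                       // runs later, on the MST
            int r = 42;                          // imagine the real API call here
            { std::lock_guard<std::mutex> g(mu); result = r; done = true; }
            cv.notify_one();
        });
        std::unique_lock<std::mutex> lk(mu);     // block until the MST ran it:
        cv.wait(lk, [&] { return done; });       // this wait is the costly part
        std::printf("alien thread got %d\n", result);
    });

    // "Main Simics thread": must be free (not in an infinite loop) to drain
    // queued callbacks, otherwise the alien thread waits forever.
    while (true) {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> g(mst_mu);
            if (!mst_work.empty()) { f = std::move(mst_work.front()); mst_work.pop(); }
        }
        if (f) { f(); break; }
        std::this_thread::yield();
    }
    alien.join();
}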
Intermediate Conclusions
• Interactions with Simics limit our ability to exploit parallelism in Ruby and Opal
• Simics is fast without Ruby and/or Opal
– Ruby and Opal in isolation are reasonably fast
– Ruby and Opal cause slowdowns in Simics
• The interactions between the GEMS modules and Simics result in performance loss
Experiment 4 – “NULL” Modules
• To study Simics slowdown, we use "NULL" modules:
– Empty, trivial modules that use interfaces similar to Ruby and Opal
– The modules contribute very little to runtime directly
– Effectively isolates Simics performance from module performance
• NullRUBY( X ) (a minimal sketch follows this list)
– A simple memory timing model, using the same interface as Ruby
– Models a memory with a constant latency (X cycles per access)
• NullOPAL( IPC )
– A trivial processor model, using an interface similar to Opal's
– Steps Simics (with SIM_continue) by IPC instructions per cycle
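A minimal sketch of what NullRUBY(X) is described as (hypothetical code; observe() stands in for the Simics/Ruby timing interface, which is not reproduced here): a timing model that charges a constant X cycles per memory access and does nothing else.

// Constant-latency "memory system": the cheapest possible timing model,
// used to separate the cost of Simics' own behavior from the cost of Ruby.
#include <cstdint>
#include <cstdio>

struct MemTransaction {            // stand-in for one memory access
    std::uint64_t physical_address;
    bool is_write;
};

class NullRuby {
public:
    explicit NullRuby(int latency_cycles) : latency_(latency_cycles) {}

    // Called once per memory access; returns how long the processor stalls.
    int observe(const MemTransaction&) { ++accesses_; return latency_; }

    long accesses() const { return accesses_; }

private:
    int latency_;        // X: constant cycles per access
    long accesses_ = 0;
};

int main() {
    NullRuby model(0);   // NullRUBY(0): logically equivalent to no timing model at all
    MemTransaction t{0x1000, false};
    std::printf("stall = %d cycles over %ld accesses\n",
                model.observe(t), model.accesses());
}

Even this do-nothing model slows Simics noticeably, which is exactly the point of the measurements that follow.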
Experiment 4 – “NULL” Modules
NullRUBY(0) increases execution time by 2x-3x on average.
This is logically equivalent to having no timing model installed.
Experiment 4 – “NULL” Modules
Runtime increases ~linearly (or greater) as memory latency increases.
Processors stalled on memory requests are costly to simulate!
Experiment 4 – “NULL” Modules
Using SIM_continue with a stepping quantum of 10 is 3x-7x faster than the Opal default of 1!
Experiment 4 – “NULL” Modules
Ruby (with simulated memory latency of 300 cycles) slows Simics about as much as NullRUBY(200)
Experiment 4 – “NULL” Modules
In agreement with the pie chart, the runtime of SIM_continue accounts for about half of the Opal+Simics runtime
“NULL” Module Observations
• Simulations are slow because of interactions between Simics, Ruby, and Opal
– T(Simics+Modules) != T(Simics) + T(Modules)
• Little or no speedup is possible from parallelizing Ruby and/or Opal with the current Simics interfaces
• Suggested improvements dramatically affect fidelity of simulations
– Increasing Opal’s step size reduces accuracy
– Optimizing Simics memory stall time requires coarse-grain simulation
Parallelizing Ruby
• Despite overwhelming likelihood of failure, parallelize anyway!
• Obstacles:
– Assumptions of non-concurrency
– Portions of Ruby are auto-generated
– Simics threading hurdles
– 48,059 lines of C++ in 312 separate files.
Parallelizing Ruby
• Final implementation suffers from frequent deadlock
– Fine-grained locking leads to many deadlock opportunities
– Can't always acquire locks in the same order (see the lock-ordering sketch after this list):
» Lock ordering by meaning of protected object: locks have different semantic meanings for different logical events (input vs. output queues)
» Lock ordering by address of the lock: may need to acquire a lock in order to determine which locks are needed
» Lock ordering by simulated chip topology: needs knowledge of "where" a particular event is occurring in the simulated chip
– Coarse-grained locking has worse performance than a single thread
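For reference, here is one of the disciplines listed above, sketched in isolation (hypothetical code, not Ruby's classes): when both locks are known up front, acquiring them in address order makes deadlock between two threads moving events in opposite directions impossible. As the slide notes, the catch in Ruby is that the set of needed locks is often not known until after a lock has been taken.

// Address-ordered acquisition of two queue locks. Assumes &from != &to.
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

struct MessageQueue {
    std::mutex lock;
    // ... queued events ...
};

void transfer(MessageQueue& from, MessageQueue& to) {
    MessageQueue* first = &from;
    MessageQueue* second = &to;
    if (std::less<MessageQueue*>{}(second, first))   // order by address
        std::swap(first, second);
    std::lock_guard<std::mutex> a(first->lock);      // lower address first
    std::lock_guard<std::mutex> b(second->lock);     // higher address second
    // ... move one event from 'from' to 'to' while holding both locks ...
}

int main() {
    MessageQueue a, b;
    std::thread t1([&] { transfer(a, b); });
    std::thread t2([&] { transfer(b, a); });         // opposite direction: no deadlock
    t1.join();
    t2.join();
}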
Parallelizing Ruby
• Occasionally (for very short simulations), no deadlock occurs (soln: coarse-grain locks)
• Some non-determinism, but results are actually quite close to sequential version
• Almost no speedup
JBB-16P-10T (4-thread)
                Average      Worst-Case   Single-Thread
Ruby Cycles     2,633,492    2,633,941    2,633,470
L1 Misses       17,490       17,490       17,490
L2 Misses       27,366       27,367       27,366
Speedup         ~+0.01%      -12%         0.0%
Parallelizing Ruby
• Other challenges:
– Ready()/Operate() pairs violate object-encapsulated synchronization (see the sketch after this list)
» Ready() status may change between calls of Ready() and Operate()
– Fine-grained locking with object-encapsulated synchronization is greatly simplified by Solaris-only lock recursion
» x86-64 pthread libraries on the main simulation machines do not support lock recursion
– Unidentified sharing leads to difficult races
– Interactions with Simics require extreme synchronization
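A sketch of the Ready()/Operate() hazard described above, and one way to close it (hypothetical names, not Ruby's classes): fold the check and the action into a single call that holds the object's lock across both.

// Ready() then Operate() is a classic check-then-act race: the status can
// change between the two calls. TryOperate() does both under one lock.
#include <mutex>
#include <queue>

class Consumer {
public:
    // Racy pattern: Ready() may be true when checked and false by the
    // time Operate() runs, because another thread consumed the message.
    bool Ready() {
        std::lock_guard<std::mutex> g(m_);
        return !msgs_.empty();
    }
    void Operate() {
        std::lock_guard<std::mutex> g(m_);
        msgs_.pop();                 // undefined if the queue emptied meanwhile
    }

    // Race-free alternative: check and act under one lock acquisition.
    bool TryOperate() {
        std::lock_guard<std::mutex> g(m_);
        if (msgs_.empty()) return false;
        msgs_.pop();
        return true;
    }

    void Push(int v) {
        std::lock_guard<std::mutex> g(m_);
        msgs_.push(v);
    }

private:
    std::mutex m_;
    std::queue<int> msgs_;
};

int main() {
    Consumer c;
    c.Push(1);
    while (c.TryOperate()) { /* drained exactly once, race-free */ }
}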
Closing Remarks
• Improvements must be made to Ruby/Simics and Opal/Simics interfaces
• Parallelization of Ruby requires a substantial re-write of Ruby’s event queue and associated classes
– Incorporate knowledge of network topology to provide a lock acquisition order
– Replace the "event" abstraction with an "active object" abstraction, which is race-free (a minimal sketch follows this list)
• Parallel programming is hard
– Chip manufacturers should be worried
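A minimal sketch of the "active object" idea mentioned above (a hypothetical design, not an existing Ruby class): each object owns a private worker thread and a request queue, so its state is only ever touched from one thread and cross-object interaction becomes message passing rather than shared-memory events.

// Active object: callers Post() work; the object's own thread executes it,
// so no external locking of the object's state is ever needed.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() : worker_([this] { Run(); }) {}
    ~ActiveObject() {
        Post([this] { stop_ = true; });   // last message: shut the worker down
        worker_.join();
    }

    // Enqueue work to be executed on this object's own thread.
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        while (!stop_) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !tasks_.empty(); });
            auto task = std::move(tasks_.front());
            tasks_.pop();
            lk.unlock();
            task();                        // runs with exclusive access to members
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;                   // declared last: started after other members
};

int main() {
    ActiveObject bank;                     // e.g., one per simulated cache bank
    bank.Post([] { std::puts("handled on the bank's own thread"); });
}                                          // destructor drains the stop request and joins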
The End
[Closing figure: the Simics / Ruby / Opal (detailed processor model) diagram once more, annotated with question marks. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]