CS 838: Pervasive Parallelism
Profiling and Parallelization of the Multifacet GEMS Simulation Infrastructure
Jake Adriaens [email protected]
Dan Gibson [email protected]
Instructor: Mark D. Hill
Problem
• Simulation is (really) slow!
– Simics alone runs at ~5 MIPS (fast!)
– Add Ruby ~ 50 KIPS
– Add Opal ~ 20 KIPS
• Fast simulations lead to faster evaluation of new ideas.
– Running many simulations in parallel (via Condor, for instance) is great for shrinking error bars, less useful for development.
• Fast simulations useful for educational purposes
– Remember how long it took to simulate HW 5, HW 6?
» Simulations of long-running commercial workloads can take hours or DAYS, even on top-of-the-line hardware
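As a back-of-the-envelope illustration (the 10-billion-instruction workload size is an assumed, illustrative figure, not from the slides), the MIPS/KIPS numbers above translate to:

\text{Ruby attached } (\sim 50\ \text{KIPS}): \quad \frac{10^{10}\ \text{instructions}}{5\times10^{4}\ \text{instr./s}} = 2\times10^{5}\ \text{s} \approx 2.3\ \text{days}

\text{Simics alone } (\sim 5\ \text{MIPS}): \quad \frac{10^{10}\ \text{instructions}}{5\times10^{6}\ \text{instr./s}} = 2\times10^{3}\ \text{s} \approx 33\ \text{minutes}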
More Motivation – Why Parallelize?
Chips currently look like this:
[Die photo: Dual-Core AMD Opteron, showing a couple of cores, on-chip cache, and memory & I/O control. From: Microprocessor Report, Best Servers of 2004]
More Motivation – Why Parallelize?
Soon, chips may look like this:
[Diagram: a chip with eight cores, an on-chip interconnect, and multiple cache ($) banks]

More cores! Many more threads

The free lunch is over: to get speedup out of multithreaded processors, programmers must (for now) implement parallel programs.
Summary
• Good News: Found parallelism in GEMS
– Ruby's event queue often contains independent events
– Opal has some implicit parallelism, as it simulates many logically independent processors
• Bad News: Speedup potential is limited
– In most cases, execution within Simics dominates execution time
– Amdahl's Law suggests parallelization of GEMS will yield only small performance gains
• Good News: Discovered inefficiencies
– The way GEMS uses Simics greatly affects Simics performance
– Isolated troublesome API calls and stalled-processor effects
• Bad News: Simics isn't very thread-friendly
– No thread-safe functionality
– Calling the Simics API requires a (costly) thread switch!
Summary
• More Bad News: Parallelization of Ruby was not (entirely) successful
– Demonstrated little/no performance gain
– Suffers from deadlock
» We have a good excuse for this…
– Nondeterministic
» Fixable, minor effect
– Assumptions of non-concurrent execution
» Ready()/Operate() pairs
What Next?
• Overview of Simics/Ruby/Opal
– Lengthy example
• Profiling Experiments
– Description of profiling experiments
– Results
• Effects Ruby / Opal have on Simics
– "Null" module experiments
• Parallel Ruby
– …and its catastrophic failure
• Observations
• Conclusions
Simics / Ruby / Opal Overview - 1
[Figure: GEMS overview. Ruby (with its event queue, E1-E5) and Opal (the detailed processor model) are Simics loadable modules; drivers include Simics, Opal, a random tester, microbenchmarks (deterministic contended locks), and a trace file. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]
Opal + Ruby + Simics Operation

[Animated walkthrough spanning six slides; graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems. The figure shows Simics, Ruby's event queue (E1-E5), and Opal's detailed processor model stepping through the example program below.]

Example program:

loop:      add  R2 R2 R3
           beqz R2 loop
           add  R1 R2 R3
           sub  R7 R8 R9
           ld   R2 A
           ld   R8 B
           beq  R2 R8 eq
           call my_func1
eq:        ld   R2 A
           beq  R2 R4 eq
           call my_func2

my_func1:  ld   R8 C
           ret

Walkthrough (annotations recovered from the slides):
1. Install Module (Ruby and Opal); Start Sim; Opal issues Instruction Fetches; I-Fetch Complete.
2. The fetched instructions are decoded (API Calls for Decoding) while further Instruction Fetches proceed.
3. Step 1 Instr.: Opal asks Simics to step the leading instruction.
4. Step 3 Instrs.: the loads ld A and ld B enter Ruby's memory system.
5. The loads complete (A=1, B=1); ld C follows.
6. Step 4 Instrs.; another I-Fetch call. Simple, Right?
Finding Parallelism
• Lots of parallelism opportunities in the example!
– Ruby/Opal (as described) could be run by separate threads!
– Ruby is a discrete event simulator…
» Can we apply Fujimoto's PDES strategies directly?
• Places we found parallelism:
– Ruby's Event Queue (Experiment 1)
– Opal in general, on a per-processor basis (Experiment 2)
– Modular structure (not explored)
• But how much speedup can we gain through parallelism?
Experiment 1: Ruby’s Event Queue
• Ruby is already a discrete event simulator (DES)
– Making it a parallel DES (PDES) à la Fujimoto might be a way to speed things up! (a minimal sketch of the idea follows this list)
– Already has an implicit lookahead of 1, due to existing event scheduling constraints
• How many events are available for processing in a given cycle of the event queue?
– Too few could limit lookahead properties
• How long does a typical event execute?
– Short events could make the queue itself a bottleneck
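A minimal sketch of the PDES-style idea (illustrative code only; EventQueue and its API are hypothetical stand-ins for Ruby's queue, not GEMS code): every event scheduled for the current cycle is executed concurrently, and the lookahead of 1 means an event may only schedule work for a later cycle, so same-cycle events never create each other.

// Discrete event queue that drains one cycle at a time and runs that
// cycle's events in parallel. Events may schedule new events, but only
// for a later cycle (lookahead of 1).
#include <cstdio>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class EventQueue {
public:
    using Event = std::function<void(EventQueue&)>;

    void schedule(long cycle, Event e) {
        std::lock_guard<std::mutex> g(mutex_);   // workers may schedule concurrently
        queue_.emplace(cycle, std::move(e));
    }

    long now() const { return now_; }

    void run() {
        while (!queue_.empty()) {
            now_ = queue_.begin()->first;
            std::vector<Event> batch;            // drain everything in the current cycle
            for (auto it = queue_.begin();
                 it != queue_.end() && it->first == now_;
                 it = queue_.erase(it))
                batch.push_back(std::move(it->second));
            std::vector<std::thread> workers;    // run same-cycle events concurrently
            for (auto& e : batch)
                workers.emplace_back([&e, this] { e(*this); });
            for (auto& t : workers) t.join();
        }
    }

private:
    std::mutex mutex_;
    std::multimap<long, Event> queue_;
    long now_ = 0;
};

int main() {
    EventQueue q;
    for (int i = 0; i < 4; ++i)                  // four independent events in cycle 0
        q.schedule(0, [i](EventQueue& q) {
            std::printf("event %d ran in cycle %ld\n", i, q.now());
            if (i == 0)                          // lookahead of 1: next cycle at the earliest
                q.schedule(q.now() + 1, [](EventQueue& q) {
                    std::printf("follow-up in cycle %ld\n", q.now());
                });
        });
    q.run();
}

Whether this pays off depends on exactly the two questions above: how many events share a cycle, and how long each one runs relative to the queue overhead.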
Results 1 – Ruby’s Event Queue
[Charts: distributions of per-cycle event counts and of event durations, each as a percentage of all events]

Simulation Time: 74.0 s
Time in Event Queue: 15.4 s
Perf. Improvement w/ 8x Speedup: 1.22
Perf. Improvement w/ Infinite Speedup: 1.26

Simics Time = SimTime - RubyTime ≈ 80% of the total
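A worked Amdahl's Law check of the two rightmost figures above, treating the event-queue time as the parallelizable fraction:

p = \frac{15.4\ \text{s}}{74.0\ \text{s}} \approx 0.21, \qquad
S(8) = \frac{1}{(1-p) + p/8} \approx 1.22, \qquad
S(\infty) = \frac{1}{1-p} \approx 1.26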
Experiment 2 – Opal’s Per-Processor Parallelism
• Opal simulates multiple logically independent processors
– Simulated processor independence => Parallelism
• Use one thread per simulated processor? (a minimal sketch follows this list)
– Raises work imbalance issues
» In practice, the work imbalance is tolerable
– Processors are only logically independent
» A common sequential bottleneck, Simics, is shared between all Opal processors
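A minimal sketch of the one-thread-per-simulated-processor idea (illustrative only, not Opal's code; ProcModel is a hypothetical stand-in, and synchronization with Simics is deliberately left out since Simics is the shared sequential bottleneck):

// One worker thread per simulated processor; each advances its own model
// independently. Anything that must go through Simics stays on the main
// thread, which is where the sequential bottleneck would reappear.
#include <cstdio>
#include <thread>
#include <vector>

struct ProcModel {
    int id = 0;
    long retired = 0;
    void simulate(long instructions) {
        // Per-processor work: independent of every other processor.
        for (long i = 0; i < instructions; ++i) ++retired;
    }
};

int main() {
    constexpr int kProcs = 4;
    std::vector<ProcModel> procs(kProcs);
    std::vector<std::thread> workers;

    for (int i = 0; i < kProcs; ++i) {
        procs[i].id = i;
        // Work imbalance shows up naturally: give each processor a
        // different amount of work, as real workloads would.
        workers.emplace_back([&p = procs[i], i] { p.simulate(1000 * (i + 1)); });
    }
    for (auto& t : workers) t.join();   // main thread: where Simics calls would serialize

    for (const auto& p : procs)
        std::printf("processor %d retired %ld instructions\n", p.id, p.retired);
}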
Experiment 2 – Opal’s Per-Processor Parallelism
[Pie chart: execution time in an Opal+Simics simulation, broken down into Opal itself, the SIM_continue API call, the SIM_read_phys_memory API call, the SIM_break_simulation API call, and all other API calls]

Best Parallel Opal Speedup <= 40%!
Experiment 2 – Opal’s Per-Processor Parallelism
• Why is SIM_continue so slow?
– Opal uses SIM_continue to logically progress the simulation by a small number (1-4) of instructions at a time
– SIM_continue performs extensive start-up and tear-down optimization, expecting large (10,000+) step sizes
– Increasing Opal's stepping size decreases total SIM_continue time significantly, but makes fine-grained simulation difficult
• Why is SIM_read_phys_memory so slow?
– One call to SIM_read_phys_memory costs ~1 us of execution time
» Reads from a proprietary-format compressed file
– Used by Opal once for every load instruction
» Loads are quite frequent!
Experiment 3 – Simics API Calls
• Can there be more bad news?
– Yes.
• How does Simics react to alien threads using its API?
Thread 5 returned from Simics.
patch PC: 0x1034e68 0x1034e64
*** ASSERTION ERROR:
in line 7530
of file 'v9_service_routines_1.c'
with RCSID '@(#) $Id: v9.sg,v 33.0.2.31 2004/10/08 12:23:07 am Exp $'
Please report this.
Simics will now self-signal an abort.
patch NPC: 0x1034e6c 0x1034e68
*** Simics getting shaky, switching to 'safe' mode.
*** Simics (thread 31) received an abort signal, probably an assertion.
*** thread 31 exiting.
(The first line above is our thread's output, having just returned from an API call. Something our thread does crashes one of the Simics threads!)
Experiment 3 – Simics API Calls
• Simics forbids calling the Simics API from alien threads
– SIM_thread_safe_callback is the only mechanism for using the interface from other threads (a sketch of the resulting pattern follows the table)
» Slow (see table)
» Non-blocking
» Must have released the "Main Simics Thread" (MST)

Call Type                       First Call Latency (cycles)   Median Call Latency (cycles)
From MST                        1765                          100
From Spawned Thread             845565725                     3212
Spawned Thread, MST Inf. Loop   ∞                             ∞
Normal Func. Call               122                           105
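A sketch of the pattern this forces on alien threads (illustrative only; queue_on_mst() is a hypothetical stand-in for SIM_thread_safe_callback, whose exact signature is not reproduced here, and the blocking wait is ours to add since the real call is non-blocking):

// The alien thread may not call the API itself: it hands a callback to the
// main Simics thread (MST) and blocks until the MST has executed it. The
// round trip through the MST is what makes the spawned-thread latencies in
// the table so large.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

std::mutex mst_mu;
std::queue<std::function<void()>> mst_work;   // work the MST will drain

// Stand-in for SIM_thread_safe_callback: enqueue work for the main thread.
void queue_on_mst(std::function<void()> f) {
    std::lock_guard<std::mutex> g(mst_mu);
    mst_work.push(std::move(f));
}

int main() {
    std::mutex mu;
    std::condition_variable cv;
    bool done = false;
    int result = 0;

    // Alien thread: wants the result of an "API call".
    std::thread alien([&] {
        queue_on_mst([&] {                       // runs later, on the MST
            int r = 42;                          // imagine the real API call here
            { std::lock_guard<std::mutex> g(mu); result = r; done = true; }
            cv.notify_one();
        });
        std::unique_lock<std::mutex> lk(mu);     // block until the MST ran it:
        cv.wait(lk, [&] { return done; });       // this wait is the costly part
        std::printf("alien thread got %d\n", result);
    });

    // "Main Simics thread": must be free (not in an infinite loop) to drain
    // queued callbacks, otherwise the alien thread waits forever.
    while (true) {
        std::function<void()> f;
        {
            std::lock_guard<std::mutex> g(mst_mu);
            if (!mst_work.empty()) { f = std::move(mst_work.front()); mst_work.pop(); }
        }
        if (f) { f(); break; }
        std::this_thread::yield();
    }
    alien.join();
}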
Intermediate Conclusions
• Interactions with Simics limit our ability to exploit parallelism in Ruby and Opal
• Simics is fast without Ruby and/or Opal
– Ruby and Opal in isolation are reasonably fast
– Ruby and Opal cause slowdowns in Simics
• The interactions between the GEMS modules and Simics result in performance loss
Experiment 4 – “NULL” Modules
• To study Simics slowdown, we use "NULL" modules:
– Empty, trivial modules that use interfaces similar to Ruby and Opal
– The modules contribute very little to runtime directly
– Effectively isolates Simics performance from module performance
• NullRUBY( X ) (a minimal sketch follows this list)
– A simple memory timing model, using the same interface as Ruby
– Models a memory with a constant latency (X cycles per access)
• NullOPAL( IPC )
– A trivial processor model, using an interface similar to Opal's
– Steps Simics (with SIM_continue) by IPC instructions per cycle
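A minimal sketch of what NullRUBY(X) is described as (hypothetical code; observe() stands in for the Simics/Ruby timing interface, which is not reproduced here): a timing model that charges a constant X cycles per memory access and does nothing else.

// Constant-latency "memory system": the cheapest possible timing model,
// used to separate the cost of Simics' own behavior from the cost of Ruby.
#include <cstdint>
#include <cstdio>

struct MemTransaction {            // stand-in for one memory access
    std::uint64_t physical_address;
    bool is_write;
};

class NullRuby {
public:
    explicit NullRuby(int latency_cycles) : latency_(latency_cycles) {}

    // Called once per memory access; returns how long the processor stalls.
    int observe(const MemTransaction&) { ++accesses_; return latency_; }

    long accesses() const { return accesses_; }

private:
    int latency_;        // X: constant cycles per access
    long accesses_ = 0;
};

int main() {
    NullRuby model(0);   // NullRUBY(0): logically equivalent to no timing model at all
    MemTransaction t{0x1000, false};
    std::printf("stall = %d cycles over %ld accesses\n",
                model.observe(t), model.accesses());
}

Even this do-nothing model slows Simics noticeably, which is exactly the point of the measurements that follow.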
Experiment 4 – “NULL” Modules
NullRUBY(0) increases execution time by 2x-3x on average.
This is logically equivalent to having no timing model installed.
Experiment 4 – “NULL” Modules
Runtime increases ~linearly (or greater) as memory latency increases.
Processors stalled on memory requests are costly to simulate!
Experiment 4 – “NULL” Modules
Using SIM_continue with a stepping quantum of 10 is 3x-7x faster than the Opal default of 1!
Experiment 4 – “NULL” Modules
Ruby (with simulated memory latency of 300 cycles) slows Simics about as much as NullRUBY(200)
Experiment 4 – “NULL” Modules
In agreement with the pie chart, the runtime of SIM_continue accounts for about half of the Opal+Simics runtime
“NULL” Module Observations
• Simulations are slow because of interactions between Simics, Ruby, and Opal
– T(Simics+Modules) != T(Simics) + T(Modules)
• Little or no speedup is possible from parallelizing Ruby and/or Opal with the current Simics interfaces
• Suggested improvements dramatically affect fidelity of simulations
– Increasing Opal’s step size reduces accuracy
– Optimizing Simics memory stall time requires coarse-grain simulation
Parallelizing Ruby
• Despite overwhelming likelihood of failure, parallelize anyway!
• Obstacles:
– Assumptions of non-concurrency
– Portions of Ruby are auto-generated
– Simics threading hurdles
– 48,059 lines of C++ in 312 separate files.
Parallelizing Ruby
• Final implementation suffers from frequent deadlock
– Fine-grained locking leads to many deadlock opportunities
– Can't always acquire locks in the same order (see the lock-ordering sketch after this list):
» Lock ordering by meaning of protected object: locks have different semantic meanings for different logical events (input vs. output queues)
» Lock ordering by address of the lock: may need to acquire a lock in order to determine which locks are needed
» Lock ordering by simulated chip topology: needs knowledge of "where" a particular event is occurring in the simulated chip
– Coarse-grained locking has worse performance than a single thread
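For reference, here is one of the disciplines listed above, sketched in isolation (hypothetical code, not Ruby's classes): when both locks are known up front, acquiring them in address order makes deadlock between two threads moving events in opposite directions impossible. As the slide notes, the catch in Ruby is that the set of needed locks is often not known until after a lock has been taken.

// Address-ordered acquisition of two queue locks. Assumes &from != &to.
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

struct MessageQueue {
    std::mutex lock;
    // ... queued events ...
};

void transfer(MessageQueue& from, MessageQueue& to) {
    MessageQueue* first = &from;
    MessageQueue* second = &to;
    if (std::less<MessageQueue*>{}(second, first))   // order by address
        std::swap(first, second);
    std::lock_guard<std::mutex> a(first->lock);      // lower address first
    std::lock_guard<std::mutex> b(second->lock);     // higher address second
    // ... move one event from 'from' to 'to' while holding both locks ...
}

int main() {
    MessageQueue a, b;
    std::thread t1([&] { transfer(a, b); });
    std::thread t2([&] { transfer(b, a); });         // opposite direction: no deadlock
    t1.join();
    t2.join();
}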
Parallelizing Ruby
• Occasionally (for very short simulations), no deadlock occurs (soln: coarse-grain locks)
• Some non-determinism, but results are actually quite close to sequential version
• Almost no speedup
JBB-16P-10T (4-thread)
                Average      Worst-Case   Single-Thread
Ruby Cycles     2,633,492    2,633,941    2,633,470
L1 Misses       17,490       17,490       17,490
L2 Misses       27,366       27,367       27,366
Speedup         ~+0.01%      -12%         0.0%
Parallelizing Ruby
• Other challenges:
– Ready()/Operate() pairs violate object-encapsulated synchronization (see the sketch after this list)
» Ready() status may change between calls of Ready() and Operate()
– Fine-grained locking with object-encapsulated synchronization is greatly simplified by Solaris-only lock recursion
» x86-64 pthread libraries on the main simulation machines do not support lock recursion
– Unidentified sharing leads to difficult races
– Interactions with Simics require extreme synchronization
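A sketch of the Ready()/Operate() hazard described above, and one way to close it (hypothetical names, not Ruby's classes): fold the check and the action into a single call that holds the object's lock across both.

// Ready() then Operate() is a classic check-then-act race: the status can
// change between the two calls. TryOperate() does both under one lock.
#include <mutex>
#include <queue>

class Consumer {
public:
    // Racy pattern: Ready() may be true when checked and false by the
    // time Operate() runs, because another thread consumed the message.
    bool Ready() {
        std::lock_guard<std::mutex> g(m_);
        return !msgs_.empty();
    }
    void Operate() {
        std::lock_guard<std::mutex> g(m_);
        msgs_.pop();                 // undefined if the queue emptied meanwhile
    }

    // Race-free alternative: check and act under one lock acquisition.
    bool TryOperate() {
        std::lock_guard<std::mutex> g(m_);
        if (msgs_.empty()) return false;
        msgs_.pop();
        return true;
    }

    void Push(int v) {
        std::lock_guard<std::mutex> g(m_);
        msgs_.push(v);
    }

private:
    std::mutex m_;
    std::queue<int> msgs_;
};

int main() {
    Consumer c;
    c.Push(1);
    while (c.TryOperate()) { /* drained exactly once, race-free */ }
}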
Closing Remarks
• Improvements must be made to Ruby/Simics and Opal/Simics interfaces
• Parallelization of Ruby requires a substantial re-write of Ruby’s event queue and associated classes
– Incorporate knowledge of network topology to provide a lock acquisition order
– Replace the "event" abstraction with an "active object" abstraction, which is race-free (a minimal sketch follows this list)
• Parallel programming is hard
– Chip manufacturers should be worried
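A minimal sketch of the "active object" idea mentioned above (a hypothetical design, not an existing Ruby class): each object owns a private worker thread and a request queue, so its state is only ever touched from one thread and cross-object interaction becomes message passing rather than shared-memory events.

// Active object: callers Post() work; the object's own thread executes it,
// so no external locking of the object's state is ever needed.
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class ActiveObject {
public:
    ActiveObject() : worker_([this] { Run(); }) {}
    ~ActiveObject() {
        Post([this] { stop_ = true; });   // last message: shut the worker down
        worker_.join();
    }

    // Enqueue work to be executed on this object's own thread.
    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        while (!stop_) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !tasks_.empty(); });
            auto task = std::move(tasks_.front());
            tasks_.pop();
            lk.unlock();
            task();                        // runs with exclusive access to members
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stop_ = false;
    std::thread worker_;                   // declared last: started after other members
};

int main() {
    ActiveObject bank;                     // e.g., one per simulated cache bank
    bank.Post([] { std::puts("handled on the bank's own thread"); });
}                                          // destructor drains the stop request and joins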
The End
[Closing figure: the Simics / Ruby / Opal (detailed processor model) diagram once more, annotated with question marks. Graphics borrowed from GEMS ISCA-32 Tutorial Presentation, www.cs.wisc.edu/gems]