Compactly Representing Parallel Program Executions
Ankit Goel, Abhik Roychoudhury, Tulika Mitra
National University of Singapore


Page 1: Compactly Representing Parallel Program Executions

Compactly Representing Parallel Program Executions

Ankit Goel, Abhik Roychoudhury, Tulika Mitra

National University of Singapore

Page 2: Compactly Representing Parallel Program Executions

Path profiles

Profiling a program's execution
– Count based
– Path based

Count based profiles are more aggregate
– # of executions of the program's basic blocks
– # of accesses of various memory locations

Path based profiles are more accurate
– Sequence of basic blocks executed
– Sequence of memory locations accessed

Use online compression to generate compact path profiles.
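As a toy illustration (not from the slides) of why path profiles are more accurate than count profiles: a count profile aggregates block frequencies, so two different execution orders can collapse to the same counts, while a path profile preserves the order.

```python
# Toy illustration: count-based profiles aggregate, path-based profiles
# preserve order. Block ids "1", "2", "3" are hypothetical.
from collections import Counter

executed_blocks = ["1", "2", "3", "1", "2", "3"]  # trace of basic-block ids

count_profile = Counter(executed_blocks)  # {'1': 2, '2': 2, '3': 2}
path_profile = list(executed_blocks)      # order retained

# A different path with the same block counts is indistinguishable
# under the count profile, but not under the path profile.
other = ["1", "2", "1", "2", "3", "3"]
assert Counter(other) == count_profile
assert other != path_profile
```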

Page 3: Compactly Representing Parallel Program Executions

Organization

Compressed Path Profiles in Sequential Programs

Parallel Program Path Profiles

Compression Efficiency and Overheads

Data race detection over path profiles

Page 4: Compactly Representing Parallel Program Executions

Compressed Path - Example

(Figure: control flow graph with basic blocks 1, 2, 3.)

Uncompressed Path
123123

Compressed Representation
S -> A A
A -> 123

Page 5: Compactly Representing Parallel Program Executions

Online Path Compression

A program path is a string over a finite alphabet

Alphabet decided by what we instrument
– Control flow (basic blocks executed)
– Data flow (memory locations accessed)

A string s is represented by a Context Free Grammar Gs: the language of Gs is {s}

Construction of Gs is online and not post-mortem
– Start with the trivial grammar & modify it for each symbol

No recursive rules (DAG representation)

Compression scheme – Nevill-Manning & Witten 97
– Application to program paths – Larus 99

Page 6: Compactly Representing Parallel Program Executions

Online Compression in action

Path Executed    Compressed Representation
1                S -> 1
12               S -> 12
123              S -> 123
1231             S -> 1231
12312            S -> A3A
                 A -> 12

Page 7: Compactly Representing Parallel Program Executions

Online Compression in action

Path Executed    Compressed Representation
123123           S -> A3A3
                 A -> 12

                 (digram A3 repeats, so)
                 S -> BB
                 B -> A3
                 A -> 12

                 (A is now used only once, so it is inlined)
                 S -> BB
                 B -> 123
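The two rules driving the steps above — replace any repeated digram with a nonterminal, and inline any rule used only once — can be sketched as follows. This is a minimal sketch in the style of SEQUITUR (Nevill-Manning & Witten '97), not the authors' implementation: unlike full SEQUITUR it only checks for digram repetition within a single rule body, and it assumes path symbols never collide with the generated rule names "A", "B", ….

```python
def digram_positions(body, digram):
    """Non-overlapping positions where `digram` occurs in `body`."""
    out, i = [], 0
    while i < len(body) - 1:
        if (body[i], body[i + 1]) == digram:
            out.append(i)
            i += 2  # skip past this occurrence so overlaps do not count
        else:
            i += 1
    return out

class GrammarCompressor:
    def __init__(self):
        self.rules = {"S": []}  # start rule derives the path seen so far
        self.counter = 0        # for generating rule names A, B, C, ...

    def feed(self, symbol):
        """Online step: append one path symbol, then restore both invariants."""
        self.rules["S"].append(symbol)
        while self._replace_repeated_digram() or self._inline_once_used():
            pass

    def _replace_repeated_digram(self):
        # Digram uniqueness: no adjacent pair may occur twice in a rule body.
        for name, body in self.rules.items():
            for i in range(len(body) - 1):
                digram = (body[i], body[i + 1])
                if len(digram_positions(body, digram)) < 2:
                    continue
                # Reuse an existing rule with exactly this body, else create one.
                nt = next((r for r, b in self.rules.items()
                           if r != name and tuple(b) == digram), None)
                if nt is None:
                    nt = chr(ord("A") + self.counter)
                    self.counter += 1
                    self.rules[nt] = list(digram)
                new_body, j = [], 0
                while j < len(body):
                    if j + 1 < len(body) and (body[j], body[j + 1]) == digram:
                        new_body.append(nt)
                        j += 2
                    else:
                        new_body.append(body[j])
                        j += 1
                self.rules[name] = new_body
                return True
        return False

    def _inline_once_used(self):
        # Rule utility: a rule referenced only once is inlined and deleted.
        for name in list(self.rules):
            if name == "S":
                continue
            if sum(body.count(name) for body in self.rules.values()) == 1:
                host = next(r for r, b in self.rules.items() if name in b)
                i = self.rules[host].index(name)
                self.rules[host][i:i + 1] = self.rules[name]
                del self.rules[name]
                return True
        return False

    def expand(self, name="S"):
        """Decompress: the grammar's language is exactly the original path."""
        return "".join(self.expand(s) if s in self.rules else s
                       for s in self.rules[name])

g = GrammarCompressor()
for sym in "123123":
    g.feed(sym)
print(g.rules)  # -> {'S': ['B', 'B'], 'B': ['1', '2', '3']}
```

Feeding "12312" reproduces the state on the previous slide (S -> A3A, A -> 12), and the sixth symbol triggers exactly the two steps shown here: A3A3 collapses to BB via B -> A3, after which A is inlined to give B -> 123.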

Page 8: Compactly Representing Parallel Program Executions

Organization

Compressed Path Profiles in Sequential Programs

Parallel Program Path Profiles

Compression Efficiency and Overheads

Data race detection over path profiles

Page 9: Compactly Representing Parallel Program Executions

What to represent?

Control/data flow in each program thread

Communication among threads
– Synchronization (locks, barriers)
– Unsynchronized shared variable accesses

Too costly to observe/record the order of all shared variable accesses

We will represent
– Compressed flow in each thread (via Grammar)
– Communication via synchronizations (How?)

Page 10: Compactly Representing Parallel Program Executions

Synchronization Pattern (Locks)

(Message Sequence Chart (MSC) figure: processes P1, P2 and Memory; P1 performs lock, Compute, unlock against Memory, then P2 performs lock, unlock.)

Pgm = P1 || P2

Page 11: Compactly Representing Parallel Program Executions

Synchronization Pattern (Barrier)

(MSC figure: processes P1, P2 and Memory; each thread sends ready to Memory, is Blocked, receives go, and then Computes.)

Pgm = P1 || P2

Page 12: Compactly Representing Parallel Program Executions

Connection to MSCs

Partial Order of MSC matches Observed Ordering
• Total order in each thread
• Ordering across threads visible via synchronization (msg. exchange)

All synchronization ops. form a total order

(MSC figure: Th. 1, Th. 2 and Shared Mem., with lock/unlock operations exchanged as messages.)

Page 13: Compactly Representing Parallel Program Executions

A first cut

Instrument each thread to observe local control/data flow and global synch.

Represent path profile of P1 || P2
– Each thread's flow as a grammar – (G1, G2); contains synch. ops. as well
– All synchronization ops. as a list
– Associate entries in this list to the occurrences of synch. ops. in (G1, G2)

How to navigate the path profile?
– Zoom in to a specific lock-unlock segment of P1

Page 14: Compactly Representing Parallel Program Executions

Edge annotations

Thread path: a, b (lock), c (unlock), x, b (lock), c (unlock), y

(Figure: grammar for one thread as a DAG, S -> a A x A y with A -> b c; the edges carry annotations 0, 2 / 0, 1 / 2, 4 counting synchronization operations.)

Page 15: Compactly Representing Parallel Program Executions

Locating synch. operations

(Figure: the same annotated DAG, S -> a A x A y with A -> b c and edge counts 0, 2 / 0, 1 / 2, 4, shown alongside a global list X, Y, … of n synch ops.)

Locating the 3rd synchronization operation

Can find synch. segments by looking up the global list.
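One way the count annotations enable this lookup can be sketched as follows. This is a hypothetical sketch, not the authors' data structure: each grammar node carries the number of synchronization operations in the subpath it derives, so locating the k-th synch op is a single root-to-leaf descent through the (shared) DAG, with no decompression.

```python
class Node:
    """Grammar DAG node annotated with a synchronization-operation count."""
    def __init__(self, symbol, children=(), is_sync=False):
        self.symbol = symbol
        self.children = list(children)  # children may be shared (DAG)
        self.is_sync = is_sync
        # Annotation: total synch ops in the subpath this node derives.
        self.sync_count = (1 if is_sync else 0) + \
            sum(c.sync_count for c in children)

def locate_kth_sync(node, k):
    """Root-to-leaf path of symbols reaching the k-th (1-based) synch op."""
    assert 1 <= k <= node.sync_count
    if not node.children:
        return [node.symbol]
    for child in node.children:
        if k <= child.sync_count:          # k-th op lies under this child
            return [node.symbol] + locate_kth_sync(child, k)
        k -= child.sync_count              # skip this child's synch ops
    raise AssertionError("unreachable: k exceeded node.sync_count")

# The slide's example: path a b c x b c y (b = lock, c = unlock),
# grammar S -> a A x A y with A -> b c; both A occurrences share one node.
a, x, y = Node("a"), Node("x"), Node("y")
b = Node("b", is_sync=True)
c = Node("c", is_sync=True)
A = Node("A", [b, c])
S = Node("S", [a, A, x, A, y])
print(locate_kth_sync(S, 3))  # -> ['S', 'A', 'b'], the second lock
```

The descent inspects at most one child list per level, so the cost is proportional to the grammar's depth rather than the path length.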

Page 16: Compactly Representing Parallel Program Executions

So far

Control flow of each thread stored as a grammar

Synchronization ops. form a global list

Grammar of each thread annotated with counts
– Easy searching of synchronization operations

What about shared data accesses?

Sequence of memory locations accessed by a single LD/ST instruction can be compressed
– Use a grammar representation for this sequence as well

Page 17: Compactly Representing Parallel Program Executions

Further compression

Locations accessed by a memory operation
– 10, 14, 18, 22, 26, 54, 58, 62, 66, 70, 98

Online compression of the string as a grammar
– 10(1), 4(4), 28(1), 4(4), 28(1)
– Difference representation + run-length encoding

Useful for detecting regularity of array accesses
– Sweep through an array: a run of constant diffs
– Accessing a sub-grid of a multidimensional array
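The difference-plus-run-length step above can be sketched directly (a minimal illustration, not the slides' implementation): store the first address, take successive differences, and run-length encode the result as (value, run) pairs.

```python
def diff_rle(addrs):
    """Encode an address sequence as (value, run_length) pairs over
    [first_address] + successive differences."""
    diffs = [addrs[0]] + [b - a for a, b in zip(addrs, addrs[1:])]
    runs = []
    for d in diffs:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([d, 1])    # start a new run
    return [tuple(r) for r in runs]

# The slide's example: an array sweep with stride 4 and two jumps of 28.
addrs = [10, 14, 18, 22, 26, 54, 58, 62, 66, 70, 98]
print(diff_rle(addrs))
# -> [(10, 1), (4, 4), (28, 1), (4, 4), (28, 1)]
```

A sweep through an array shows up as one long run of a constant difference, and a sub-grid walk of a multidimensional array shows up as a short repeating pattern of runs, which is exactly what a grammar over this run string compresses well.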

Page 18: Compactly Representing Parallel Program Executions

Organization

Compressed Path Profiles in Sequential Programs

Parallel Program Path Profiles

Compression Efficiency and Overheads

Data race detection over path profiles

Page 19: Compactly Representing Parallel Program Executions

Any better than gzip?

(Bar chart: compression % with 2 processors for FFT, LU, Mp3d, Water and SOR, comparing our Grammar scheme against gzip; y-axis 0–12%.)

Page 20: Compactly Representing Parallel Program Executions

Scalability of Compression

(Bar chart: compression % for our scheme with 2, 4 and 8 processors, for FFT, LU, Mp3d, Water and SOR; y-axis 0–10%.)

Page 21: Compactly Representing Parallel Program Executions

Concerns about Timing Overheads

Our scheme does not add substantial time overhead over grammar based string compression

Our experiments conducted using RSIM
– Tracing overheads can be higher in a real multiprocessor
– Can tracing distort program behavior?

Possible solution
– Trace the minimal number of operations in a parallel program execution (Netzer 1993) to ensure deterministic replay
– Collect the compressed path profile during replay

Page 22: Compactly Representing Parallel Program Executions

Organization

Compressed Path Profiles in Sequential Programs

Parallel Program Path Profiles

Compression Efficiency and Overheads

Data race detection over path profiles

Page 23: Compactly Representing Parallel Program Executions

Apparent Data races

(MSC figure: Th. 1, Th. 2, Th. 3 and Mem., with lock/unlock operations from the three threads interleaving at memory.)

• Last unlock in Th. 1 (first unlock)
• Next lock in Th. 1 (second lock)
• Locate root-to-leaf paths of these ops.
• Tree rooted at the least common ancestor of these ops.

No decompression of the grammar of Th. 1

Page 24: Compactly Representing Parallel Program Executions

Data race artifacts

Sub := 1
A[1] := 0
X := Sub;
Y := A[X]  (artifact)

X decides which addr. is accessed in Y := A[X]

X is set by Sub := 1, which is also in a data race.

Detecting artifacts requires data-flow
– Not captured by rd/wr sets in synch. segments
– Captured in our compact path profiles.

Page 25: Compactly Representing Parallel Program Executions

Summary

Compressed representation of the execution profile of shared memory parallel programs
– Control and shared data flow per thread
– Synchronization patterns across threads

Overall compression efficiency 0.25% – 9.81%

Compression efficiency scalable with increasing number of processors

Application: post-mortem debugging such as detecting data races

Page 26: Compactly Representing Parallel Program Executions

Other Applications

We do not capture the actual order of unsynchronized shared memory accesses across processors

Can be useful in making architectural decisions such as choice of cache coherence protocol

Sufficient to maintain [Netzer 1993]
– transitive reduction of program order on each proc.
– shared variable conflict orders

Can we capture the transitive reduction relation via annotations of WPP edges?