Galois Performance Mario Mendez-Lojo Donald Nguyen


Galois Performance

Mario Mendez-Lojo, Donald Nguyen


Overview

• Galois system is a test bed to explore optimizations
  – Safe but not fast out of the box

• Important optimizations
  – Select least transactional overhead
  – Select right scheduling
  – Select appropriate data structure

• Quantify optimizations on applications


Algorithms

[Figure: classification of irregular algorithms]

• topology: general graph, grid, tree
• operator: morph, local computation, reader
• ordering: unordered, ordered

1. Barnes-Hut

2. Delaunay Mesh Refinement

3. Preflow-push


Methodology

[Figure: execution profile, threads vs. time, with time split into Idle, Serial, GC, and Compute]

• Abort ratio: aborted iterations / total iterations

• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1
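The GC share of the profile can be read from the JVM itself. Below is a minimal sketch, assuming the standard `java.lang.management` beans are an acceptable source for GC time; the deck does not say how GC time was collected, and the class name `GcTimeProbe` is made up for illustration. The options listed above are HotSpot settings, presumably passed on the command line as `-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=1`.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeProbe {
    /** Total time in milliseconds the JVM reports spending in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalGcMillis();
        // ... run the measured kernel here ...
        long after = totalGcMillis();
        System.out.println("GC time during kernel (ms): " + (after - before));
    }
}
```

Sampling before and after the measured region gives the GC time attributable to that region.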


Terms

• Base
  – Default scheduling, default graph

• Serial
  – Galois classes replaced by classes with no concurrency control

• Speedup
  – Relative to the best mean performance of a serial variant

• Throughput
  – # serial iterations / time
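The derived metrics are simple ratios. The sketch below spells them out and rechecks the speedups reported in the results slides of this deck; the class and method names are illustrative only, and the throughput definition is read literally from the bullet above.

```java
final class PerformanceTerms {
    /** Speedup relative to the best serial variant. */
    static double speedup(double bestSerialMs, double parallelMs) {
        return bestSerialMs / parallelMs;
    }

    /** Throughput as defined in the deck: number of serial iterations divided by time. */
    static double throughput(long serialIterations, double timeMs) {
        return serialIterations / timeMs;
    }

    /** Abort ratio: aborted iterations over total iterations. */
    static double abortRatio(long aborted, long total) {
        return total == 0 ? 0.0 : (double) aborted / total;
    }

    public static void main(String[] args) {
        // Serial and parallel times (ms) as reported in the results slides.
        System.out.println("Barnes-Hut:   " + speedup(10271, 1553));
        System.out.println("DMR:          " + speedup(17002, 3745));
        System.out.println("Preflow-push: " + speedup(57121, 18242));
    }
}
```

Rounded to one decimal place, the three ratios agree with the reported 6.6X, 4.5X, and 3.1X.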


Numbers

• Runtime
  – Last of 5 runs in same VM
  – Ignore time to read and construct initial graph

• Other statistics
  – Last of 5 runs
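A measurement loop in this style might look like the following sketch. It is not the authors' harness: `LastOfNTimer`, `buildInput`, and `kernel` are hypothetical names, and rebuilding the input before each run is an assumption (a kernel such as mesh refinement modifies its input). Reporting the last run in the same VM presumably lets JIT compilation warm up first.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

final class LastOfNTimer {
    /**
     * Runs the kernel `runs` times in this VM and returns the elapsed time of the
     * last run in nanoseconds. Input construction is kept outside the timed region.
     */
    static <G> long timeLastRunNanos(Supplier<G> buildInput, Consumer<G> kernel, int runs) {
        long last = 0;
        for (int i = 0; i < runs; i++) {
            G input = buildInput.get();      // not timed: read and construct the initial graph
            long start = System.nanoTime();
            kernel.accept(input);            // timed: the algorithm itself
            last = System.nanoTime() - start;
        }
        return last;
    }

    public static void main(String[] args) {
        long nanos = timeLastRunNanos(
                () -> new int[1_000_000],
                data -> java.util.Arrays.sort(data),
                5);
        System.out.println("Last of 5 runs (ms): " + nanos / 1_000_000.0);
    }
}
```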


Test Environment

• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size


BARNES-HUT

Most Distant Galaxy Candidates in the Hubble Ultra Deep Field


Barnes-Hut

• N-body algorithm
  – Oct-tree acceleration structure
  – Serial
    • Tree build, center of mass, particle update
  – Parallel
    • Force computation

• Structure
  – Reader on tree

• Variants
  – Splash2, Reader Galois


Reader Optimization

base:

child = octree.getNeighbor(nn, 1);

reader optimized:

child = octree.getNeighbor(nn, 1, MethodFlag.NONE);


ParaMeter Profile


Barnes-Hut Results

100,000 points, 1 time step

Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X



Barnes-Hut Scalability


DELAUNAY MESH REFINEMENT


Delaunay Mesh Refinement

• Refine “bad” triangles
  – Maintained in worklist

• Structure
  – Cautious operator on graph

• Variants
  – Flag optimized, locallifo

base: Priority.defaultOrder()

local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)


Cautious Optimization

base:

mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);

flag optimized:

mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);

• No need to save undo info
• Only check conflicts up to first write
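The two bullets above follow from the shape of a cautious operator: it touches its whole neighborhood before its first write. The toy sketch below illustrates why that removes the need for undo information. It is not the Galois runtime; `Cell`, `acquireAll`, and `runCautious` are hypothetical names, and ownership marks stand in for whatever conflict detection the real system uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class CautiousSketch {
    /** A toy shared element guarded by an ownership mark. */
    static final class Cell {
        final AtomicReference<Object> owner = new AtomicReference<>();
        int value;
        Cell(int value) { this.value = value; }
    }

    /** Read phase: claim every cell. On a conflict, release what was claimed and report failure. */
    static boolean acquireAll(Object iteration, List<Cell> neighborhood) {
        List<Cell> held = new ArrayList<>();
        for (Cell c : neighborhood) {
            if (c.owner.compareAndSet(null, iteration) || c.owner.get() == iteration) {
                held.add(c);
            } else {
                for (Cell h : held) {
                    h.owner.compareAndSet(iteration, null);
                }
                return false;
            }
        }
        return true;
    }

    /** Returns false if the iteration aborted; in that case no cell has been modified. */
    static boolean runCautious(Object iteration, List<Cell> neighborhood) {
        if (!acquireAll(iteration, neighborhood)) {
            return false; // abort happens before the first write, so there is nothing to undo
        }
        for (Cell c : neighborhood) {
            c.value += 1; // write phase: no conflict checks and no undo log needed
        }
        for (Cell c : neighborhood) {
            c.owner.compareAndSet(iteration, null);
        }
        return true;
    }
}
```

Because every conflict is detected in the read phase, the writes in the second phase correspond to the calls that can safely take `MethodFlag.NONE` in the slide.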


LIFO Optimization

base:

GaloisRuntime.foreach(..., Priority.defaultOrder());

local lifo:

GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
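One plausible reading of the local-LIFO policy is that initial work is handed out in FIFO chunks while newly created work stays on a thread-local stack and runs next. The single-threaded sketch below only illustrates the resulting processing order under that assumption; it is not Galois's implementation, and `LocalLifoSketch` and its parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

final class LocalLifoSketch {
    /**
     * Processes items chunk by chunk (global FIFO of chunks); items created by the
     * operator go on a local stack and are processed before older local work.
     * Returns the order in which items were processed.
     */
    static <T> List<T> run(List<T> initial, int chunkSize, Function<T, List<T>> operator) {
        Deque<List<T>> global = new ArrayDeque<>();
        for (int i = 0; i < initial.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, initial.size());
            global.addLast(new ArrayList<>(initial.subList(i, end)));
        }
        List<T> order = new ArrayList<>();
        Deque<T> local = new ArrayDeque<>();
        while (!global.isEmpty()) {
            List<T> chunk = global.pollFirst();
            // Push in reverse so the chunk's first item is on top of the stack.
            for (int i = chunk.size() - 1; i >= 0; i--) {
                local.push(chunk.get(i));
            }
            while (!local.isEmpty()) {
                T item = local.pop();
                order.add(item);
                for (T created : operator.apply(item)) {
                    local.push(created); // new work is handled next, while its data is likely still cached
                }
            }
        }
        return order;
    }
}
```

With initial items 1 to 4, a chunk size of 2, and an operator where item 1 creates items 10 and 11, the new items are processed right after item 1 and before item 2.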


ParaMeter Profile


DMR Results

0.5M triangles, 0.25M bad triangles

Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X


PREFLOW-PUSH


Preflow-push

• Max-flow algorithm
  – Nodes push flow downhill

• Structure
  – Cautious, local computation

• Variants
  – Flag optimized, local computation graph

base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)

base (relabel): Priority.first(ChunkedFIFO.class, 8)
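For readers unfamiliar with the algorithm, here is a minimal sequential sketch of FIFO push-relabel on a dense capacity matrix. It is a textbook formulation, not the Galois version measured in this deck: there is no parallelism, no bucketed scheduling, and no separate relabel pass, and all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PreflowPushSketch {
    /** Maximum flow from s to t; cap[u][v] is the capacity of edge u -> v. */
    static long maxFlow(long[][] cap, int s, int t) {
        int n = cap.length;
        long[][] flow = new long[n][n];
        long[] excess = new long[n];
        int[] height = new int[n];
        Deque<Integer> active = new ArrayDeque<>();

        height[s] = n;
        for (int v = 0; v < n; v++) {            // saturate every edge out of the source
            if (v != s && cap[s][v] > 0) {
                flow[s][v] = cap[s][v];
                flow[v][s] = -cap[s][v];
                excess[v] += cap[s][v];
                if (v != t) active.add(v);
            }
        }
        while (!active.isEmpty()) {
            int u = active.poll();
            while (excess[u] > 0) {              // discharge u
                int minHeight = Integer.MAX_VALUE;
                for (int v = 0; v < n && excess[u] > 0; v++) {
                    long residual = cap[u][v] - flow[u][v];
                    if (residual <= 0) continue;
                    if (height[u] == height[v] + 1) {   // push flow downhill
                        long delta = Math.min(excess[u], residual);
                        flow[u][v] += delta;
                        flow[v][u] -= delta;
                        excess[u] -= delta;
                        if (excess[v] == 0 && v != s && v != t) active.add(v);
                        excess[v] += delta;
                    } else {
                        minHeight = Math.min(minHeight, height[v]);
                    }
                }
                if (excess[u] > 0) {
                    height[u] = minHeight + 1;   // relabel: lift u just above its lowest residual neighbor
                }
            }
        }
        return excess[t];
    }
}
```

In the parallel setting, each discharge reads and updates only a node and its neighbors, which is what makes it a local computation on a graph whose structure does not change.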


Local Computation Optimization

base:

graph = ...

local computation:

graph = ...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();


ParaMeter Profile


Preflow-push Results

From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm

C: 11450 ms
Java: 30234 ms

Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X


Preflow-push Scalability


What performance did we expect?

[Figure: execution profile, threads vs. time, with regions labeled Idle, Serial, GC, parallel Compute, Mis-speculation, Synchronization, …, and Error; annotation: "Measured Indirectly"]


What performance did we expect?

• Naïve: r(x) = t_1 / x

• Amdahl: r(x) = t_p / x + t_s
  – t_1 = t_p + t_s
  – t_s = t_idle + t_gc + t_serial

• Simple: r(x) = (t_p (i_x / i_1)) / x + t_s
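The three models above can be compared side by side with a few lines of code. This is a sketch under assumptions: the deck does not define i_x and i_1, so they are taken here to be the number of iterations executed with x threads (including aborted ones) and with one thread. The sample numbers in `main` are invented for illustration and are not measurements.

```java
final class RuntimeModels {
    /** Naive model: the whole single-thread runtime t1 scales with the thread count x. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: only the parallel part tp scales; ts = idle + GC + serial time stays fixed. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /** Simple model: Amdahl, with the parallel part inflated by the extra iterations ix / i1. */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return tp * (ix / i1) / x + ts;
    }

    public static void main(String[] args) {
        double tp = 800, ts = 200;   // hypothetical times in ms, so t1 = tp + ts = 1000
        int x = 8;
        System.out.println("naive:  " + naive(tp + ts, x));
        System.out.println("amdahl: " + amdahl(tp, ts, x));
        System.out.println("simple: " + simple(tp, ts, 1250, 1000, x));
    }
}
```

The models are ordered by pessimism: each one adds a cost the previous one ignores, so for the same inputs the predicted runtime does not decrease from naive to Amdahl to simple when i_x ≥ i_1.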


Barnes-Hut


Delaunay Mesh Refinement


Preflow-push


Summary

• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants

• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms
