TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS
Martin Maas*† Tim Harris† Krste Asanovic* John Kubiatowicz*
*University of California, Berkeley †Oracle Labs, Cambridge


Page 1

TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS

Martin Maas*† Tim Harris† Krste Asanovic* John Kubiatowicz*
*University of California, Berkeley †Oracle Labs, Cambridge

Page 2

Why you should care about GARBAGE COLLECTION in Data Center Applications

Page 3

Most Popular Languages 2015

5 out of the top 6 languages popular in 2015 use Garbage Collection (GC)

Page 4

Popular Frameworks using GC

Page 5

GC used by Cloud Companies

Page 6

Why Managed Languages?

• Productivity Gains
• Avoiding Bugs
• Enable Certain Optimizations

[Targeting Dynamic Compilation for Embedded Environments, Michael Chen and Kunle Olukotun, JVM'02]

Page 7

What is the Cost of GC?

• GC overhead is workload- and heap-size dependent: 5-20% on a single machine (ISCA'12: Cao et al.)

• In distributed applications, additional overheads emerge; applications run across independent runtime systems:

[Diagram: Nodes #1-#4, each running its own independent runtime]

Page 8

Two Example Workloads

• Throughput-oriented, batch-style
• Latency-sensitive, interactive

Page 9

Page 10

Spark Running PageRank

PageRank on 56 GB Wikipedia web graph

8-node cluster

Page 11

Spark Running PageRank

Execution is divided into supersteps

8-node cluster

Page 12

Spark Running PageRank

Execution is divided into supersteps. Each superstep runs independent tasks.

8-node cluster

Page 13

Spark Running PageRank

Red – Synchronization at end of superstep

Page 14

Spark Running PageRank

Green – Major GC Pause

Red – Synchronization at end of superstep

GC prevents the superstep from completing
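To see why one node's pause stalls everyone, here is a minimal sketch of the bulk-synchronous pattern in plain Java (an illustration only, not Spark's code): no node can start the next superstep until every node reaches the barrier.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Minimal sketch of a bulk-synchronous superstep, not Spark's code.
// The barrier models the red synchronization phase: if a GC pause
// delays doTask() on any one node, every other node waits at await().
public class SuperstepBarrier {
    static final int NODES = 8; // the experiment uses an 8-node cluster

    public static void main(String[] args) {
        CyclicBarrier barrier = new CyclicBarrier(NODES,
                () -> System.out.println("superstep complete"));
        for (int i = 0; i < NODES; i++) {
            final int node = i;
            new Thread(() -> {
                try {
                    doTask(node);    // independent per-node work
                    barrier.await(); // wait for the slowest node
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    static void doTask(int node) { /* compute this node's partition */ }
}
```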

Page 15

Spark Running PageRank

Execution stalls due to GC on a different node

Page 16

Impact on Superstep Times

White = no GC during the superstep; dark = one or more GCs (the darker, the more GCs)

Nodes perform GC in different supersteps

Page 17

Idea: Coordinate GC on different nodes

Trigger collection on all nodes when any one node reaches a memory threshold.

Policy: Stop-the-World Everywhere (STWE)
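A minimal sketch of the STWE trigger, assuming a hypothetical cluster-wide coordination channel (the `Coordinator` interface below is an invention for illustration, not the paper's API):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Illustrative sketch of Stop-the-World Everywhere (STWE). The
// Coordinator interface is a hypothetical stand-in for the real
// cluster-wide coordination channel.
public class StweTrigger {
    static final double THRESHOLD = 0.8; // assumed occupancy trigger

    interface Coordinator {
        void broadcastGcRequest(); // every node then collects together
    }

    static boolean aboveThreshold() {
        MemoryUsage heap =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() can be -1 if no limit is set; assume -Xmx is configured.
        return (double) heap.getUsed() / heap.getMax() > THRESHOLD;
    }

    // Called periodically by a monitoring thread on each node.
    static void poll(Coordinator coordinator) {
        if (aboveThreshold()) {
            coordinator.broadcastGcRequest();
        }
    }
}
```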

Page 18

Memory Occupancy over Time

[Two panels: without STWE, with STWE]

Page 19

Impact of STWE Policy

Nodes now perform GC in the same supersteps (compare the "Before" heatmap)

Overall improvement in execution time (~15%)

Page 20

Page 21

Cassandra with YCSB

[Setup: YCSB workload generator driving a 4-node Cassandra cluster, 3-way replicated]

Requests are sent to an arbitrary node; that node becomes the coordinator and contacts replicas to assemble a quorum.
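To make the quorum dependence concrete, here is an illustrative sketch (not Cassandra's implementation): with 3-way replication, a quorum read needs 2 of 3 replicas, so a GC pause on the coordinator or on a required replica delays the whole request.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

// Illustrative only: a coordinator fans a read out to all replicas
// and unblocks once a majority quorum has answered.
public class QuorumRead {
    static void awaitQuorum(List<CompletableFuture<String>> replicaReads)
            throws InterruptedException {
        int quorum = replicaReads.size() / 2 + 1; // 2 of 3 replicas
        CountDownLatch latch = new CountDownLatch(quorum);
        for (CompletableFuture<String> read : replicaReads) {
            read.thenRun(latch::countDown); // one count per replica response
        }
        latch.await(); // a GC pause on a required replica stalls this
    }
}
```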

Page 22

Query Latencies over Time

Blue – mean latency over a 10 ms window
Grey bars – minor GC on any node in the cluster


Page 24

Sources of Stragglers

1. Coordinator incurs GC during a request
2. Node required for a quorum incurs GC
3. Non-GC reasons (e.g., anti-entropy)


Page 26

GC-aware Work Distribution

Steer client requests to Cassandra nodes, avoiding ones that will need a minor collection soon.

Policy: Request Steering (STEER)

Page 27

Steering Cassandra Requests

[Setup as before: YCSB workload generator driving the Cassandra cluster]

Monitor memory on all nodes. If one node is close to GC, send requests to other nodes instead.

Page 28

Steering Cassandra Requests

Monitor memory on all nodes. If one node is close to GC (>80% full), send requests to other nodes instead.
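A hedged sketch of this heuristic (the `occupancy` lookup stands in for the memory monitor and is not the paper's API):

```java
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Illustrative sketch of GC-aware request steering (STEER).
// `occupancy` is a hypothetical lookup into per-node monitor state.
public class GcAwareRouter {
    static final double NEAR_GC = 0.8; // the ">80% full" heuristic above

    static int pickNode(List<Integer> nodes, IntToDoubleFunction occupancy) {
        for (int node : nodes) {
            if (occupancy.applyAsDouble(node) < NEAR_GC) {
                return node; // steer to a node not about to collect
            }
        }
        return nodes.get(0); // all nodes near GC: fall back to any
    }
}
```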


Page 30

Impact of Request Steering

[Two panels: reads, updates] Blue – without steering; red – with steering

99.9th percentile: 3.3 ms -> 1.6 ms
Worst case: 83 ms -> 19 ms

Page 31

Are These Problems Common?

• GC problems affect a large number of applications

• They have existed since the dawn of warehouse-scale computing (SOSP '01, Welsh et al.)

• Current surge of interest in both industry and academia (6 new papers in the last 4 months)

Page 32

Common Solutions

• Rewrite at lower level
• Respond to GC pauses
• Concurrent collectors

[Excerpt shown on slide] Cinnober on: GC pause-free Java applications through orchestrated memory management

Cinnober's latest innovation captures the best of two worlds in a single state-of-the-art solution: a functionality-rich trading system with consistently low latency.

Predictable low latency

Transaction flow example

1. Incoming transaction
A request is received by the primary node's Ultra communication framework, providing pause-free processing. The transaction is assigned a sequence number and sent to the primary and standby servers for execution.

2. Primary request handling
A shared memory transport mechanism sends the transaction to the primary server. Expected latency for shared memory communication is on the order of 300 nanoseconds.

3. Standby request handling
The incoming transaction is replicated to the corresponding communication framework process on the standby side using RDMA (Remote Direct Memory Access) and from there to the standby execution server using shared memory transport. Expected latency for RDMA communication is on the order of 2-3 microseconds, depending on message size.
The communication framework sends an acknowledgement message back to the primary node to verify that the standby has received it.

4. Business logic, responses and broadcasts
The primary and standby servers both execute the request in parallel and provide identical transaction responses and broadcasts. The responses are sent via the shared memory transport to the Ultra communication framework. The standby side sends the response to the primary side's Ultra communication framework, again using RDMA communication.

5. Primary response handling
The primary side sends out the response as soon as it is received from either the primary or the standby, whichever arrives first. This is the key to allowing orchestrated JVM pauses in the primary and standby servers.

Orchestration of JVM pauses
The Ultra communication framework handles the orchestration of JVM pauses. Since the primary and standby servers execute the same code with essentially the same memory turnover behavior, the orchestration mechanism does not need to be overly complex. Measuring memory turnover at maximum transaction load, the orchestrator only needs to keep track of current memory usage and calculate when a garbage collection should occur. The initial garbage collection for one of the servers is requested so that future garbage collections will occur with optimal time spacing.

Next step
Cinnober will publish an extensive white paper about orchestrated memory management in TRADExpress in conjunction with the coming version release that is scheduled for the second quarter of 2014.

• Rewrite in C/C++
• Use non-idiomatic language constructs

✘ Lose language advantages, lack of generality
✘ Substantial effort to adopt
✘ Performance overheads, still have pauses
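The orchestration calculation described in the Cinnober excerpt above can be sketched as simple arithmetic (an assumption-laden illustration, not Cinnober's published algorithm): estimate when each server will next need to collect, and trigger one server early so future pauses alternate rather than coincide.

```java
// Illustrative arithmetic for staggering two servers' collections;
// the half-period offset is an assumption, not Cinnober's algorithm.
public class GcStagger {
    // Seconds until a server fills its heap at the current allocation rate.
    static double secondsUntilGc(double freeHeapBytes, double allocBytesPerSec) {
        return freeHeapBytes / allocBytesPerSec;
    }

    // If both servers collect every `periodSeconds`, requesting one
    // server's GC half a period early makes subsequent pauses alternate,
    // so a response is always available from the non-pausing server.
    static double earlyTriggerOffset(double periodSeconds) {
        return periodSeconds / 2.0;
    }
}
```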

Page 33

Common Solutions

• Rewrite at lower level
• Respond to GC pauses
• Concurrent collectors


No general, widely adopted solution!

Page 34

The problem is not GC, it is language runtime system coordination

Page 35

Current Approach

Language runtime systems are completely independent (not just GC).

[Diagram: a cluster scheduler above two nodes; each node runs a commodity OS hosting two apps, each on its own runtime]

Page 36

Current Approach

[Diagram annotation added: intra-node interference between runtimes on the same node]

Page 37

Current Approach

[Diagram annotation added: lack of coordination across nodes]

Page 38

Current Approach

[Diagram annotation added: redundancy, e.g., each runtime runs its own JIT]

Page 39

Current Approach

[Diagram annotation added: elasticity, with many apps and runtimes packed onto a node]

Page 40

Holistic Runtime Systems

Apply distributed OS ideas to design a distributed language runtime system.

[Diagram: cluster scheduler above four apps, each still on its own per-node runtime]

Page 41

Holistic Runtime Systems

Apply distributed OS ideas to design a distributed language runtime system.

[Diagram: the per-node runtimes are replaced by one distributed Holistic Runtime System spanning all nodes, beneath the cluster scheduler]

Page 42

Our Prototype

• Coordinated runtime decisions using a feedback loop with distributed consensus
• Configured by a policy (written in a DSL)
• Drop-in replacement for the Java VM
• No application modifications required
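The slide does not show the DSL itself, so here is a hedged Java rendering of what a policy plugged into such a feedback loop might look like; the interface and types are purely illustrative, not the prototype's actual DSL.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a policy maps monitored cluster state to a
// per-node action each round of the feedback loop.
public class PolicyExample {
    enum Action { RUN, TRIGGER_GC }

    interface Policy {
        Map<String, Action> decide(Map<String, Double> heapOccupancyByNode);
    }

    // STWE expressed as a policy: if any node is near its threshold,
    // every node collects in the same round.
    static final Policy STWE = state -> {
        boolean anyNear = state.values().stream().anyMatch(occ -> occ > 0.8);
        Map<String, Action> plan = new HashMap<>();
        for (String node : state.keySet()) {
            plan.put(node, anyNear ? Action.TRIGGER_GC : Action.RUN);
        }
        return plan;
    };
}
```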

Page 43

System Design

[Diagram: Application Nodes 0 and 1, each a Hotspot JVM with a Monitor and local State. Monitors send memory occupancy and state to the Holistic Runtime System; a user-supplied Policy computes plans, reconfigurations, and state updates that flow back to the nodes.]
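A minimal sketch of the per-node half of this feedback loop, under assumed names (`Channel` and the polling interval are inventions for illustration):

```java
// Illustrative monitor loop: report local heap occupancy upward and
// apply whatever plan the policy sends back. `Channel` is a
// hypothetical stand-in for the real transport.
public class MonitorLoop implements Runnable {
    interface Channel {
        void reportOccupancy(double occupancy);
        Runnable nextPlanStep(); // null when the policy has nothing new
    }

    private final Channel channel;

    MonitorLoop(Channel channel) { this.channel = channel; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            channel.reportOccupancy(currentHeapOccupancy()); // state up
            Runnable plan = channel.nextPlanStep();
            if (plan != null) plan.run();                    // plan down
            try {
                Thread.sleep(100); // assumed polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    static double currentHeapOccupancy() {
        Runtime rt = Runtime.getRuntime();
        return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
    }
}
```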

Page 44

Why This Approach?

• Easy to adopt (just pick a policy; almost no configuration required)
• Minimally invasive to the runtime system
• Expressive (can express a large range of GC coordination policies)

Page 45

Our plan is to make the system available as open source.

Page 46

Would you use it?

Page 47

Thank you! Any Questions?

Martin Maas, Tim Harris, Krste Asanovic, John Kubiatowicz
{maas,krste,kubitron}@eecs.berkeley.edu, [email protected]

Work started while at Oracle Labs, Cambridge.