
TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS

Martin Maas*† Tim Harris† Krste Asanovic* John Kubiatowicz*
*University of California, Berkeley  †Oracle Labs, Cambridge


Why you should care about GARBAGE COLLECTION in Data Center Applications

Most Popular Languages 2015

5 out of the top 6 languages popular in 2015 use Garbage Collection (GC)

Popular Frameworks using GC


GC used by Cloud Companies


Why Managed Languages?

Productivity Gains

Avoiding Bugs

Enable Certain Optimizations

[Targeting Dynamic Compilation for Embedded Environments, Michael Chen and Kunle Olukotun, JVM'02]

What is the Cost of GC?


• GC overhead is workload- and heap-size-dependent: typically 5-20% on a single machine

• In distributed applications, additional overheads emerge: applications run across independent runtime systems.

ISCA’12: Cao et al.

[Diagram: Nodes #1-#4, each application running on its own independent runtime]

Two Example Workloads

Throughput-oriented, batch-style

Latency-sensitive, interactive


Spark Running PageRank


PageRank on 56 GB Wikipedia web graph

8-node cluster

Spark Running PageRank

Execution is divided into supersteps. Each superstep runs independent tasks.

8-node cluster

Spark Running PageRank

Red – Synchronization at end of superstep
Green – Major GC pause

GC prevents the superstep from completing; execution stalls due to GC on a different node.

Impact on Superstep Times

White = No GC during superstep
Dark = One or more GCs (the darker, the more GCs)

Nodes perform GC in different supersteps

Idea: Coordinate GC on different nodes

Trigger collection on all nodes when any one reaches a threshold

Policy: Stop-the-world Everywhere (STWE)
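As a rough illustration of the STWE idea (not the paper's actual implementation), here is a tiny simulation in which any node crossing an 80% occupancy threshold triggers a collection on every node at once; the `Node` class and the threshold value are hypothetical:

```python
THRESHOLD = 0.8  # illustrative occupancy cutoff

class Node:
    """Toy model of one JVM's heap; purely for illustration."""
    def __init__(self, name, heap_size):
        self.name = name
        self.heap_size = heap_size
        self.used = 0
        self.gc_count = 0

    def allocate(self, nbytes):
        self.used += nbytes

    def occupancy(self):
        return self.used / self.heap_size

    def trigger_gc(self):
        self.gc_count += 1
        self.used = 0  # assume a full collection empties the heap

def stwe_step(nodes):
    """If any node is above THRESHOLD, collect on all nodes together."""
    if any(n.occupancy() > THRESHOLD for n in nodes):
        for n in nodes:
            n.trigger_gc()

nodes = [Node(f"node{i}", heap_size=100) for i in range(4)]
nodes[2].allocate(90)   # one node crosses its GC threshold
stwe_step(nodes)
print([n.gc_count for n in nodes])  # [1, 1, 1, 1] - everyone collects together
```

Because all nodes collect in the same window, GC pauses align with the superstep boundary instead of straggling across it.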


Memory Occupancy over Time

Without STWE vs. With STWE

Impact of STWE Policy

Nodes perform GC in the same supersteps

Overall improvement in execution time (~15%)


YCSB Workload Generator

Cassandra with YCSB: 4-node Cassandra cluster, 3-way replicated

Requests are sent to an arbitrary node, which becomes the coordinator and contacts replicas to assemble a quorum.

Query Latencies over Time

Blue – mean latency over a 10 ms window
Grey bars – minor GC on any node in the cluster

Sources of Stragglers

1. Coordinator incurs GC during request
2. Node required for a quorum incurs GC
3. Non-GC reasons (e.g., anti-entropy)

GC-aware Work Distribution

Steer client requests to Cassandra nodes, avoiding ones that will need a minor collection soon

Policy: Request Steering (STEER)

YCSB Workload Generator

Steering Cassandra Requests

Monitor memory on all nodes

If one node is close to GC (>80% full), send requests to other nodes instead

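A minimal sketch of the steering decision described above, assuming the request router sees each node's young-generation occupancy. The 80% cutoff mirrors the slides, but the function and its fallback behavior are illustrative, not the prototype's actual API:

```python
STEER_THRESHOLD = 0.8  # ">80% full" cutoff from the slides

def choose_coordinator(occupancy):
    """Pick a coordinator for a request, avoiding nodes near a minor GC.

    occupancy: dict mapping node name -> fraction of young gen in use.
    Falls back to the least-full node if every replica is near collection.
    """
    safe = [n for n, occ in occupancy.items() if occ < STEER_THRESHOLD]
    candidates = safe if safe else list(occupancy)
    return min(candidates, key=lambda n: occupancy[n])

occ = {"n1": 0.85, "n2": 0.30, "n3": 0.95, "n4": 0.60}
print(choose_coordinator(occ))  # n2 - under the threshold and least full
```

Steering cannot remove the collections themselves, but it keeps a node's imminent minor GC off the critical path of individual requests.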

Impact of Request Steering (Reads / Updates)

Blue – without steering
Red – with steering

99.9th percentile: 3.3 ms -> 1.6 ms
Worst case: 83 ms -> 19 ms

Are These Problems Common?


• GC problems affect a large number of applications

• Have existed since dawn of warehouse-scale computing

• Current surge of interest in both industry and academia (6 new papers in last 4 mo.)

SOSP ’01, Welsh et al.

Common Solutions

Rewrite at lower level

Respond to GC pauses

Concurrent collectors

Cinnober on: GC pause-free Java applications through orchestrated memory management

Cinnober’s latest innovation captures the best of two worlds in a single state-of-the-art solution: a functionality-rich trading system with consistently low latency.

Predictable low latency

Transaction flow example

1. Incoming transaction

A request is received by the primary node's Ultra communication framework, providing pause-free processing. The transaction is assigned a sequence number and sent to the primary and standby servers for execution.

2. Primary request handling

A shared memory transport mechanism sends the transaction to the primary server. Expected latency for shared memory communication is on the order of 300 nanoseconds.

3. Standby request handling

The incoming transaction is replicated to the corresponding communication framework process on the standby side using RDMA (Remote Direct Memory Access) and from there to the standby execution server using shared memory transport. Expected latency for RDMA communication is on the order of 2-3 microseconds, depending on message size.

The communication framework sends an acknowledgement message back to the primary node to verify that the standby has received it.

4. Business logic, responses and broadcasts

The primary and standby servers both execute the request in parallel and provide identical transaction responses and broadcasts. The responses are sent via the shared memory transport to the Ultra communication framework. The standby side sends the response to the primary side's Ultra communication framework, again using RDMA communication.

5. Primary response handling

The primary side sends out the response as soon as it is received from either the primary or the standby, whichever arrives first.

This is the key to allowing orchestrated JVM pauses in the primary and standby servers.

Orchestration of JVM pauses

The Ultra communication framework handles the orchestration of JVM pauses.

Since the primary and standby servers execute the same code with essentially the same memory turnover behavior, the orchestration mechanism does not need to be overly complex. Measuring memory turnover at maximum transaction load, the orchestrator only needs to keep track of current memory usage and calculate when a garbage collection should occur. The initial garbage collection for one of the servers is requested so that future garbage collections will occur with optimal time spacing.
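The article gives no formula, but under its stated assumption (primary and standby have essentially the same memory turnover), the spacing idea can be sketched as follows: request the first collection on one replica early so that subsequent collections interleave half a GC period apart. All numbers and function names here are illustrative, not Cinnober's actual mechanism:

```python
def gc_times(heap_bytes, alloc_rate, horizon, offset=0.0):
    """Times (seconds) at which one replica fills its heap and must collect."""
    period = heap_bytes / alloc_rate          # seconds between collections
    t = offset if offset > 0 else period      # optional early first GC
    times = []
    while t <= horizon:
        times.append(t)
        t += period
    return times

heap, rate = 8e9, 1e9            # 8 GB heap, 1 GB/s allocation -> GC every 8 s
period = heap / rate
primary = gc_times(heap, rate, horizon=40)
standby = gc_times(heap, rate, horizon=40, offset=period / 2)
print(primary)   # [8.0, 16.0, 24.0, 32.0, 40.0]
print(standby)   # [4.0, 12.0, 20.0, 28.0, 36.0]
```

With the pauses interleaved, at least one replica is always pause-free, which is what lets the primary forward whichever response arrives first.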

Next step

Cinnober will publish an extensive white paper about orchestrated memory management in TRADExpress in conjunction with the coming version release that is scheduled for the second quarter of 2014.

• Rewrite in C/C++
• Use non-idiomatic language constructs

✘ Lose language advantages, lack of generality
✘ Substantial effort to adopt
✘ Performance overheads, still have pauses


No general, widely adopted solution!

The problem is not GC; it is language runtime system coordination


Current Approach

Language runtime systems are completely independent (not just GC).

[Diagram: a Cluster Scheduler above two machines, each running a Commodity OS with two applications (App #1-#4), each application on its own runtime]

This causes four problems:
- Intra-node interference
- Lack of coordination
- Redundancy (e.g., a separate JIT in every runtime)
- Elasticity

Holistic Runtime Systems

Apply distributed OS ideas to design a distributed language runtime system.

[Diagram: the same cluster, but App #1-#4 now run on a single Holistic Runtime System that spans the per-node runtimes under the Cluster Scheduler]

Our Prototype

• Coordinated runtime decisions using a feedback loop with distributed consensus
• Configured by a policy (written in a DSL)
• Drop-in replacement for the Java VM
• No modification of the application required
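A toy sketch of that feedback loop, with the consensus step and the policy DSL elided. A plain Python function stands in for a DSL policy, and every name here is hypothetical rather than the prototype's actual API:

```python
def stwe_policy(state, threshold=0.8):
    """An example policy: collect everywhere once any node is nearly full.

    state: dict mapping node name -> heap occupancy fraction.
    Returns a plan: dict mapping node name -> action.
    """
    if any(occ > threshold for occ in state.values()):
        return {node: "major-gc" for node in state}
    return {}

def control_loop_once(state, policy):
    """One loop iteration: observe the global state, decide, emit a plan.

    In the real system the plan would be agreed via distributed
    consensus and pushed back to each JVM as a reconfiguration.
    """
    return policy(state)

plan = control_loop_once({"n1": 0.5, "n2": 0.9, "n3": 0.4}, stwe_policy)
print(plan)  # {'n1': 'major-gc', 'n2': 'major-gc', 'n3': 'major-gc'}
```

Keeping the policy a pure function of the observed state is what makes it swappable: STWE and STEER are just different mappings from global state to per-node actions.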

System Design

[Diagram: Hotspot JVMs on Application Node 0 and Application Node 1, each with a Monitor keeping local state. Monitors report memory occupancy and state to the Holistic Runtime System; guided by a user-supplied Policy, it sends back plans, reconfigurations, and state updates.]

Why This Approach?

• Easy to adopt (just pick a policy; almost no configuration required)
• Minimally invasive to the runtime system
• Expressive (can express a large range of GC coordination policies)

Our plan is to make the system available as open source


Would you use it?


Thank you! Any Questions?

Martin Maas, Tim Harris, Krste Asanovic, John Kubiatowicz
{maas,krste,kubitron}@eecs.berkeley.edu, timothy.l.harris@oracle.com

Work started while at Oracle Labs, Cambridge.
