Redundancy Does Not Imply Fault Tolerance - USENIX · Redundancy Does Not Imply Fault Tolerance:...

Preview:

Citation preview

Redundancy Does Not Imply Fault Tolerance:Analysis of Distributed Storage Reactions

to Single Errors and Corruptions

Aishwarya Ganesan, Ramnatthan Alagappan,

Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau

Redundancy for Fault Tolerance

2

Redundancy for Fault Tolerance

2

Replication can mask failures

Redundancy for Fault Tolerance

2

Replication can mask failures

- System crashes

Redundancy for Fault Tolerance

2

Replication can mask failures

- System crashes

- Machine reboots

Redundancy for Fault Tolerance

2

Replication can mask failures

- Network failures

- System crashes

- Machine reboots

Redundancy in Distributed Storage

3

Redundancy in Distributed Storage

3

Depend on local file systems to store data

Redundancy in Distributed Storage

3

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

- disk block corruption

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

- disk block corruption

File system may return I/O error on a read

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

- disk block corruption

File system may return I/O error on a read

- latent sector error

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

- disk block corruption

File system may return I/O error on a read

- latent sector error

We call these file-system faults

How about partial storage faults?

Depend on local file systems to store data

Redundancy in Distributed Storage

3

File system may return corrupted data on reads

- disk block corruption

File system may return I/O error on a read

- latent sector error

We call these file-system faults

How about partial storage faults?

Depend on local file systems to store data

Do distributed storage systems use redundancy to recover from local file-system faults?

Our Study

4

Our StudyBehavior of eight distributed systems in response to file-system faults

4

Our StudyBehavior of eight distributed systems in response to file-system faults

Broad spectrum of replication and consensus protocols

4

Our StudyBehavior of eight distributed systems in response to file-system faults

Broad spectrum of replication and consensus protocols

Replicated state machines- ZooKeeper (uses ZAB for consensus)

- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)

4

Our StudyBehavior of eight distributed systems in response to file-system faults

Broad spectrum of replication and consensus protocols

Replicated state machines- ZooKeeper (uses ZAB for consensus)

- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)

Primary backup replication- MongoDB

- Redis

- Kafka (in-sync replicas for leader election)

4

Our StudyBehavior of eight distributed systems in response to file-system faults

Broad spectrum of replication and consensus protocols

Replicated state machines- ZooKeeper (uses ZAB for consensus)

- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)

Primary backup replication- MongoDB

- Redis

- Kafka (in-sync replicas for leader election)

Dynamo-style quorum- Cassandra (decentralized, no leader/follower)

4

Fault model

5

Fault model

A single fault to a single file-system block in a single node

5

Fault model

A single fault to a single file-system block in a single node

Type of faults:

- corruptions

- read errors

- write errors

5

Common Expectation

6

Common Expectation

6

Common Expectation

6

Fault Model: A single fault to one block in only one replica

Common Expectation

6

Fault Model: A single fault to one block in only one replica

Redundancy would enable recovery from local file-system faults

7

Redundancy Does Not Imply Fault Tolerance

7

Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes

7

Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes

- data loss, corruption, unavailability, and spread of corruption to other intact replicas

7

Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes

- data loss, corruption, unavailability, and spread of corruption to other intact replicas

Silent corruption

Unavailability Data lossReduced

redundancyQuery

failures

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

7

Redundancy Does Not Imply Fault ToleranceA single fault in one node can cause catastrophic outcomes

- data loss, corruption, unavailability, and spread of corruption to other intact replicas

Silent corruption

Unavailability Data lossReduced

redundancyQuery

failures

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Why does Redundancy Not Imply Fault Tolerance?

8

Why does Redundancy Not Imply Fault Tolerance?

Some fundamental problems across multiple systems – not just implementation bugs!

8

Why does Redundancy Not Imply Fault Tolerance?

Some fundamental problems across multiple systems – not just implementation bugs!

Faults are often undetected locally – leads to harmful global effects

8

Why does Redundancy Not Imply Fault Tolerance?

Some fundamental problems across multiple systems – not just implementation bugs!

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

8

Why does Redundancy Not Imply Fault Tolerance?

Some fundamental problems across multiple systems – not just implementation bugs!

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

Crash and corruption handling are entangled – loss of committed data

8

Why does Redundancy Not Imply Fault Tolerance?

Some fundamental problems across multiple systems – not just implementation bugs!

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

Crash and corruption handling are entangled – loss of committed data

Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss

8

9

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

9

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

Fundamental Problems: Summary

Outline

Introduction

Fault Injection

System Behavior Analysis

Major Results

Observations Across Systems

Conclusion

10

Fault Model

11

Fault Model

11

Server 1 Server 2 Server 3

Request

Fault Model

11

Server 1 Server 2 Server 3Client

Request

Fault Model

11

Server 1 Server 2 Server 3Client

Request

Fault Model

11

Server 1 Server 2 Server 3Client

File System

read

/w

rite

File System

read

/w

rite

File System

read

/w

rite

Request

Fault Model

11

Server 1 Server 2 Server 3Client

File System

read

/w

rite

A single fault to a single file-system block in a single node

File System

read

/w

rite

File System

read

/w

rite

Request

Fault Model

11

Server 1 Server 2 Server 3Client

File System

read

/w

rite

A single fault to a single file-system block in a single node

Fault for current run:server 1, block B1 read corruption

File System

read

/w

rite

File System

read

/w

rite

Request

Fault Model

11

Server 1 Server 2 Server 3Client

File System

read

/w

rite

A single fault to a single file-system block in a single node

Fault for current run:server 1, block B1 read corruption

Fault for next run:server 1, block B1read error

File System

read

/w

rite

File System

read

/w

rite

Request

Fault Model

11

Server 1 Server 2 Server 3Client

File System

read

/w

rite

A single fault to a single file-system block in a single node

Faults injected only to user data not filesystem metadata

Fault for current run:server 1, block B1 read corruption

Fault for next run:server 1, block B1read error

File System

read

/w

rite

File System

read

/w

rite

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

File System File System File System

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

File System File System File Systemext4 ext4ext4

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

File System File System File Systemext4 ext4ext4

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

a

File System File System File Systemext4 ext4ext4

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

aC

orr

up

t d

ata

File System File System File Systemext4 ext4ext4

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

aC

orr

up

t d

ata

Ext4: disk corruption → corrupted data to applications

File System File System File Systemext4 ext4ext4

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Ext4: disk corruption → corrupted data to applications

File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

a

Ext4: disk corruption → corrupted data to applications

File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

a

Ext4: disk corruption → corrupted data to applications

File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs

I/O

Er

ror

Request

Fault Model: ext4 and btrfs

12

Server 1 Server 2 Server 3Client

Co

rru

pt

dat

a

Ext4: disk corruption → corrupted data to applications

Btrfs: disk corruption → I/O error to applications

File System File System File Systemext4 ext4ext4 btrfsbtrfs btrfs

I/O

Er

ror

Fault Injection Methodology

13

Server 1 Server 2 Server 3

File System File SystemFile System

Fault Injection Methodology

13

Server 1 Server 2 Server 3

File System File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

errfs - a FUSE file system to inject file-system faults

Fault Injection Methodology

13

Server 1 Server 2 Server 3

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

errfs - a FUSE file system to inject file-system faults

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

errfs - a FUSE file system to inject file-system faults

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

errfs - a FUSE file system to inject file-system faults

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

errfs - a FUSE file system to inject file-system faults

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

return B1’-B4

errfs - a FUSE file system to inject file-system faults

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

return B1’-B4

errfs - a FUSE file system to inject file-system faults

Local Behavior

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

return B1’-B4

errfs - a FUSE file system to inject file-system faults

Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

return B1’-B4

errfs - a FUSE file system to inject file-system faults

Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery Global Effect

Read

Fault Injection Methodology

13

Server 1 Server 2 Server 3Client

File System

Fault for current run:server 1, block B1 read corruption File SystemFile Systemerrfs (FUSE FS)errfs (FUSE FS) errfs (FUSE FS)

read B1-B4

read B1-B4

return B1-B4

return B1’-B4

errfs - a FUSE file system to inject file-system faults

Local BehaviorCrash RetryIgnore faulty dataNo detection/recovery Global Effect

CorruptionData lossUnavailability

Outline

Introduction

Fault Injection

System Behavior Analysis

Major Results

Observations Across Systems

Conclusion

14

System Behavior AnalysisBehavior of eight distributed systems in response to file-system faults

Broad spectrum of replication and consensus protocols

Replicated state machines- ZooKeeper (uses ZAB for consensus)

- LogCabin, CockroachDB, and RethinkDB (uses RAFT for consensus)

Primary backup replication- MongoDB

- Redis

- Kafka (in-sync replicas for leader election)

Dynamo-style quorum- Cassandra (decentralized, no leader/follower)

15

An Example: Redis

16

An Example: Redis

16

Redis is a popular data structure store

Leader

An Example: Redis

16

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

ClientWrite

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

ClientWrite

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

ClientWrite

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

ClientWrite

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

redis_database

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

redis_database

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

redis_database

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

redis_database

redis_database

redis_database

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Follower

Follower

Leader

An Example: Redis

16

redis_database

redis_database

redis_database

appendonlyfile

appendonlyfile

appendonlyfile

Redis is a popular data structure store

Redis: Behavior Analysis

17

Redis: Behavior Analysis

17

Read Workload

Redis: Behavior Analysis

17

Local Behavior

Read Workload

Redis: Behavior Analysis

17

CorruptRead

I/O Error

Local Behavior

Read Workload

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderL

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Local Behavior

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

UnavailabilityNo checksums to detect corruptionLeader crashes due to failed deserializationNo automatic failover - cluster unavailable

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

CorruptionNo checksums to detect corruptionLeader returns corrupted data on queriesCorruption propagation to followers

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Correct

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Correct

Redis: Behavior Analysis

17

CorruptRead

I/O Error

L LF F

Local Behavior

Read Workload

LeaderLFollowerF Crash

On-disk Structures

appendonlyfile.metadata

appendonlyfile.data

redis_database.block_0

redis_database.metadata

redis_database.userdata

Global Effect

CorruptRead

I/O Error

L LF F

Local Behavior

Global Effect

Unavailability

ReducedRedundancy

No Detection/ No Recovery

Corruption

Retry

WriteUnavailability

Correct

Other Systems

18

Other Systems

Metadata stores: ZooKeeper, LogCabin

Wide column store: Cassandra

Document stores: MongoDB

Distributed databases: RethinkDB, CockroachDB

Message Queues: Kafka

18

Outline

Introduction

Fault Injection

System Behavior Analysis

Major Results

Redundancy Does not Provide Fault Tolerance

Observations Across Systems

Conclusion

19

Redundancy Does not Provide Fault Tolerance

20

Redundancy Does not Provide Fault Tolerance

20

Redis Read

ZooKeeper Write

Kafka Read

RethinkDB Read

Cassandra ReadKafka Write

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

ZooKeeper WriteWrite Error

Kafka Read

RethinkDB ReadCorrupt

Cassandra ReadKafka Write

CorruptReadError Corrupt

ReadError

CorruptReadError

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

Corrupt

Cassandra ReadKafka Write

checkpoint

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corrupt

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

CorruptionCorrupt

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Data Loss

Corrupt

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

Corrupt

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Query Failure

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Query Failure

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Reduced Redundancy

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Query Failure

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Reduced Redundancy

Harmful global effects despite redundancy

Redundancy Does not Provide Fault Tolerance

20

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Query Failure

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Reduced Redundancy

Harmful global effects despite redundancyNot simple implementation bugs - fundamental problems across multiple systems!

OutlineIntroduction

Fault Injection

System Behavior Analysis

Major Results

Observations Across Systems

Faults are Often Undetected Locally

Crashing: Common Local Reaction

Crash and Corruption Handling are Entangled

Unsafe interaction between local and global protocols

Conclusion

21

Faults are Often Undetected Locally

22

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

22

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

Cassandra: Locally Undetected Fault

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

Client Replica 1 Other Replicas

Cassandra: Locally Undetected Fault

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

Cassandra: Locally Undetected Fault

sstable

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

Cassandra: Locally Undetected Fault

sstable

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

Cassandra: Locally Undetected Fault

sstable

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

Cassandra: Locally Undetected Fault

sstable

READ (R=1)

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

Cassandra: Locally Undetected Fault

sstable

CORRUPT

READ (R=1)

Read Repair

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

Cassandra: Locally Undetected Fault

sstable

CORRUPT

READ (R=1)

Read Repair

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

key | value

Cassandra: Locally Undetected Fault

sstable

CORRUPT

READ (R=1)

Read Repair

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

key | valueREAD

CORRUPT

Cassandra: Locally Undetected Fault

sstable

CORRUPT

READ (R=1)

Read Repair

sstable compression = offNo checksums to detect corruption

sstable

Faults are Often Undetected LocallyUndetected faults may lead to harmful global effect

- Locally undetected corruption → global silent corruption

22

key | value

Client Replica 1 Other Replicas

key | value

sstable.userdata corrupted

key | valueREAD

CORRUPT

Cassandra: Locally Undetected Fault

sstable

CORRUPT

READ (R=1)

Need for end-to-end integrity and error handling

Crashing - Common Local Reaction

23

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

Crash

Leader

Follower

L

F

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

Crash

Leader

Follower

L

F

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail

ZooKeeper

L F

Crash

Leader

Follower

L

F

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail

ZooKeeper

L F

Crash

Leader

Follower

L

F

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail

ZooKeeper

L F

Crash

Leader

Follower

L

F

Crashing leads to reduced redundancy and imminent unavailability

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail

ZooKeeper

L F

Crash

Leader

Follower

L

F

Crashing leads to reduced redundancy and imminent unavailabilityPersistent fault -- Requires manual intervention

collections.headercollections.metadatacollections.dataindexjournal.headerjournal.otherstorage_bsonwiredtiger_wt

Crashing - Common Local Reaction

23

Many systems that reliably detect fault simply crash on encountering faults

MongoDBBlock Corruption during Read Workloads

L F

epochepoch_tmpmyidlog.transaction_headlog.transaction_bodylog.transaction_taillog.remaininglog.tail

ZooKeeper

L F

Crash

Leader

Follower

L

F

Crashing leads to reduced redundancy and imminent unavailabilityPersistent fault -- Requires manual intervention

Redundancy underutilized!

Crash and Corruption Handling are Entangled

24

Crash and Corruption Handling are EntangledKafka Message Log

24

Crash and Corruption Handling are EntangledKafka Message Log

0

24

Crash and Corruption Handling are EntangledKafka Message Log

0

data

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1

Append(log, entry 2)

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1

Append(log, entry 2)

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

Lose uncommitted data

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

Lose uncommitted data

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Action: Truncate log at 0

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Action: Truncate log at 0

Lose uncommitted data

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Action: Truncate log at 0

Lose uncommitted data

Lose committed data!

0 1 2

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Action: Truncate log at 0

Lose uncommitted data

Lose committed data!

0 1 2

Developers of LogCabin and RethinkDB agree entanglement is the problem

24

Crash and Corruption Handling are EntangledKafka Message Log

0

checksum data

1 2

Append(log, entry 2)

Checksum mismatch

Checksum mismatch

Action: Truncate log at 1

Disk corruption

Action: Truncate log at 0

Lose uncommitted data

Lose committed data!

0 1 2

Developers of LogCabin and RethinkDB agree entanglement is the problem

24Need for discerning corruptions due to crashes from other type of corruptions

Unsafe Interaction between Local & Global Protocols

25

Unsafe Interaction between Local & Global Protocols

0 1 2

25

Kafka: Message log at Node 1

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatch

0 1 2

25

Kafka: Message log at Node 1

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0

0 1 2

25

Kafka: Message log at Node 1

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Node1 Other Nodes

0 1 2

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Node1 Other Nodes

0 1 2

Set of in-sync replicas

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Node1 Other Nodes

0 1 2

Set of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Node1 Other Nodes

0 1 2

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other NodesREAD

0 1 2

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

0 1 2

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Assertion failure

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Assertion failureWRITE (W=2)

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Assertion failure

Failure

WRITE (W=2)

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Assertion failure

Failure

WRITE (W=2)

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Unsafe interaction between local behavior and leader election protocol leads to data loss and write unavailability

Unsafe Interaction between Local & Global ProtocolsDisk corruptionChecksum mismatchAction: Truncate log at 0Lose committed data!0 1 2

25

Kafka: Message log at Node 1

Local Behavior

0 1 2

Client Node1 Other Nodes

message:0

[Silent data loss]

READ

Truncate upto message 0

0 1 2

Assertion failure

Failure

WRITE (W=2)

Leader FollowersSet of in-sync replicas

Node1 with truncated log not removed from in-sync replicas

Node 1 elected as leader

Need for synergy between local behavior and global protocol

Unsafe interaction between local behavior and leader election protocol leads to data loss and write unavailability

Redundancy Does not Provide Fault Tolerance

26

Redis Read

CorruptReadError

txn_headlog.tail

ZooKeeper WriteWrite Error

log.headerlog.otherreplication

L F L F

L F

L F L F

Kafka Read

aof.metadataaof.datardb.metadatardb.userdata

RethinkDB Read

db.txn_headdb.txn_bodydb.txn_taildb.metablock

L F

Corruption

Write Unavailability

Data Loss

UnavailabilityCorrupt

Query Failure

Cassandra ReadKafka Write

checkpointL F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

sstable.block0sstable.metadatasstable.userdatasstable.index

Reduced Redundancy

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Unsafe interaction between local and global protocols

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Unsafe interaction between local and global protocols

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Not simple implementation bugs - fundamental problems across multiple systems!

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Unsafe interaction between local and global protocols

Why does Redundancy Not Imply Fault Tolerance?

27

Redis Read

CorruptReadError

ZooKeeper Write

Write Error

L F L F

L F

L F L F

Kafka Read

RethinkDBRead

L F

Corrupt

Cassandra Read

Kafka Write

L F L F

CorruptReadError Corrupt

ReadError

CorruptReadError

Faults are often locally undetected

Not simple implementation bugs - fundamental problems across multiple systems!Redundancy underutilized as a source of recovery

Crashing on detecting faults is the common reaction

Crash and corruption handling are entangled

Unsafe interaction between local and global protocols

28

Fundamental Problems: Summary

28

Fundamental Problems: Summary

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

28

Fundamental Problems: Summary

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

28

Fundamental Problems: Summary

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

28

Fundamental Problems: Summary

More observations, results, and discussions in the paper …

Locally undetected

faults?

Crashing -common local

action?

Crash & corruption handling

entangled?

Unsafe interaction of local & global

protocols?

Redundancy underutilized for recovery?

Redis

ZooKeeper

Cassandra

Kafka

RethinkDB

MongoDB

LogCabin

CockroachDB

OutlineIntroduction

Fault Injection

System Behavior Analysis

Major Results

Observations Across Systems

Faults are Often Undetected Locally

Crashing: Common Local Reaction

Crash and Corruption Handling are Entangled

Unsafe interaction between local and global protocols

Conclusion

29

Summary

30

SummaryWe analyzed distributed storage reactions to single file-system faults

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

Some fundamental problems across multiple systems:

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

Some fundamental problems across multiple systems:

Faults are often undetected locally – leads to harmful global effects

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

Some fundamental problems across multiple systems:

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

Some fundamental problems across multiple systems:

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

Crash and corruption handling are entangled – loss of committed data

30

SummaryWe analyzed distributed storage reactions to single file-system faults

Redis, ZooKeeper, Cassandra, Kafka, MongoDB, LogCabin, RethinkDB, and CockroachDB

Redundancy does not provide fault tolerance

A single fault in one node can cause catastrophic outcomes

data loss, corruption, unavailability, and spread of corruption to other intact replicas

Some fundamental problems across multiple systems:

Faults are often undetected locally – leads to harmful global effects

On detection, crashing is the common action – redundancy underutilized

Crash and corruption handling are entangled – loss of committed data

Unsafe interaction between local behavior and global distributed protocols can spread corruption or data loss

30

Conclusion

31

Conclusion

Most distributed systems not yet resilient

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

Detecting faults and not using redundancy to recover is undesirable

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

Detecting faults and not using redundancy to recover is undesirable

Cannot always assume corruption to be caused by a crash

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

Detecting faults and not using redundancy to recover is undesirable

Cannot always assume corruption to be caused by a crash

Local behavior has implications for distributed systems

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

Detecting faults and not using redundancy to recover is undesirable

Cannot always assume corruption to be caused by a crash

Local behavior has implications for distributed systems

Our study provides directions for more robust distributed storage design

31

Conclusion

Most distributed systems not yet resilient

Always detect faults – important in layered stacks on commodity hardware

Detecting faults and not using redundancy to recover is undesirable

Cannot always assume corruption to be caused by a crash

Local behavior has implications for distributed systems

Our study provides directions for more robust distributed storage design

Our fault injection framework available online: http://research.cs.wisc.edu/adsl/Software/cords/

Thank you!31

Recommended