Building reliable real-time services with Apache DistributedLog @sijieg

Apache distributed log @ q con 2017


Page 1: Apache distributed log @ q con 2017

Building reliable real-time services with Apache DistributedLog

@sijieg

Page 2: Apache distributed log @ q con 2017

Logs are Everywhere

● DB Storage Engines - WAL

● DB Replication - Binlog, Log shipping

● Distributed Consensus - Replicated log

● Messaging/Pub-Sub - Kafka

Page 3: Apache distributed log @ q con 2017

Apache DistributedLog

Page 4: Apache distributed log @ q con 2017

Log Stream

An endless, totally ordered sequence of immutable records

Page 5: Apache distributed log @ q con 2017

Log Stream

[Figure: log stream of numbered records, ordered from oldest to newest]

Page 6: Apache distributed log @ q con 2017

Sequence Numbers - DLSN

[Figure: log stream of numbered records, ordered from oldest to newest]

DLSN - System Sequence Number

Page 7: Apache distributed log @ q con 2017

Sequence Numbers - Transaction ID

[Figure: log stream of numbered records, ordered from oldest to newest]

DLSN - System Sequence Number

Transaction ID - Application Sequence Number

E.g. Offset or Timestamp

Page 8: Apache distributed log @ q con 2017

Sequence Numbers - Sequence ID

[Figure: log stream of numbered records, ordered from oldest to newest]

DLSN - System Sequence Number

Transaction ID - Application Sequence Number

E.g. Offset or Timestamp

Sequence ID
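
A minimal sketch of how the three sequence numbers might sit side by side on one record. The DLSN layout as a (log segment sequence number, entry id, slot id) triple follows the DistributedLog documentation; the `Record` class, the reading of Sequence ID as a per-stream position, and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import NamedTuple


class DLSN(NamedTuple):
    """System-assigned position, compared field by field from left to right."""
    log_segment_seq_no: int
    entry_id: int
    slot_id: int


@dataclass(frozen=True)
class Record:
    dlsn: DLSN          # assigned by the system on write
    txid: int           # supplied by the application, e.g. an offset or timestamp
    sequence_id: int    # assumed here: position of the record within the stream
    payload: bytes


records = [
    Record(DLSN(1, 0, 0), txid=1000, sequence_id=0, payload=b"a"),
    Record(DLSN(1, 0, 1), txid=1000, sequence_id=1, payload=b"b"),
    Record(DLSN(2, 0, 0), txid=1007, sequence_id=2, payload=b"c"),
]

# Tuples compare lexicographically, so DLSN order matches stream order.
in_order = all(a.dlsn < b.dlsn for a, b in zip(records, records[1:]))
```

Note that two records may share a transaction ID (both carry 1000 above) while their DLSNs stay distinct.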

Page 9: Apache distributed log @ q con 2017

Writer & Readers

[Figure: log stream of numbered records, ordered from oldest to newest]

New records added here

Tailing Reads (close to head of stream)

Catching-up Reads (rewind to any position)
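
The difference between the two read modes can be sketched with an in-memory stand-in. `LogStream` and its methods are hypothetical names, not the DistributedLog API, and integer positions stand in for real sequence numbers.

```python
class LogStream:
    """In-memory stand-in for a log stream (hypothetical, not the DL API)."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        self._records.append(payload)
        return len(self._records) - 1   # position of the new record

    def read_from(self, position):
        return self._records[position:]

    def tail_position(self):
        return len(self._records)       # where the next record will land


stream = LogStream()
for i in range(5):
    stream.append(f"r{i}")

tail_start = stream.tail_position()      # tailing reader attaches at the head
stream.append("r5")

catch_up = stream.read_from(2)           # rewind to an arbitrary position
tailing = stream.read_from(tail_start)   # only records added after attaching
```

The catching-up reader replays history from position 2, while the tailing reader sees only what was appended after it attached.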

Page 10: Apache distributed log @ q con 2017

Read Parallelism

[Figure: log stream of numbered records, ordered from oldest to newest]

Read from multiple positions in parallel

Page 11: Apache distributed log @ q con 2017

Log Segments

[Figure: log stream of numbered records, ordered from oldest to newest, divided into Log Segment X, Log Segment X+1, and Log Segment X+2]

Page 12: Apache distributed log @ q con 2017

Log Segment Store

[Figure: log stream of numbered records, ordered from oldest to newest, divided into Log Segment X, Log Segment X+1, and Log Segment X+2]

Apache BookKeeper

Page 13: Apache distributed log @ q con 2017

Log Stream Metadata

[Figure: one Writer and three Readers attached to a log stream (oldest to newest), linked to the Stream Metadata by "Updates" and "Notifications" arrows]

Stream Metadata

- List of segments
- Transaction Id Index
- Truncation point
- ...
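
A rough sketch of how a segment list, a transaction ID index, and a truncation point could fit together. `Segment`, `StreamMetadata`, and their fields are hypothetical names; the real metadata layout is not shown on the slide.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Segment:
    seq_no: int        # log segment sequence number
    first_txid: int    # smallest transaction ID stored in the segment


@dataclass
class StreamMetadata:
    segments: list = field(default_factory=list)   # ordered by seq_no
    truncated_before: int = 0                      # seq_no of first live segment

    def add_segment(self, segment):
        self.segments.append(segment)

    def segment_for_txid(self, txid):
        """Return the live segment that would contain txid, or None."""
        live = [s for s in self.segments if s.seq_no >= self.truncated_before]
        idx = bisect_right([s.first_txid for s in live], txid) - 1
        return live[idx] if idx >= 0 else None

    def truncate(self, seq_no):
        """Mark every segment below seq_no as truncated."""
        self.truncated_before = max(self.truncated_before, seq_no)


meta = StreamMetadata()
for seq_no, first_txid in [(1, 100), (2, 200), (3, 300)]:
    meta.add_segment(Segment(seq_no, first_txid))

found = meta.segment_for_txid(250)     # falls inside segment 2
meta.truncate(2)
gone = meta.segment_for_txid(150)      # segment 1 is no longer readable
```

The binary search over first transaction IDs is what lets a reader jump to a position without scanning the stream.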

Page 14: Apache distributed log @ q con 2017

Namespace

[Figure: one Writer and three Readers attached to a log stream (oldest to newest), resolving stream names through a Namespace "Lookup"]

Namespace

/manhattan/stream-x...
/ads/stream_xxx
/ads/stream_yyy

Page 15: Apache distributed log @ q con 2017

Architecture

Metadata Store

- Segments: Log Segment Store (BK)

Page 16: Apache distributed log @ q con 2017

Architecture

Metadata Store

- Segments: Log Segment Store (BK)
- Raw Streams: Log Streams
  - Abstraction & Naming
  - Data Management
  - Efficient Write & Read
  - Intra-cluster & Geo Replication

Page 17: Apache distributed log @ q con 2017

Architecture

Metadata Store

- Segments: Log Segment Store (BK)
- Raw Streams: Log Streams
  - Abstraction & Naming
  - Data Management
  - Efficient Write & Read
  - Intra-cluster & Geo Replication
- Serving: Write Proxy, Read Proxy
  - Ownership Tracking
  - Batching, Compression
  - Record Cache
  - Rate Limiting, Quota

Page 18: Apache distributed log @ q con 2017

Architecture

Metadata Store

- Segments: Log Segment Store (BK), Cold Storage (HDFS)
- Raw Streams: Log Streams
  - Abstraction & Naming
  - Data Management
  - Efficient Write & Read
  - Intra-cluster & Geo Replication
- Serving: Write Proxy, Read Proxy
  - Ownership Tracking
  - Batching, Compression
  - Record Cache
  - Rate Limiting, Quota

Page 19: Apache distributed log @ q con 2017

Data Flow

[Figure: a Write Client, a Write Proxy, three Bookies, a Read Proxy, and three Read Clients, connected by the numbered steps below]

1. Write records
2. Transmit buffer
3. Flush - write a batched entry to bookies
4. Acknowledge
5. Commit - write control record
6. Long poll read
7. Speculative read
8. Cache records
9. Long poll read
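
The write half of this flow (steps 1 to 5) can be sketched as a toy model. Everything here is an assumption made for illustration: the `Bookie` and `WriteProxy` classes, the `batch_size` trigger, the synchronous write to every bookie, and the control record payload, which is only a placeholder.

```python
class Bookie:
    """Toy storage node that keeps entries in memory."""

    def __init__(self):
        self.entries = []

    def add_entry(self, entry):
        self.entries.append(entry)
        return True


class WriteProxy:
    def __init__(self, bookies, batch_size=3):
        self.bookies = bookies
        self.batch_size = batch_size
        self.buffer = []          # step 2: transmit buffer
        self.acked = []           # records acknowledged to the client

    def write(self, record):      # step 1: write records
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):              # step 3: one batched entry to every bookie
        if not self.buffer:
            return
        batch, self.buffer = tuple(self.buffer), []
        if all(b.add_entry(("data", batch)) for b in self.bookies):
            self.acked.extend(batch)                  # step 4: acknowledge
            for b in self.bookies:                    # step 5: commit
                b.add_entry(("control", len(self.acked)))


bookies = [Bookie() for _ in range(3)]
proxy = WriteProxy(bookies, batch_size=2)
for record in ["a", "b", "c"]:
    proxy.write(record)
proxy.flush()                     # push out the partially filled buffer
```

Batching trades a little latency for fewer, larger writes to the bookies, which is why the flush is a separate step from accepting the record.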

Page 20: Apache distributed log @ q con 2017

Consensus

Page 21: Apache distributed log @ q con 2017

Consensus - Primary Leader Approach

Page 22: Apache distributed log @ q con 2017

Consensus - Log Replication

Page 23: Apache distributed log @ q con 2017

Consensus - Safety Guarantees

● Election Safety - CAS operation on metadata store
○ Log segment sequence numbers increase monotonically
○ A log segment sequence number is guaranteed to be handed over to a writer only once
● Log Segment Append-Only
○ A writer can only append entries to the log segment that is allocated to it
● Fencing - Termination mechanism of a log segment
○ No entries can be appended to a log segment once it is fenced
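
These three rules can be sketched in a few lines. `MetadataStore` and `LogSegment` are hypothetical in-memory stand-ins; in a real deployment the compare-and-set would be a versioned update against the metadata store, and fencing would be carried out by the storage nodes.

```python
class MetadataStore:
    """Versioned cell with compare-and-set, standing in for the metadata store."""

    def __init__(self):
        self.version = 0
        self.last_segment_seq_no = 0

    def cas_allocate(self, expected_version):
        """Hand out the next segment number only if nobody raced us."""
        if expected_version != self.version:
            return None
        self.version += 1
        self.last_segment_seq_no += 1
        return self.last_segment_seq_no


class LogSegment:
    def __init__(self, seq_no, owner):
        self.seq_no, self.owner = seq_no, owner
        self.entries, self.fenced = [], False

    def append(self, writer, entry):
        if self.fenced:
            raise RuntimeError("segment is fenced")
        if writer != self.owner:
            raise PermissionError("segment belongs to another writer")
        self.entries.append(entry)

    def fence(self):
        self.fenced = True


store = MetadataStore()
seen = store.version                       # both writers read version 0
first = store.cas_allocate(seen)           # writer A wins: segment 1
second = store.cas_allocate(seen)          # writer B loses: stale version

segment = LogSegment(first, owner="A")
segment.append("A", "e0")
segment.fence()                            # e.g. a new writer takes over
try:
    segment.append("A", "e1")
    rejected = False
except RuntimeError:
    rejected = True
```

Because the losing writer gets nothing back, each sequence number has at most one owner, and fencing ensures the old owner cannot keep appending after a handover.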

Page 24: Apache distributed log @ q con 2017

Use Cases

Page 25: Apache distributed log @ q con 2017

Architecture

Metadata Store

- Segments: Log Segment Store (BK), Cold Storage (HDFS)
- Raw Streams: Log Streams
  - Abstraction & Naming
  - Data Management
  - Efficient Write & Read
  - Intra-cluster & Geo Replication
- Serving: Write Proxy, Read Proxy
  - Ownership Tracking
  - Batching, Compression
  - Record Cache
  - Rate Limiting, Quota
- Applications (different consumer models)
  - DBs - e.g., Twitter’s Manhattan
  - Deferred RPC (queuing)
  - Self-serve Pub/Sub
  - Stream Computing
  - Cross DC Replication

Page 26: Apache distributed log @ q con 2017

Database: Stronger Consistency

Page 27: Apache distributed log @ q con 2017

Stronger Consistency in Manhattan

[Figure: three MH Coordinators, a log stream (oldest to newest), and three MH Replicas, with steps numbered 1 to 3]

Page 28: Apache distributed log @ q con 2017

Self-Serve Pub/Sub: Message Delivery

Page 29: Apache distributed log @ q con 2017

Partitioned Pub/Sub

[Figure: a Topic made of three partitions, each a log stream of numbered records]

New messages appended here

Reads from any position
- last position stored in offset store
- rewind to any position
- rewind by time (e.g. 15 mins ago)
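
A small sketch of the two ways a subscriber might choose where to start reading. The `Partition` class, the dictionary used as an offset store, and the integer timestamps are all hypothetical; rewinding by time is shown as a binary search over per-message timestamps.

```python
from bisect import bisect_left


class Partition:
    def __init__(self):
        self.timestamps = []   # non-decreasing, one per message
        self.messages = []

    def append(self, timestamp, message):
        self.timestamps.append(timestamp)
        self.messages.append(message)

    def position_at(self, timestamp):
        """First position whose timestamp is >= the requested time."""
        return bisect_left(self.timestamps, timestamp)

    def read(self, position):
        return self.messages[position:]


offset_store = {}   # (subscriber, partition name) -> next position to read

partition = Partition()
for ts, msg in [(100, "m0"), (160, "m1"), (220, "m2"), (280, "m3")]:
    partition.append(ts, msg)

# Resume from the stored position (defaults to the beginning).
start = offset_store.get(("sub-1", "p0"), 0)
resumed = partition.read(start)
offset_store[("sub-1", "p0")] = start + len(resumed)

# Rewind by time: everything from timestamp 200 onward.
rewound = partition.read(partition.position_at(200))
```

Keeping the position outside the log means a rewind never touches the stored messages; it only changes where the next read begins.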

Page 30: Apache distributed log @ q con 2017

Deferred RPC: Reliable Queuing

Page 31: Apache distributed log @ q con 2017

Reliable RPC System

[Figure: a Web Server, an RPC Queue, two RPC Workers, and Services A, B, and C, with steps numbered 1 to 4]

Page 32: Apache distributed log @ q con 2017

Scale at Twitter

Page 33: Apache distributed log @ q con 2017

Performance - Basic (GCP)

● Disk & Network Bound
● 1 Journal Disk + 5 Ledger Disks
● Each disk can write/read at ~220 MB/second
● 6 log streams, 1 write proxy + 3 bookies
● 1 writer + 1 tailing reader => 2 million records/second
● 3 catch-up readers => 7.5 million records/second
● End-to-end latency: within 30 ms when the network is around 30% utilized

Page 34: Apache distributed log @ q con 2017

Performance - Effect of Record Size

Page 35: Apache distributed log @ q con 2017

Applications at Twitter

● Manhattan Key/Value Store - Stronger Consistency
● Durable Deferred RPC - Journal
● Real-time search indexing - Change propagation
● Self-serve Pub/Sub - Message Delivery, Ads Pipeline
● Stream Computing
○ Source & Sink
○ Stateful Processing in Heron (coming soon)
● Reliable cross datacenter replication
● ...

Page 36: Apache distributed log @ q con 2017

Scale at Twitter

● O(1) trillion records per day, O(10) petabytes per day

● O(10) thousand streams, O(1) million live log segments

● O(10^2) bookies, O(10^3) proxies

● Record size from 100 bytes to 20KB to even more

● Data is kept from hours to days, even up to a year

Page 37: Apache distributed log @ q con 2017

Future

Page 38: Apache distributed log @ q con 2017

Not Just Messaging

● Stream - Events between services

○ Persistent

○ Rewindable

○ Replayable

○ Time independent

● Unification of Messaging and Storage

Page 39: Apache distributed log @ q con 2017

Apache DistributedLog (incubating)

● Open sourced on 05/09/2016.
● Landed at Apache Incubator on 06/25/2016.
● Website
○ http://distributedlog.io/
○ http://incubator.apache.org/projects/distributedlog.html
● Code - https://github.com/apache/incubator-distributedlog

Page 40: Apache distributed log @ q con 2017

Apache DistributedLog (incubating)

● Mail List - [email protected]
● Jira - https://issues.apache.org/jira/browse/DL
● Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Ideas

● Paper: “DistributedLog: A high performance replicated log service” (ICDE 2017)

Page 41: Apache distributed log @ q con 2017

Q/A

● Twitter: @sijieg

● Email: [email protected]

Page 42: Apache distributed log @ q con 2017

Appendix

● Kafka vs DistributedLog

Page 43: Apache distributed log @ q con 2017

Kafka vs DL - Overall

Page 44: Apache distributed log @ q con 2017

Kafka vs DL - Data Segmentation

Page 45: Apache distributed log @ q con 2017

Kafka vs DL - Data Retention

● Kafka
○ Time based Retention
○ Log compaction by keys
● DL
○ Time based Retention (messaging)
○ Explicit truncation (database, replicated state machines)

Page 46: Apache distributed log @ q con 2017

Kafka vs DL - Cluster Expansion

● Kafka - Partition Rebalance
○ Adding new brokers
○ Partitions outgrow brokers’ capacity
○ Adding new partitions
● DL
○ New log segments are automatically allocated to new storage nodes
○ Scaling proxies (CPU, memory) is independent of scaling storage

Page 47: Apache distributed log @ q con 2017

Kafka vs DL - Writer

● Kafka
○ Multiple-Writers Semantic via Brokers
● DL
○ Multiple-Writers Semantic via Write Proxies (messaging)
○ Single-Writer Semantic using Core Library (database, replicated state machines)
■ Fencing, Exclusive Writer

Page 48: Apache distributed log @ q con 2017

Kafka vs DL - Reader

● Kafka
○ Both writes and reads are served by the leader brokers
○ Polling
● DL
○ Reads from any storage replicas
○ Long poll + Speculative Reads

Page 49: Apache distributed log @ q con 2017

Kafka vs DL - Replication Scheme

● Kafka
○ ISR Replication
○ Follower brokers catch up with the Leader broker
● DL
○ Quorum-Vote Replication
○ Ack Quorum is adjustable
○ Replication Repair
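
To make the adjustable ack quorum concrete, here is a simplified sketch. The `replicate`, `healthy`, and `failed` helpers are hypothetical, and replicas are contacted one after another here, whereas a real client would send to them in parallel.

```python
def replicate(entry, replicas, ack_quorum):
    """Send entry to every replica; succeed if ack_quorum of them accept."""
    acks = 0
    for replica in replicas:
        if replica(entry):
            acks += 1
    return acks >= ack_quorum


def healthy(store):
    """Build a replica that stores the entry and acknowledges it."""
    def write(entry):
        store.append(entry)
        return True
    return write


def failed(entry):
    return False   # a replica that never acknowledges


a, b = [], []
replicas = [healthy(a), healthy(b), failed]

ok_with_two = replicate("e0", replicas, ack_quorum=2)    # 2 of 3 acked
ok_with_three = replicate("e1", replicas, ack_quorum=3)  # needs all 3
```

With one replica down, a quorum of two still succeeds while a quorum of three does not, which is the availability versus durability trade-off that the adjustable setting exposes.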

Page 50: Apache distributed log @ q con 2017

Kafka vs DL - Storage/Durability

● Kafka
○ File (set of files) per partition
○ Only write to filesystem page cache
● DL (BookKeeper)
○ Interleaved Storage
○ All writes are persisted to disk via explicit fsync before they are acknowledged
○ Physical I/O Isolation
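
The "fsync before acknowledge" rule can be illustrated with plain file I/O. This is a generic sketch using Python's standard library, not BookKeeper's journal code; `durable_append` and the journal path are hypothetical.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload and return only after it has been fsynced to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)          # force the data out of the page cache
    finally:
        os.close(fd)
    return True               # the acknowledgement


journal = os.path.join(tempfile.mkdtemp(), "journal.log")
acked = durable_append(journal, b"entry-1\n")
```

Skipping the `os.fsync` call would leave the write in the page cache only, which corresponds to the Kafka behavior described in the comparison above.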