Building reliable real-time services with Apache DistributedLog
@sijieg
Logs are Everywhere
● DB Storage Engines - WAL
● DB Replication - Binlog, Log shipping
● Distributed Consensus - Replicated log
● Messaging/Pub-Sub - Kafka
Apache DistributedLog
Log Stream
An endless, totally ordered sequence of immutable records
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
Sequence Numbers - DLSN
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
DLSN - System Sequence Number
Sequence Numbers - Transaction ID
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
DLSN - System Sequence Number
Transaction ID - Application Sequence Number
E.g. Offset or Timestamp
Sequence Numbers - Sequence ID
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
DLSN - System Sequence Number
Transaction ID - Application Sequence Number
E.g. Offset or Timestamp
Sequence ID
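The difference between a system-assigned position and an application-assigned one can be sketched as below. This is an illustration only, not the DistributedLog API: the three-part DLSN layout (log segment sequence number, entry id, slot id) follows the project documentation as far as I recall it, and the `Record` class and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class DLSN:
    """System sequence number (assumed layout), compared field by field."""
    log_segment_seq_no: int
    entry_id: int
    slot_id: int


@dataclass(frozen=True)
class Record:
    dlsn: DLSN      # assigned by the system when the record is written
    txid: int       # supplied by the application, e.g. an offset or timestamp
    payload: bytes


# Hypothetical sample stream: two records batched into one entry, then two more.
records = [
    Record(DLSN(1, 0, 0), txid=1000, payload=b"a"),
    Record(DLSN(1, 0, 1), txid=1000, payload=b"b"),
    Record(DLSN(1, 1, 0), txid=1005, payload=b"c"),
    Record(DLSN(2, 0, 0), txid=1010, payload=b"d"),
]

# System positions are unique and strictly increasing along the stream.
dlsn_strictly_increasing = all(
    a.dlsn < b.dlsn for a, b in zip(records, records[1:])
)
# Application sequence numbers only need to be non-decreasing in this sketch.
txid_non_decreasing = all(
    a.txid <= b.txid for a, b in zip(records, records[1:])
)
```

Keeping both numbers lets a reader resume from an exact system position while an application can still seek by its own notion of order, such as a timestamp.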
Writer & Readers
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
New records added here
Tailing Reads (close to head of stream)
Catching-up Reads (rewind to any position)
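The two read patterns can be illustrated with a toy in-memory log. This is a minimal sketch under stated assumptions: `LogStream`, `read_from`, and `tail_position` are hypothetical names, and positions are plain list indexes rather than real sequence numbers.

```python
class LogStream:
    """Toy append-only log; positions are list indexes."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        self._records.append(payload)
        return len(self._records) - 1   # position of the new record

    def read_from(self, position, max_records=100):
        return self._records[position:position + max_records]

    def tail_position(self):
        # Position that the next appended record will receive.
        return len(self._records)


log = LogStream()
for i in range(5):
    log.append(f"record-{i}")

# Catching-up read: rewind to an older position and replay from there.
catch_up = log.read_from(1)

# Tailing read: start at the head of the stream and see only new records.
tail_pos = log.tail_position()
log.append("record-5")
tailing = log.read_from(tail_pos)
```

Here `catch_up` replays `record-1` through `record-4`, while `tailing` sees only `record-5`.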
Read Parallelism
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest]
Read from multiple positions in parallel
Log Segments
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest, split into segments]
Log Segment X
Log Segment X+1
Log Segment X+2
Log Segment Store
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest, split into segments]
Log Segment X
Log Segment X+1
Log Segment X+2
Apache BookKeeper
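How a stream breaks into segments can be sketched as follows. This is an illustration, not how DistributedLog or BookKeeper implement it: the size-based rolling policy, the `SegmentedLog` class, and its method names are assumptions made for the example.

```python
class SegmentedLog:
    """Toy stream that rolls to a new segment after a fixed number of records."""

    def __init__(self, max_records_per_segment):
        self.max_records = max_records_per_segment
        self.segments = [[]]          # the last segment is the one open for writes
        self.first_segment_no = 1

    def append(self, payload):
        if len(self.segments[-1]) >= self.max_records:
            self.segments.append([])  # roll: close the current segment, open a new one
        self.segments[-1].append(payload)
        segment_no = self.first_segment_no + len(self.segments) - 1
        return segment_no, len(self.segments[-1]) - 1


log = SegmentedLog(max_records_per_segment=3)
positions = [log.append(i) for i in range(7)]
```

With three records per segment, seven appends land in segments 1, 2, and 3. Because each segment is a self-contained unit, it can be placed on the segment store independently of its neighbours.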
Log Stream Metadata
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest, with one Writer and three Readers]
Stream Metadata
- List of segments
- Transaction Id Index
- Truncation point
- ...
Updates, Notifications
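One use of this metadata is finding where to start reading for a given transaction ID. The sketch below assumes each segment entry records its first transaction ID and that the truncation point is a segment number; `SegmentInfo`, `StreamMetadata`, and `segment_for_txid` are hypothetical names, not the real metadata schema.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class SegmentInfo:
    segment_no: int
    first_txid: int


@dataclass
class StreamMetadata:
    segments: list = field(default_factory=list)  # ordered by first_txid
    truncation_point: int = 0                     # segments below this number are gone

    def segment_for_txid(self, txid):
        """Return the live segment that should contain txid, or None."""
        live = [s for s in self.segments if s.segment_no >= self.truncation_point]
        if not live:
            return None
        keys = [s.first_txid for s in live]
        # Last segment whose first txid is <= the requested txid.
        i = bisect.bisect_right(keys, txid) - 1
        # A txid older than all live data maps to the earliest live segment.
        return live[max(i, 0)]


meta = StreamMetadata(
    segments=[SegmentInfo(1, 0), SegmentInfo(2, 100), SegmentInfo(3, 250)],
    truncation_point=2,
)
hit = meta.segment_for_txid(120)
latest = meta.segment_for_txid(300)
truncated = meta.segment_for_txid(50)
```

A reader asking for transaction ID 50 is sent to segment 2, the oldest segment that still exists after truncation.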
Namespace
[Figure: log stream of records 1 ... 7 and 11 ... 17, Oldest to Newest, with one Writer and three Readers]
/manhattan/stream-x...
/ads/stream_xxx
/ads/stream_yyy
Namespace
Lookup
Architecture
Metadata Store
Log Segment Store (BK)
- Segments
Cold Storage (HDFS)
Log Streams
- Abstraction & Naming
- Data Management
- Efficient Write & Read
- Intra-cluster & Geo Replication
- Raw Streams
Write Proxy, Read Proxy
- Ownership Tracking
- Batching, Compression
- Record Cache
- Rate Limiting, Quota
- Serving
Data Flow
Components: Write Client, Write Proxy, three Bookies, Read Proxy, three Read Clients
1. Write records
2. Transmit buffer
3. Flush - Write a batched entry to bookies
4. Acknowledge
5. Commit - Write Control Record
6. Long poll read
7. Speculative Read
8. Cache Records
9. Long poll read
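The write half of this flow (steps 1 to 5) can be sketched as a toy model. It is not the real proxy or BookKeeper client: `Bookie`, `WriteProxy`, the count-based flush trigger, and the `("CONTROL",)` marker are all hypothetical, and replication here is a simple loop over in-memory lists.

```python
class Bookie:
    """Toy storage node that keeps entries in memory."""

    def __init__(self):
        self.entries = []

    def add_entry(self, entry):
        self.entries.append(entry)
        return True


class WriteProxy:
    def __init__(self, bookies, batch_size):
        self.bookies = bookies
        self.batch_size = batch_size
        self.buffer = []     # transmit buffer
        self.acked = []      # records acknowledged back to the write client

    def write(self, record):                 # step 1: write records
        self.buffer.append(record)           # step 2: record joins the transmit buffer
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):                         # step 3: one batched entry to the bookies
        if not self.buffer:
            return
        entry = tuple(self.buffer)
        self.buffer = []
        if all(b.add_entry(entry) for b in self.bookies):
            self.acked.extend(entry)         # step 4: acknowledge
            self.commit()

    def commit(self):                        # step 5: write a control record
        for b in self.bookies:
            b.add_entry(("CONTROL",))


bookies = [Bookie(), Bookie(), Bookie()]
proxy = WriteProxy(bookies, batch_size=2)
for record in ["a", "b", "c"]:
    proxy.write(record)
```

After three writes with a batch size of two, `a` and `b` have been flushed as one entry and acknowledged, while `c` still waits in the buffer for the next flush.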
Consensus
Consensus - Primary Leader Approach
Consensus - Log Replication
Consensus - Safety Assurance
● Election Safety - CAS operation on metadata store
○ Log segment sequence numbers increase monotonically
○ A log segment sequence number is guaranteed to be handed over to a writer only once
● Log Segment Append-Only
○ A writer can only append entries to the log segment that is allocated to it
● Fencing - Termination mechanism of a log segment
○ No entries can be appended to a log segment once it is fenced
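These three properties can be sketched with a versioned compare-and-set and a fence flag. This is a simplified model, not the actual metadata store or BookKeeper fencing protocol; `MetadataStore`, `LogSegment`, and the writer names are hypothetical.

```python
class MetadataStore:
    """Toy versioned value supporting compare-and-set (CAS)."""

    def __init__(self):
        self.value, self.version = 0, 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False                       # someone else updated first
        self.value, self.version = new_value, self.version + 1
        return True


class LogSegment:
    def __init__(self, seq_no, owner):
        self.seq_no, self.owner = seq_no, owner
        self.entries, self.fenced = [], False

    def append(self, writer, entry):
        if self.fenced:
            raise RuntimeError("log segment is fenced")
        if writer != self.owner:
            raise PermissionError("segment is allocated to another writer")
        self.entries.append(entry)

    def fence(self):
        self.fenced = True


# Election safety: two writers race for the next segment sequence number.
store = MetadataStore()
seq_a, ver_a = store.read()
seq_b, ver_b = store.read()
won_a = store.compare_and_set(ver_a, seq_a + 1)
won_b = store.compare_and_set(ver_b, seq_b + 1)   # stale version, so it loses

# Append-only by the owner, until the segment is fenced.
segment = LogSegment(seq_no=store.read()[0], owner="writer-a")
segment.append("writer-a", "entry-0")
segment.fence()
try:
    segment.append("writer-a", "entry-1")
    rejected_after_fence = False
except RuntimeError:
    rejected_after_fence = True
```

Only one of the two racing writers obtains sequence number 1, and once the segment is fenced even its owner can no longer append.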
Use Cases
Architecture
Layers: Metadata Store, Log Segment Store (BK), Cold Storage (HDFS), Log Streams, Write Proxy, Read Proxy
Applications - Different Consumer models
- DBs, e.g. Twitter’s Manhattan
- Deferred RPC (queuing)
- Self-serve Pub/Sub
- Stream Computing
- Cross DC Replication
Database - Stronger Consistency
Stronger Consistency in Manhattan
[Figure: three MH Coordinators, a log stream of records 1 ... 7 and 11 ... 17 (Oldest to Newest), three MH Replicas, steps labeled 1, 2, 3]
Self-Serve Pub/Sub - Message Delivery
Partitioned Pub/Sub
[Figure: a Topic made of three partitions, each a log stream of messages 1 ... 7 and 11 ... 22]
New messages appended here
Reads from any position
- last position stored in offset store
- rewind to any position
- rewind by time (e.g. 15 mins ago)
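Resuming from a stored position and rewinding by time can be sketched as follows. This is an illustration under assumptions: the `Partition` class, the dictionary used as an offset store, and the timestamps in seconds are all hypothetical.

```python
import bisect


class Partition:
    """Toy partition; messages carry non-decreasing timestamps."""

    def __init__(self):
        self.timestamps = []
        self.messages = []

    def publish(self, timestamp, message):
        self.timestamps.append(timestamp)
        self.messages.append(message)

    def position_at(self, timestamp):
        # First position whose timestamp is >= the requested time.
        return bisect.bisect_left(self.timestamps, timestamp)

    def read(self, position):
        return self.messages[position:]


partition = Partition()
for ts, msg in [(100, "m0"), (160, "m1"), (220, "m2"), (280, "m3")]:
    partition.publish(ts, msg)

# Offset store: (subscriber, partition) -> next position to read.
offset_store = {("sub-1", "p0"): 3}

# Resume from the last stored position.
resumed = partition.read(offset_store[("sub-1", "p0")])

# Rewind by time: replay everything from 150 seconds before "now".
now = 300
rewound = partition.read(partition.position_at(now - 150))
```

The subscriber that resumes sees only `m3`, while the rewind replays `m1`, `m2`, and `m3`.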
Deferred RPC - Reliable Queuing
Reliable RPC System
[Figure: Web Server, RPC Queue (entries labeled E, D, A), two RPC Workers, Service A, Service B, Service C; steps labeled 1 to 4]
Scale at Twitter
Performance - Basic (GCP)
● Disk & Network Bound
● 1 Journal Disk + 5 Ledger Disks
● Each disk can write/read at ~220MB/second
● 6 log streams, 1 write proxy + 3 bookies
● 1 writer + 1 tailing reader => 2 million records/second
● 3 catch-up readers => 7.5 million records/second
● End-to-End Latency: within 30ms when network is around 30% utilized
Performance - Effect of Record Size
Applications at Twitter
● Manhattan Key/Value Store - Stronger Consistency
● Durable Deferred RPC - Journal
● Real-time search indexing - Change propagation
● Self-serve Pub/Sub - Message Delivery, Ads Pipeline
● Stream Computing
○ Source & Sink
○ Stateful Processing in Heron (coming soon)
● Reliable cross datacenter replication
● ...
Scale at Twitter
● O(1) trillion records per day, O(10) petabytes per day
● O(10) thousand streams, O(1) million live log segments
● O(10^2) bookies, O(10^3) proxies
● Record size from 100 bytes to 20KB to even more
● Data is kept from hours to days, even up to a year
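To put the daily totals in per-second terms, a back-of-envelope conversion is sketched below. It reads the order-of-magnitude figures as exactly 1 trillion records and 10 decimal petabytes per day, which is an assumption made only for this arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: take the order-of-magnitude figures at face value.
records_per_day = 10**12            # "O(1) trillion records per day"
bytes_per_day = 10 * 10**15         # "O(10) petabytes per day", decimal PB

records_per_second = records_per_day / SECONDS_PER_DAY
gigabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 10**9
average_record_bytes = bytes_per_day // records_per_day

print(f"{records_per_second / 1e6:.1f} million records/second on average")
print(f"{gigabytes_per_second:.1f} GB/second on average")
print(f"{average_record_bytes} bytes per record on average")
```

Under these assumptions the averages come to roughly 11.6 million records per second, about 116 GB per second, and about 10 KB per record, which sits inside the stated record-size range of 100 bytes to 20KB.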
Future
Not Just Messaging
● Stream - Events between services
○ Persistent
○ Rewindable
○ Replayable
○ Time independent
● Unification of Messaging and Storage
Apache DistributedLog (incubating)
● Open sourced on 05/09/2016.
● Landed at Apache Incubator on 06/25/2016.
● Website
○ http://distributedlog.io/
○ http://incubator.apache.org/projects/distributedlog.html
● Code - https://github.com/apache/incubator-distributedlog
Apache DistributedLog (incubating)
● Mail List - [email protected]
● Jira - https://issues.apache.org/jira/browse/DL
● Project Ideas - https://cwiki.apache.org/confluence/display/DL/Project+Ideas
● Paper: “DistributedLog: A high performance replicated log service” (ICDE 2017)
Appendix
● Kafka vs DistributedLog
Kafka vs DL - Overall
Kafka vs DL - Data Segmentation
Kafka vs DL - Data Retention
● Kafka
○ Time based Retention
○ Log compaction by keys
● DL
○ Time based Retention (messaging)
○ Explicit truncation (database, replicated state machines)
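The two DL retention modes can be contrasted with a small sketch. It assumes retention works on whole log segments and that each segment tracks its last write time and last transaction ID; the dictionary fields and function names are hypothetical.

```python
def time_based_retention(segments, now, ttl):
    """Keep segments whose newest record is younger than ttl (messaging style)."""
    return [s for s in segments if now - s["last_write_time"] < ttl]


def explicit_truncation(segments, truncate_before_txid):
    """Drop segments whose records all precede a txid chosen by the application."""
    return [s for s in segments if s["last_txid"] >= truncate_before_txid]


segments = [
    {"segment_no": 1, "last_txid": 99, "last_write_time": 1_000},
    {"segment_no": 2, "last_txid": 249, "last_write_time": 4_000},
    {"segment_no": 3, "last_txid": 400, "last_write_time": 7_000},
]

# Messaging: anything older than an hour (3600 s) at time 8000 expires.
kept_by_time = time_based_retention(segments, now=8_000, ttl=3_600)

# Database / replicated state machine: state up to txid 100 has been
# snapshotted, so the application asks to truncate everything before it.
kept_by_truncation = explicit_truncation(segments, truncate_before_txid=100)
```

Time-based retention is driven by the clock, while explicit truncation is driven by the application, which knows when older records are no longer needed for recovery.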
Kafka vs DL - Cluster Expansion
● Kafka - Partition Rebalance
○ Adding new brokers
○ Partitions outgrow brokers’ capacity
○ Adding new partitions
● DL
○ New log segments are automatically allocated to new storage nodes
○ Scaling proxies (cpu, memory) independent of scaling storage
Kafka vs DL - Writer
● Kafka
○ Multiple-Writers Semantic via Brokers
● DL
○ Multiple-Writers Semantic via Write Proxies (messaging)
○ Single-Writer Semantic using Core Library (database, replicated state machines)
■ Fencing, Exclusive Writer
Kafka vs DL - Reader
● Kafka
○ Both writes and reads are served by the leader brokers
○ Polling
● DL
○ Reads from any storage replicas
○ Long poll + Speculative Reads
Kafka vs DL - Replication Scheme
● Kafka
○ ISR Replication
○ Follower brokers catch up with the Leader broker
● DL
○ Quorum-Vote Replication
○ Ack Quorum is adjustable
○ Replication Repair
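The effect of an adjustable ack quorum can be sketched as below. This is a simplified, sequential model rather than the BookKeeper client: replicas are plain callables, `quorum_write` is a hypothetical helper, and replication repair is not modeled.

```python
def healthy_replica(store):
    """Build a replica that stores the entry and acknowledges it."""
    def add(entry):
        store.append(entry)
        return True
    return add


def unavailable_replica(entry):
    """A replica that never acknowledges."""
    return False


def quorum_write(replicas, entry, ack_quorum):
    """Send the entry to every replica; succeed once enough of them ack."""
    acks = sum(1 for add in replicas if add(entry))
    return acks >= ack_quorum


store_a, store_b = [], []
replicas = [healthy_replica(store_a), unavailable_replica, healthy_replica(store_b)]

# With 3 replicas and an ack quorum of 2, one failed replica does not block the write.
ok_with_two = quorum_write(replicas, "entry-1", ack_quorum=2)

# Requiring all 3 acks trades availability for more copies before acknowledging.
ok_with_three = quorum_write(replicas, "entry-2", ack_quorum=3)
```

In this sketch the first write succeeds and the second does not, even though both entries reached the two healthy replicas.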
Kafka vs DL - Storage/Durability
● Kafka
○ File (set of files) per partition
○ Only write to filesystem page cache
● DL (BookKeeper)
○ Interleaved Storage
○ All writes are persisted to disk via explicit fsync before being acknowledged
○ Physical I/O Isolation
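The "fsync before acknowledge" rule can be illustrated with standard file calls. This is a generic sketch of the durability pattern, not BookKeeper's journal code; the file name and the `durable_append` helper are hypothetical.

```python
import os
import tempfile


def durable_append(fd, data):
    """Write data, force it to disk, and only then report success."""
    view = memoryview(data)
    while view:                       # os.write may write fewer bytes than asked
        written = os.write(fd, view)
        view = view[written:]
    os.fsync(fd)                      # flush the OS page cache to the device
    return True                       # the acknowledgement happens after fsync


path = os.path.join(tempfile.mkdtemp(), "journal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    acked = durable_append(fd, b"entry-1\n")
finally:
    os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
```

A write that stops at the page cache returns sooner but can be lost if the machine crashes before the cache is flushed, which is the trade-off this slide contrasts.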