©2015 LinkedIn Corporation. All Rights Reserved.
Aditya Auradkar & Dong Lin
Motivation: Why is this important?
● Shared resources in a multi-tenant environment
● Bad clients can hurt others
– Bootstrapping consumers
– Buggy clients
● Better QoS for well-behaved clients
● Preserve throughput and latency for everyone else
● API Limits/Billing
Clients and Client-Ids
● Quotas are enforced per client-id
● Why client-id?
● No quotas per topic
● No quotas per topic * client-id combination
● Blanket produce and fetch quota for all clients
Quota Overrides
● Certain clients justify higher quotas
● Rolling bounces take too long and require too much effort
● Store overrides in ZooKeeper
● Brokers parse config change notifications
● Apply new quota immediately
Quota Overrides
{
  "version": 1,
  "config": {
    "producer_byte_rate": "1048576",
    "consumer_byte_rate": "1048576"
  }
}
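As a rough illustration of how an override document like the one above might be combined with the blanket default quota, here is a minimal Python sketch. The function name `effective_quotas` and the default values in `DEFAULT_QUOTAS` are assumptions made for this example; they are not taken from the broker code.

```python
import json

# Hypothetical broker-wide defaults in bytes/sec; the values are illustrative.
DEFAULT_QUOTAS = {
    "producer_byte_rate": 10 * 1024 * 1024,
    "consumer_byte_rate": 10 * 1024 * 1024,
}


def effective_quotas(override_json, defaults=DEFAULT_QUOTAS):
    """Merge a per-client override document over the default quotas."""
    quotas = dict(defaults)
    if override_json:
        config = json.loads(override_json).get("config", {})
        for key, value in config.items():
            if key in quotas:
                # The override document stores rates as strings.
                quotas[key] = int(value)
    return quotas


override = (
    '{"version": 1, "config": {"producer_byte_rate": "1048576", '
    '"consumer_byte_rate": "1048576"}}'
)
print(effective_quotas(override))  # both rates come from the override
print(effective_quotas(None))      # no override: the defaults apply
```

A client without an override simply keeps the defaults, which matches the "blanket quota plus overrides" model described on the earlier slides.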
Broker Metrics
● Metrics created for each client
● Clients can come and go
● Don’t need to retain client metrics forever
● GC metrics if inactive for longer than 1 hr
● Recreate if client reconnects
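The metric lifecycle above (create per client, drop after an hour of inactivity, recreate on reconnect) can be sketched as follows. This is only an illustration: the class `ClientMetrics` and its methods are hypothetical names, and a real broker would track rates rather than a plain byte counter.

```python
import time


class ClientMetrics:
    """Hypothetical registry that drops per-client metrics after inactivity."""

    def __init__(self, expiry_sec=3600.0, clock=time.monotonic):
        self.expiry_sec = expiry_sec
        self.clock = clock
        self._metrics = {}  # client_id -> {"bytes": int, "last_seen": float}

    def record(self, client_id, num_bytes):
        """Create the metric on first use (or after expiry) and update it."""
        now = self.clock()
        entry = self._metrics.setdefault(client_id, {"bytes": 0, "last_seen": now})
        entry["bytes"] += num_bytes
        entry["last_seen"] = now

    def expire_inactive(self):
        """Remove metrics for clients idle longer than expiry_sec."""
        now = self.clock()
        stale = [cid for cid, m in self._metrics.items()
                 if now - m["last_seen"] > self.expiry_sec]
        for cid in stale:
            del self._metrics[cid]
        return stale

    def bytes_for(self, client_id):
        """Return the recorded byte count, or None if no metric exists."""
        entry = self._metrics.get(client_id)
        return entry["bytes"] if entry else None
```

Passing the clock as a parameter is a design choice for this sketch: it lets the expiry logic be exercised without waiting an hour.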
Enforcement
● Reduce client throughput to desired rate
● Compute delay based on current throughput
● Small violations result in small delays
● Use smaller measurement windows to avoid long pauses
● Client-side metrics available to detect throttling
Delay Calculation
● Delay = W * (μ - Q) / μ
● W = window size, μ = observed rate, Q = desired rate
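A small worked example may make the formula concrete. The sketch below implements the expression exactly as written on this slide; the function name `throttle_delay` and the sample numbers are assumptions chosen for illustration.

```python
def throttle_delay(window_sec, observed_rate, quota_rate):
    """Delay = W * (mu - Q) / mu, or zero when the client is within quota."""
    if observed_rate <= quota_rate:
        return 0.0
    return window_sec * (observed_rate - quota_rate) / observed_rate


# Assumed numbers: W = 10 s, observed rate 20 MB/s, quota 10 MB/s.
print(throttle_delay(10.0, 20e6, 10e6))    # 5.0
# A small violation (10.5 MB/s against a 10 MB/s quota) gives a small delay.
print(throttle_delay(10.0, 10.5e6, 10e6))
```

Because the delay scales with W, a shorter measurement window bounds how long any single pause can be, which is the point made on the Enforcement slide.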
Enforcement
[Diagram: produce request path. Components: producer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. append
4. record metric
5. delay
6. dequeue
7. response
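The numbered produce path can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions, not the broker's implementation: `DelayQueue` and `handle_produce` are hypothetical names, the rate is a plain bytes-per-window average, and time is passed in explicitly instead of being read from a clock.

```python
import heapq
import itertools


class DelayQueue:
    """Hypothetical delay queue: holds responses until their release time."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal release times

    def put(self, release_time, response):
        heapq.heappush(self._heap, (release_time, next(self._counter), response))

    def pop_ready(self, now):
        """Step 6 (dequeue): return every response whose delay has elapsed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


def handle_produce(client_id, num_bytes, now, window_bytes, quota, window_sec, queue):
    """Steps 3 to 5: append, record the metric, and delay the response."""
    # 3. append to the log (omitted in this sketch)
    # 4. record metric: bytes seen in the current window (an assumption)
    window_bytes[client_id] = window_bytes.get(client_id, 0) + num_bytes
    observed = window_bytes[client_id] / window_sec
    # 5. delay: W * (mu - Q) / mu when over quota, otherwise no delay
    delay = window_sec * (observed - quota) / observed if observed > quota else 0.0
    queue.put(now + delay, f"response:{client_id}")
    return delay


queue = DelayQueue()
window_bytes = {}
# 20 MB in a 10 s window is 2 MB/s against a 1 MB/s quota.
print(handle_produce("a", 20_000_000, 0.0, window_bytes, 1_000_000, 10.0, queue))
print(queue.pop_ready(1.0))  # still held back
print(queue.pop_ready(5.0))  # 7. response is released
```

Note that the append happens before the delay: the data is already written, and only the response is held back.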
Enforcement
[Diagram: fetch request path. Components: consumer, request channel, replica manager, log, quota manager, delay queue]
1. request
2. process
3. fetch offsets
4. record metric
5. delay
6. dequeue
7. response (zero copy)
Slowdown vs Error
● Error handling is hard
● Tricky to implement backoff and retries
● All client implementations need to handle quota errors
● Need something easier
Getting Started
● Important broker configs
– quota.producer.default (in bytes/sec)
– quota.consumer.default (in bytes/sec)
● Apply overrides
./bin/kafka-configs.sh --alter --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' --entity-type clients --entity-name TestTopic --zookeeper localhost:2181
● Read overrides
./bin/kafka-configs.sh --describe --entity-type clients --entity-name TestTopic --zookeeper localhost:2181
Monitoring
● Producer metrics
– throttle-time avg and max
● Consumer metrics
– throttle-time avg and max
● Broker metrics
– byte-rate and avg throttle-time per client-id
– byte-rate is used for enforcement
● ZookeeperConsumerConnector and SimpleConsumer metrics also available
Rollout Strategy
● Deploy without enforcement
● Monitor metrics to track throughput for all clients
● Identify candidates for overrides
● Start with high thresholds
Evaluation
● Validate quota functionality
– broker-throughput <= sum(quota_of_clientId)
– sum(client-throughput) <= quota_of_clientId
● Evaluate performance improvement for clients
– Throughput and latency
– Clients with different throughput demand
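The two validation inequalities can be checked mechanically against measured throughput. The sketch below is one possible reading of them, assuming that several client instances may share a client-id and that throughput and quota use the same unit; the function name `validate_quotas` and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict


def validate_quotas(samples, quotas, tolerance=0.0):
    """Check per-client-id and broker-wide throughput against quotas.

    samples: list of (client_id, throughput) pairs, one per client instance.
    quotas:  mapping of client_id -> quota.
    Returns (violations, broker_ok).
    """
    per_id = defaultdict(float)
    for client_id, throughput in samples:
        per_id[client_id] += throughput

    # sum(client-throughput) <= quota_of_clientId, for each client-id
    violations = {cid: rate for cid, rate in per_id.items()
                  if rate > quotas[cid] * (1 + tolerance)}

    # broker-throughput <= sum(quota_of_clientId)
    broker_total = sum(per_id.values())
    quota_total = sum(quotas[cid] for cid in per_id)
    broker_ok = broker_total <= quota_total * (1 + tolerance)
    return violations, broker_ok


samples = [("a", 4.0), ("a", 5.0), ("b", 12.0)]
quotas = {"a": 10.0, "b": 10.0}
print(validate_quotas(samples, quotas))
```

A small non-zero `tolerance` may be useful in practice, since measured rates fluctuate around the quota between windows.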
Evaluation – Validate Quota Functionality
● Unlimited quota
[Figure panels: producer, consumer]
Evaluation – Validate Quota Functionality
● quota.producer.default = quota.consumer.default = 50 MBps
[Figure panels: producer, consumer]
Evaluation – Validate Quota Functionality
● quota.producer.default = quota.consumer.default = 10 MBps
[Figure panels: producer, consumer]
Evaluation – Client Performance Improvement
[Figure panels: small client running alone; clients join together; clients join in presence of quota; comparison]
Evaluation – Producer Performance Improvement
● Producer runs at 2 MBps alone (alone)
● Producer runs with other producers without quota (together)
● Producer runs with other producers with 10 MBps quota (quota)
[Figure: latency (ms) vs. time (sec), 0 to 600 sec, for the alone, together, and quota runs]
                Alone   Together   Quota
Latency (ms)    1.5     23.6       2.5
Evaluation – Consumer Performance Improvement
● Consumer runs alone (alone)
● Consumer runs alone with 50 MBps quota (alone-quota)
● Consumer runs with other consumers without quota (together)
● Consumer runs with other consumers with 50 MBps quota (quota)
[Figure: throughput (MBps) vs. time (sec), 0 to 600 sec, for the alone, alone-quota, together, and quota runs]
                     alone   alone-quota   together   quota
Throughput (MBps)    87      45            31         40
Evaluation - Summary
● Quota functionality is enforced
● Quotas improve performance for existing clients when large clients join
Future Work
● Throttle replica traffic (e.g. during bootstrap)
● Throttle more request types (OffsetCommitRequest etc.)
● Client-id authentication for use in multi-tenant environments
Acknowledgements
● LinkedIn Kafka Engineering team
● Confluent Inc
● John McClean (formerly at LI)