Work Stealing for Interactive Services to Meet Target Latency
Jing Li∗, Kunal Agrawal∗, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee∗, Chenyang Lu∗, Kathryn S. McKinley†
∗Washington University in St. Louis   †Microsoft Research
This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.





  • Interactive services must meet a target latency

    Interactive services: search, ads, games, finance

    Users demand responsiveness

    Problem setting: multiple requests arrive over time; each request is parallelizable.

    Latency = completion time − arrival time. Each request's latency should be less than a target latency T.

    Goal: maximize the number of requests that meet the target latency T (see the sketch below).
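    As a concrete reading of this objective, here is a minimal sketch; the struct and function names are illustrative, not from the paper:

        #include <vector>

        // A request meets the target if (completion - arrival) <= T.
        // The scheduler's goal is to maximize how many requests do.
        struct Request {
            double arrival_ms;
            double completion_ms;
        };

        int countMeetingTarget(const std::vector<Request>& reqs, double target_ms) {
            int met = 0;
            for (const auto& r : reqs)
                if (r.completion_ms - r.arrival_ms <= target_ms) ++met;
            return met;
        }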

  • Latency in Internet search

    Ø  In industrial interactive services, thousands of servers together serve a single user query.

    Ø  End-to-end latency ≥ latency of the slowest server

    [Figure: a search query is parsed, fanned out to many "Doc lookup & ranking" servers, and the results are aggregated with snippet generation; the end-to-end response time (~100 ms for the user to find the service responsive) determines the target latency of each server.]

  • Goal: Meet Target Latency in a Single Server

    Ø  Goal: design a scheduler to maximize the number of requests that can be completed within the target latency (in a single server)

    [Figure: the same search pipeline, highlighting a single "Doc lookup & ranking" server whose per-request deadline is the target latency.]

  • Large requests must execute in parallel to meet the target latency constraint

    Sequential execution is insufficient.

    [Figure: distribution of request sequential execution time (work) in ms, with the target latency marked; requests whose work exceeds the target latency cannot meet it without parallel execution.]

  • Full parallelism does not always work well

    Target latency: 90 ms
    Case 1: 1 large request (270 ms of work, arriving at time 0, so it must finish by time 90) + 3 small requests (60 ms of work each, arriving at time 20, so they must finish by time 110)

    [Figure: two schedules on 3 cores. Full parallelism: the large request occupies all three cores until time 90, so the small requests are waiting and finish at times 110, 130, and 150: ✖ misses 2 requests. Serializing the large request on one core: the small requests finish at times 50, 80, and 110, while the large request finishes at time 270: ✔ misses only 1 request.]

  • Some large requests require parallelism

    Target latency: 90 ms
    Case 2: 1 large request (270 ms of work, arriving at time 0, deadline 90) + 1 small request (60 ms of work, arriving at time 20, deadline 110)

    [Figure: two schedules on 3 cores. Serializing the large request: the small request finishes at time 80, but the large request finishes at time 270: ✖ misses 1 request. Full parallelism: the large request runs on all three cores and finishes at time 90, then the small request runs in parallel and finishes at time 110: ✔ misses 0 requests.]

  • Strategy: adapt scheduling to load

    Case 1 (high load): we cannot afford to run all large requests in parallel → run large requests sequentially (✔ miss 1 request instead of 2).

    Case 2 (low load): we do need to run some large requests in parallel → run all requests in parallel (✔ miss 0 requests).

    [Figure: the winning 3-core schedule from each case, repeated from the two previous slides.]

  • Why does the adaptive strategy work?

    Latency = processing time + waiting time

    At low load, processing time dominates latency:
    q  Parallel execution reduces request processing time
    q  All requests run in parallel

    At high load, waiting time dominates latency:
    q  Executing a large request in parallel increases the waiting time of many more later-arriving requests
    q  Each large request that is sacrificed helps reduce the waiting time of many more later-arriving requests


  • Challenge: which request to sacrifice?

    Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially

    Challenge 1: non-clairvoyance
    q  We do not know the work of a request when it arrives

    Challenge 2: no fixed definition of a large request
    q  "Large" is relative to the instantaneous load, e.g.:
        load = 10: large request > 180 ms
        load = 20: large request > 80 ms
        load = 30: large request > 20 ms
    (A sketch of such a load-indexed threshold table follows.)
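    To make the load-dependent notion of "large" concrete, here is a minimal sketch of a load-indexed threshold table; the class and method names are illustrative, not from the paper:

        #include <iterator>
        #include <limits>
        #include <map>

        // Maps instantaneous load (number of active requests) to the
        // large-request threshold in ms. Higher load => smaller threshold,
        // so more requests count as large.
        class ThresholdTable {
        public:
            void set(int load, double threshold_ms) { table_[load] = threshold_ms; }

            // Threshold for the current load, using the nearest entry at or
            // below it; at very low load, never serialize (infinite threshold).
            double lookup(int load) const {
                auto it = table_.upper_bound(load);
                if (it == table_.begin())
                    return std::numeric_limits<double>::infinity();
                return std::prev(it)->second;
            }
        private:
            std::map<int, double> table_;
        };

        // Using the values from this slide:
        //   ThresholdTable t;
        //   t.set(10, 180); t.set(20, 80); t.set(30, 20);
        //   t.lookup(25);  // -> 80 ms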

  • Contributions

    The tail-control scheduler = tail-control offline threshold calculation + tail-control online runtime.

    Input (available in highly engineered interactive services):
    q  Target latency T
    q  Request work distribution
    q  Requests per second (RPS)

    Offline threshold calculation: compute a large-request threshold for each load value, producing a large-request threshold table.

    Online runtime: use the threshold table to decide which requests to serialize.

    We modify work stealing to implement tail-control scheduling using Intel Threading Building Blocks (implementation details in the paper). (An illustrative sketch of the offline calculation follows.)
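    The paper gives the actual offline algorithm; purely as an illustration of its shape, the sketch below brute-forces, for one load value, the threshold minimizing a miss-count estimate over sampled request works. The estimateMisses callback is a hypothetical stand-in for the paper's cost model:

        #include <functional>
        #include <vector>

        // Illustrative only: try each observed work value as a candidate
        // large-request threshold and keep the one with the fewest
        // estimated target-latency misses at this load.
        double bestThresholdForLoad(
            const std::vector<double>& work_samples_ms,
            double target_latency_ms,
            int load,
            const std::function<double(double /*threshold*/, double /*target*/,
                                       int /*load*/)>& estimateMisses) {
            double best_threshold = target_latency_ms;
            double best_misses = estimateMisses(best_threshold, target_latency_ms, load);
            for (double candidate : work_samples_ms) {
                double misses = estimateMisses(candidate, target_latency_ms, load);
                if (misses < best_misses) {
                    best_misses = misses;
                    best_threshold = candidate;
                }
            }
            return best_threshold;
        }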

  • Tail-control scheduler

    Input → tail-control offline threshold calculation → threshold table → tail-control online runtime

    Runtime functionality:
    q  Execute all requests in parallel to begin with
    q  Record the total computation time spent on each request thus far
    q  Detect large requests by comparing each request's current processing time against the current threshold
    q  Serialize large requests to limit their impact on other waiting requests
    (A sketch of this detection step follows.)
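    A minimal sketch of the online detection step, not the paper's TBB implementation; threshold_us_for is assumed to consult the offline threshold table:

        #include <atomic>

        // Each request accumulates the processing time workers have spent
        // on it; once that exceeds the threshold for the current load, the
        // request is marked large and is no longer parallelized.
        struct RequestState {
            std::atomic<long> processed_us{0};    // worker time spent so far
            std::atomic<bool> serialized{false};  // once true, no more stealing
        };

        // Called by a worker after spending delta_us executing part of req.
        void chargeAndCheck(RequestState& req, long delta_us,
                            int active_requests,
                            long (*threshold_us_for)(int)) {
            long total = req.processed_us.fetch_add(delta_us) + delta_us;
            if (!req.serialized.load() &&
                total > threshold_us_for(active_requests)) {
                req.serialized.store(true);  // workers stop stealing its tasks
            }
        }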

  • Work stealing for a single request

    Ø  Workers' local queues
    q  Execute work, if there is any in the local queue
    q  Steal work from another worker's queue otherwise

    [Figure: workers 1, 2, 3 with local queues; a worker executes tasks of request A from its own queue, while an idle worker steals a task of A, further parallelizing it.]
    (A sketch of the worker loop follows.)
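    A minimal work-stealing worker loop, assuming one locked deque per worker (production runtimes such as TBB use more efficient lock-free deques; this is only a sketch):

        #include <deque>
        #include <functional>
        #include <mutex>
        #include <optional>
        #include <vector>

        using Task = std::function<void()>;

        struct WorkerQueue {
            std::mutex m;
            std::deque<Task> tasks;

            std::optional<Task> popNewest() {   // owner pops from the back
                std::lock_guard<std::mutex> g(m);
                if (tasks.empty()) return std::nullopt;
                Task t = std::move(tasks.back());
                tasks.pop_back();
                return t;
            }
            std::optional<Task> stealOldest() { // thief steals from the front
                std::lock_guard<std::mutex> g(m);
                if (tasks.empty()) return std::nullopt;
                Task t = std::move(tasks.front());
                tasks.pop_front();
                return t;
            }
        };

        // Each worker: run local work; when the local queue is empty, steal
        // from a victim, which further parallelizes the victim's request.
        void workerLoop(size_t self, std::vector<WorkerQueue>& queues) {
            for (;;) {
                if (auto t = queues[self].popNewest()) { (*t)(); continue; }
                for (size_t v = 0; v < queues.size(); ++v) {
                    if (v == self) continue;
                    if (auto t = queues[v].stealOldest()) { (*t)(); break; }
                }
            }
        }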

  • Generalize work stealing to multiple requests

    Ø  Workers' local queues + a global queue
    q  Execute work, if there is any in the local queue
    q  Steal: further parallelize a request
    q  Admit: start executing a new request

    Parallelizable requests arrive at the global queue.

    [Figure: requests B and C wait in the global queue while workers 1, 2, 3 execute and steal tasks of request A; an idle worker can either steal (parallelize) or admit a new request.]

  • Implement tail-control in TBB

    Ø  Workers' local queues + a global queue
    q  Execute work, if there is any in the local queue
    q  Steal: further parallelize a request
    q  Admit: start executing a new request

    The policies differ in what an idle worker tries first (see the sketch after this slide):
    Ø  Steal-first (tries to reduce processing time)
    Ø  Admit-first (tries to reduce waiting time)
    Ø  Tail-control:
    q  Steal-first + large-request detection & serialization

    [Figure: the same local-queue/global-queue diagram as the previous slide.]
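    A sketch of the three idle-worker policies; trySteal, tryAdmit, and serializeLargeRequests are hypothetical stubs standing in for the real runtime operations:

        enum class Policy { StealFirst, AdmitFirst, TailControl };

        // Stubs: in the real runtime these manipulate the local/global queues.
        bool trySteal()  { return false; }  // parallelize an admitted request
        bool tryAdmit()  { return false; }  // start a request from the global queue
        void serializeLargeRequests() {}    // mark over-threshold requests

        void onIdle(Policy p) {
            switch (p) {
            case Policy::StealFirst:         // reduce processing time first
                if (!trySteal()) tryAdmit();
                break;
            case Policy::AdmitFirst:         // reduce waiting time first
                if (!tryAdmit()) trySteal();
                break;
            case Policy::TailControl:        // steal-first, plus serialization
                serializeLargeRequests();    // of detected large requests
                if (!trySteal()) tryAdmit();
                break;
            }
        }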

  • Evaluation

    Ø  Various request work distributions:
    q  Bing search
    q  Finance server
    q  Log-normal

    Ø  Different request arrival processes:
    q  Poisson
    q  Log-normal

    Ø  Each setting: 100,000 requests; plot the target-latency miss ratio
    Ø  Two baselines (generalized from work stealing for a single job):
    q  Steal-first: tries to parallelize requests and reduce processing time
    q  Admit-first: tries to admit requests and reduce waiting time

  • Improvement in target-latency miss ratio

    [Figure: miss ratio of tail-control vs. the two baselines, swept from hard → easy target latencies and high → low relative load; lower is better. Among the baselines, admit-first wins at high load and steal-first wins at low load; tail-control improves on both.]

  • The inner workings of tail-control

    Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

    [Figure: per-request latencies with the target latency marked; serialized large requests exceed the target while many more small requests move below it.]

  • Tail-control performs well with inaccurate input

    A slightly inaccurate input work distribution is still useful.

    [Figure: miss ratio as the input work distribution goes from less → more inaccurate.]

  • Related work

    Parallelizing a single job to reduce latency:
    q  [Blumofe et al. 1995], [Arora et al. 2001], [Jung et al. 2005], [Ko et al. 2002], [Wang and O'Boyle 2009], …

    Interactive server parallelism optimizing for mean response time and tail latency:
    q  [Raman et al. 2011], [Jeon et al. 2013], [Kim et al. 2015], [Haque et al. 2015]

    Theoretical results on server scheduling for sequential and parallel jobs, optimizing for mean and maximum response time:
    q  [Chekuri et al. 2004], [Torng and McCullough 2008], [Fox and Moseley 2011], [Becchetti et al. 2006], [Kalyanasundaram and Pruhs 1995], [Edmonds and Pruhs 2012], [Agrawal et al. 2016]

  • Take home

    q  For non-clairvoyant interactive services, the request work distribution is helpful for designing schedulers.
    q  Given the work distribution, an offline threshold-calculation algorithm can compute a large-request threshold for every value of instantaneous load.
    q  We developed an adaptive scheduler that serializes requests according to the threshold table and demonstrated that it works well in practice.

    [Diagram: input → tail-control offline threshold calculation → threshold table → tail-control online runtime.]