Work Stealing for Interactive Services to Meet Target Latency

Jing Li*, Kunal Agrawal*, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee*, Chenyang Lu*, Kathryn S. McKinley†
*Washington University in St. Louis  †Microsoft Research

*This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.
Interactive services must meet a target latency

- Interactive services: search, ads, games, finance
- Users demand responsiveness
Problem setting

- Multiple requests arrive over time
- Each request is parallelizable
- Latency = completion time – arrival time; each request's latency should be less than a target latency T
- Goal: maximize the number of requests that meet the target latency T
Latency in Internet search

- In industrial interactive services, thousands of servers together serve a single user query
- End-to-end latency ≥ latency of the slowest server

[Pipeline diagram: parsing a search query → many "doc lookup & ranking" servers → result aggregation & snippet generation; the end-to-end response time (~100 ms for the user to find it responsive) must fit within the target latency]
Goal: meet target latency in a single server

- Goal: design a scheduler to maximize the number of requests that can be completed within the target latency (in a single server)

[Same pipeline diagram, zooming in on one "doc lookup & ranking" server and its target latency]
Sequential execution is insufficient

[Histogram of request sequential execution time in ms (work), with the target latency marked: some requests have more work than the target latency allows]

Large requests must execute in parallel to meet the target latency constraint.
Full parallelism does not always work well

Target latency: 90 ms. Case 1: 1 large request + 3 small requests, on 3 cores.
- Large request: 270 ms of work, arrives at time 0, so it must finish by time 90
- Small requests: 60 ms of work each, arrive at time 20, so they must finish by time 110

Full parallelism: the large request occupies all 3 cores from time 0 to 90 and meets its deadline, but the small requests are waiting; they finish at times 110, 130, and 150, so two of them miss.
✖ Miss 2 requests

Serializing the large request: it runs alone on core 1 and finishes at time 270, missing its deadline, while the small requests run in parallel on cores 2 and 3 and finish at times 50, 80, and 110, all meeting theirs.
✔ Miss 1 request
Some large requests require parallelism

Target latency: 90 ms. Case 2: 1 large request + 1 small request, on 3 cores.
- Large request: 270 ms of work, arrives at time 0, so it must finish by time 90
- Small request: 60 ms of work, arrives at time 20, so it must finish by time 110

Full parallelism: the large request occupies all 3 cores from time 0 to 90, and the small request finishes by time 110; both meet their deadlines.
✔ Miss 0 requests

Serializing the large request: it finishes at time 270 and misses its deadline, even though the small request finishes at time 80 with cores to spare.
✖ Miss 1 request
Strategy: adapt scheduling to load

- Case 1 (high load): we cannot afford to run all large requests in parallel → run large requests sequentially
- Case 2 (low load): we do need to run some large requests in parallel → run all requests in parallel

[Recap of the schedules above: serializing the large request misses only 1 request in Case 1, while full parallelism misses 0 requests in Case 2]
Why does the adaptive strategy work?

Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially.

Latency = processing time + waiting time

- At low load, processing time dominates latency
  - Parallel execution reduces request processing time
  - So all requests run in parallel
- At high load, waiting time dominates latency
  - Executing a large request in parallel increases the waiting time of later-arriving requests
  - Each large request that is sacrificed reduces the waiting time of many more later-arriving requests
Challenge: which request to sacrifice?

Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially.

- Challenge 1: non-clairvoyance
  - We do not know the work of a request when it arrives
- Challenge 2: no fixed definition of a "large" request
  - Large is relative to the instantaneous load, e.g.:
    load = 10 → large request > 180 ms
    load = 20 → large request > 80 ms
    load = 30 → large request > 20 ms
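As a concrete illustration, the per-load cutoffs in the example above can be encoded as a simple lookup table. This is a minimal C++ sketch; the name and layout are assumptions for illustration, not the paper's actual data structure:

```cpp
#include <map>

// Hypothetical threshold table: maps instantaneous load (number of active
// requests) to the processing-time cutoff in ms beyond which a request is
// treated as "large" and serialized. Values are taken from the example
// above; note that the cutoff shrinks as load grows.
const std::map<int, double> kLargeRequestThresholdMs = {
    {10, 180.0},  // at load 10, a request is large once it exceeds 180 ms
    {20,  80.0},  // at load 20, once it exceeds 80 ms
    {30,  20.0},  // at load 30, once it exceeds 20 ms
};
```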
Contributions

Tail-control scheduler:
- Input: target latency T, request work distribution, and requests per second (RPS)
  - Available in highly engineered interactive services
- Tail-control offline threshold calculation: computes a large-request threshold for each load value, producing a large-request threshold table
- Tail-control online runtime: uses the threshold table to decide which requests to serialize
- Implementation: we modify work stealing in Intel Threading Building Blocks (TBB) to implement tail-control scheduling (implementation details in the paper)
Tail-control scheduler

[Diagram: input → tail-control offline threshold calculation → threshold table → tail-control online runtime]

Runtime functionalities:
- Execute all requests in parallel to begin with
- Record the total amount of computation time spent on each request so far
- Detect large requests based on the current threshold and the current processing time
- Serialize large requests to limit their impact on other waiting requests (see the sketch below)
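Here is a minimal C++ sketch of the detection-and-serialization step, assuming a threshold table like the one sketched earlier; all names (ThresholdTable, Request, check_and_serialize) are illustrative stand-ins, not the authors' TBB implementation:

```cpp
#include <atomic>
#include <map>

using ThresholdTable = std::map<int, double>;  // load -> cutoff in ms

struct Request {
    std::atomic<double> processing_ms{0.0};  // compute time spent so far
    std::atomic<bool>   serialized{false};   // once set, stop parallelizing
};

// Called periodically for each active request: once the time already spent
// on the request crosses the cutoff for the current load, flag the request
// as large so the scheduler stops parallelizing it.
void check_and_serialize(Request& req, const ThresholdTable& table, int load) {
    auto it = table.lower_bound(load);  // first entry with key >= load
    if (it == table.end()) return;      // load beyond the table: no change
    if (req.processing_ms.load() > it->second)
        req.serialized.store(true);
}
```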
Work stealing for a single request

- Each worker has a local queue
  - Execute work, if there is any in the local queue
  - Steal from another worker otherwise, which further parallelizes the request

[Diagram: workers 1, 2, 3 with local queues; request A's tasks are executed locally and stolen by an idle worker to parallelize A]
Generalize work stealing to multiple requests

- Workers' local queues + a global queue
  - Execute work, if there is any in the local queue
  - Steal: further parallelize a request
  - Admit: start executing a new request
- Parallelizable requests arrive at the global queue

[Diagram: requests B and C wait in the global queue; workers 1, 2, 3 execute, steal (parallelize), or admit; an idle worker's decision loop is sketched below]
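An idle worker's decision loop could look like the following minimal C++ sketch. Task, pop_local, try_steal, and try_admit are hypothetical stand-ins, not TBB's API; as written, the loop steals before admitting, i.e., the steal-first policy described on the next slide (admit-first simply swaps those two lines):

```cpp
#include <optional>

// Hypothetical task type: running it executes one piece of some request.
struct Task {
    void run() { /* execute one piece of some request */ }
};

std::optional<Task> pop_local() { /* pop own deque */ return std::nullopt; }
std::optional<Task> try_steal() { /* steal from a peer's deque */ return std::nullopt; }
std::optional<Task> try_admit() { /* take a new request from the global queue */ return std::nullopt; }

void worker_loop() {
    for (;;) {
        if (auto t = pop_local()) { t->run(); continue; }  // execute local work
        if (auto t = try_steal()) { t->run(); continue; }  // parallelize a running request
        if (auto t = try_admit()) { t->run(); continue; }  // admit a new request
        // Otherwise idle; a real runtime would back off or sleep here.
    }
}
```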
Implement tail-control in TBB

- Same structure as above: workers' local queues + a global queue (execute / steal / admit)
- Steal-first: steal before admitting, to try to reduce processing time
- Admit-first: admit before stealing, to try to reduce waiting time
- Tail-control: steal-first + large-request detection & serialization (see the sketch below)
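The extra rule tail-control adds on top of steal-first can be pictured as a steal filter: a thief refuses to steal tasks of a request that has been flagged as large, so that request continues sequentially on the worker it already occupies. Again a hypothetical C++ sketch under assumed types, not the paper's code:

```cpp
#include <atomic>
#include <deque>

// Hypothetical task type: each task points at its owning request's flag.
struct Task {
    const std::atomic<bool>* request_serialized;
};

// A thief inspects the victim's oldest task; if its request has been marked
// large, the steal is refused (the thief tries another victim or the global
// queue instead), so the large request keeps running sequentially.
Task* try_steal_from(std::deque<Task*>& victim) {
    if (victim.empty()) return nullptr;
    Task* t = victim.front();                           // steal from the top
    if (t->request_serialized->load()) return nullptr;  // large: do not steal
    victim.pop_front();
    return t;
}
```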
Evaluation

- Various request work distributions
  - Bing search
  - Finance server
  - Log-normal
- Different request arrival processes
  - Poisson
  - Log-normal
- Each setting: 100,000 requests; we plot the target latency miss ratio
- Two baselines (generalized from work stealing for a single job)
  - Steal-first: tries to parallelize requests and reduce processing time
  - Admit-first: tries to admit requests and reduce waiting time
Improvement in target latency miss ratio

[Results charts: the x-axis sweeps relative load from high to low (hard → easy to meet the target latency); the y-axis shows the improvement in target latency miss ratio (higher is better). Among the baselines, admit-first wins at high load and steal-first wins at low load.]
The inner workings of tail-control

Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

[Latency distribution charts with the target latency marked, showing this shift]
Tail-control performs well with inaccurate input

[Chart: performance as the input work distribution goes from less to more inaccurate]

A slightly inaccurate input work distribution is still useful.
Related work

- Parallelizing a single job to reduce latency
  - [Blumofe et al. 1995], [Arora et al. 2001], [Jung et al. 2005], [Ko et al. 2002], [Wang and O'Boyle 2009], …
- Interactive server parallelism optimizing for mean response time and tail latency
  - [Raman et al. 2011], [Jeon et al. 2013], [Kim et al. 2015], [Haque et al. 2015]
- Theoretical results on server scheduling for sequential and parallel jobs, optimizing for mean and maximum response time
  - [Chekuri et al. 2004], [Torng and McCullough 2008], [Fox and Moseley 2011], [Becchetti et al. 2006], [Kalyanasundaram and Pruhs 1995], [Edmonds and Pruhs 2012], [Agrawal et al. 2016]
Take home

- For non-clairvoyant interactive services, the request work distribution is helpful for designing schedulers
- Given the work distribution, an offline algorithm can compute a large-request threshold for every value of instantaneous load
- We have developed an adaptive scheduler that serializes requests according to the threshold table, and demonstrated that it works well in practice