Work Stealing for Interactive Services to Meet Target Latency
Jing Li∗, Kunal Agrawal∗, Sameh Elnikety†, Yuxiong He†, I-Ting Angelina Lee∗, Chenyang Lu∗, Kathryn S. McKinley†
∗Washington University in St. Louis   †Microsoft Research
This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.





  • Interactive services must meet a target latency

    Interactive services: search, ads, games, finance

    Users demand responsiveness

    Problem setting: multiple requests arrive over time; each request is parallelizable.

    Latency = completion time − arrival time. Each request's latency should be less than a target latency T.

    Goal: maximize the number of requests that meet the target latency T (see the sketch below).
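    As a concrete reading of this objective, here is a minimal sketch; the struct and function names are illustrative, not from the paper:

        #include <vector>

        // A request meets the target if (completion - arrival) <= T.
        // The scheduler's goal is to maximize how many requests do.
        struct Request {
            double arrival_ms;
            double completion_ms;
        };

        int countMeetingTarget(const std::vector<Request>& reqs, double target_ms) {
            int met = 0;
            for (const auto& r : reqs)
                if (r.completion_ms - r.arrival_ms <= target_ms) ++met;
            return met;
        }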

  • Latency in Internet search

    Ø  In industrial interactive services, thousands of servers together serve a single user query.

    Ø  End-to-end latency ≥ latency of the slowest server

    [Figure: a search query is parsed, fanned out to many "Doc lookup & ranking" servers, and the results are aggregated with snippet generation; the end-to-end response time (~100 ms for the user to find the service responsive) determines the target latency of each server.]

  • Goal: Meet Target Latency in a Single Server

    Ø  Goal: design a scheduler to maximize the number of requests that can be completed within the target latency (in a single server)

    [Figure: the same search pipeline, highlighting a single "Doc lookup & ranking" server whose per-request deadline is the target latency.]

  • Large requests must execute in parallel to meet the target latency constraint

    Sequential execution is insufficient.

    [Figure: distribution of request sequential execution time (work) in ms, with the target latency marked; requests whose work exceeds the target latency cannot meet it without parallel execution.]

  • Full parallelism does not always work well

    Target latency: 90 ms
    Case 1: 1 large request (270 ms of work, arriving at time 0, so it must finish by time 90) + 3 small requests (60 ms of work each, arriving at time 20, so they must finish by time 110)

    [Figure: two schedules on 3 cores. Full parallelism: the large request occupies all three cores until time 90, so the small requests are waiting and finish at times 110, 130, and 150: ✖ misses 2 requests. Serializing the large request on one core: the small requests finish at times 50, 80, and 110, while the large request finishes at time 270: ✔ misses only 1 request.]

  • Some large requests require parallelism

    Target latency: 90 ms
    Case 2: 1 large request (270 ms of work, arriving at time 0, deadline 90) + 1 small request (60 ms of work, arriving at time 20, deadline 110)

    [Figure: two schedules on 3 cores. Serializing the large request: the small request finishes at time 80, but the large request finishes at time 270: ✖ misses 1 request. Full parallelism: the large request runs on all three cores and finishes at time 90, then the small request runs in parallel and finishes at time 110: ✔ misses 0 requests.]

  • Strategy: adapt scheduling to load

    Case 1 (high load): we cannot afford to run all large requests in parallel → run large requests sequentially (✔ miss 1 request instead of 2).

    Case 2 (low load): we do need to run some large requests in parallel → run all requests in parallel (✔ miss 0 requests).

    [Figure: the winning 3-core schedule from each case, repeated from the two previous slides.]

  • Why does the adaptive strategy work?

    Latency = processing time + waiting time

    At low load, processing time dominates latency:
    q  Parallel execution reduces request processing time
    q  All requests run in parallel

    At high load, waiting time dominates latency:
    q  Executing a large request in parallel increases the waiting time of many more later-arriving requests
    q  Each large request that is sacrificed helps reduce the waiting time of many more later-arriving requests


  • Challenge: which request to sacrifice?

    Strategy: when load is low, run all requests in parallel; when load is high, run large requests sequentially

    Challenge 1: non-clairvoyance
    q  We do not know the work of a request when it arrives

    Challenge 2: no fixed definition of a large request
    q  "Large" is relative to the instantaneous load, e.g.:
        load = 10: large request > 180 ms
        load = 20: large request > 80 ms
        load = 30: large request > 20 ms
    (A sketch of such a load-indexed threshold table follows.)
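    To make the load-dependent notion of "large" concrete, here is a minimal sketch of a load-indexed threshold table; the class and method names are illustrative, not from the paper:

        #include <iterator>
        #include <limits>
        #include <map>

        // Maps instantaneous load (number of active requests) to the
        // large-request threshold in ms. Higher load => smaller threshold,
        // so more requests count as large.
        class ThresholdTable {
        public:
            void set(int load, double threshold_ms) { table_[load] = threshold_ms; }

            // Threshold for the current load, using the nearest entry at or
            // below it; at very low load, never serialize (infinite threshold).
            double lookup(int load) const {
                auto it = table_.upper_bound(load);
                if (it == table_.begin())
                    return std::numeric_limits<double>::infinity();
                return std::prev(it)->second;
            }
        private:
            std::map<int, double> table_;
        };

        // Using the values from this slide:
        //   ThresholdTable t;
        //   t.set(10, 180); t.set(20, 80); t.set(30, 20);
        //   t.lookup(25);  // -> 80 ms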

  • Contributions

    The tail-control scheduler = tail-control offline threshold calculation + tail-control online runtime.

    Input (available in highly engineered interactive services):
    q  Target latency T
    q  Request work distribution
    q  Requests per second (RPS)

    Offline threshold calculation: compute a large-request threshold for each load value, producing a large-request threshold table.

    Online runtime: use the threshold table to decide which requests to serialize.

    We modify work stealing to implement tail-control scheduling using Intel Threading Building Blocks (implementation details in the paper). (An illustrative sketch of the offline calculation follows.)
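    The paper gives the actual offline algorithm; purely as an illustration of its shape, the sketch below brute-forces, for one load value, the threshold minimizing a miss-count estimate over sampled request works. The estimateMisses callback is a hypothetical stand-in for the paper's cost model:

        #include <functional>
        #include <vector>

        // Illustrative only: try each observed work value as a candidate
        // large-request threshold and keep the one with the fewest
        // estimated target-latency misses at this load.
        double bestThresholdForLoad(
            const std::vector<double>& work_samples_ms,
            double target_latency_ms,
            int load,
            const std::function<double(double /*threshold*/, double /*target*/,
                                       int /*load*/)>& estimateMisses) {
            double best_threshold = target_latency_ms;
            double best_misses = estimateMisses(best_threshold, target_latency_ms, load);
            for (double candidate : work_samples_ms) {
                double misses = estimateMisses(candidate, target_latency_ms, load);
                if (misses < best_misses) {
                    best_misses = misses;
                    best_threshold = candidate;
                }
            }
            return best_threshold;
        }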

  • Tail-control scheduler

    Input → tail-control offline threshold calculation → threshold table → tail-control online runtime

    Runtime functionality:
    q  Execute all requests in parallel to begin with
    q  Record the total computation time spent on each request thus far
    q  Detect large requests by comparing each request's current processing time against the current threshold
    q  Serialize large requests to limit their impact on other waiting requests
    (A sketch of this detection step follows.)
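    A minimal sketch of the online detection step, not the paper's TBB implementation; threshold_us_for is assumed to consult the offline threshold table:

        #include <atomic>

        // Each request accumulates the processing time workers have spent
        // on it; once that exceeds the threshold for the current load, the
        // request is marked large and is no longer parallelized.
        struct RequestState {
            std::atomic<long> processed_us{0};    // worker time spent so far
            std::atomic<bool> serialized{false};  // once true, no more stealing
        };

        // Called by a worker after spending delta_us executing part of req.
        void chargeAndCheck(RequestState& req, long delta_us,
                            int active_requests,
                            long (*threshold_us_for)(int)) {
            long total = req.processed_us.fetch_add(delta_us) + delta_us;
            if (!req.serialized.load() &&
                total > threshold_us_for(active_requests)) {
                req.serialized.store(true);  // workers stop stealing its tasks
            }
        }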

  • Work stealing for a single request

    Ø  Workers' local queues
    q  Execute work, if there is any in the local queue
    q  Steal work from another worker's queue otherwise

    [Figure: workers 1, 2, 3 with local queues; a worker executes tasks of request A from its own queue, while an idle worker steals a task of A, further parallelizing it.]
    (A sketch of the worker loop follows.)
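    A minimal work-stealing worker loop, assuming one locked deque per worker (production runtimes such as TBB use more efficient lock-free deques; this is only a sketch):

        #include <deque>
        #include <functional>
        #include <mutex>
        #include <optional>
        #include <vector>

        using Task = std::function<void()>;

        struct WorkerQueue {
            std::mutex m;
            std::deque<Task> tasks;

            std::optional<Task> popNewest() {   // owner pops from the back
                std::lock_guard<std::mutex> g(m);
                if (tasks.empty()) return std::nullopt;
                Task t = std::move(tasks.back());
                tasks.pop_back();
                return t;
            }
            std::optional<Task> stealOldest() { // thief steals from the front
                std::lock_guard<std::mutex> g(m);
                if (tasks.empty()) return std::nullopt;
                Task t = std::move(tasks.front());
                tasks.pop_front();
                return t;
            }
        };

        // Each worker: run local work; when the local queue is empty, steal
        // from a victim, which further parallelizes the victim's request.
        void workerLoop(size_t self, std::vector<WorkerQueue>& queues) {
            for (;;) {
                if (auto t = queues[self].popNewest()) { (*t)(); continue; }
                for (size_t v = 0; v < queues.size(); ++v) {
                    if (v == self) continue;
                    if (auto t = queues[v].stealOldest()) { (*t)(); break; }
                }
            }
        }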

  • Generalize work stealing to multiple requests

    Ø  Workers' local queues + a global queue
    q  Execute work, if there is any in the local queue
    q  Steal: further parallelize a request
    q  Admit: start executing a new request

    Parallelizable requests arrive at the global queue.

    [Figure: requests B and C wait in the global queue while workers 1, 2, 3 execute and steal tasks of request A; an idle worker can either steal (parallelize) or admit a new request.]

  • Implement tail-control in TBB

    Ø  Workers' local queues + a global queue
    q  Execute work, if there is any in the local queue
    q  Steal: further parallelize a request
    q  Admit: start executing a new request

    The policies differ in what an idle worker tries first (see the sketch after this slide):
    Ø  Steal-first (tries to reduce processing time)
    Ø  Admit-first (tries to reduce waiting time)
    Ø  Tail-control:
    q  Steal-first + large-request detection & serialization

    [Figure: the same local-queue/global-queue diagram as the previous slide.]
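    A sketch of the three idle-worker policies; trySteal, tryAdmit, and serializeLargeRequests are hypothetical stubs standing in for the real runtime operations:

        enum class Policy { StealFirst, AdmitFirst, TailControl };

        // Stubs: in the real runtime these manipulate the local/global queues.
        bool trySteal()  { return false; }  // parallelize an admitted request
        bool tryAdmit()  { return false; }  // start a request from the global queue
        void serializeLargeRequests() {}    // mark over-threshold requests

        void onIdle(Policy p) {
            switch (p) {
            case Policy::StealFirst:         // reduce processing time first
                if (!trySteal()) tryAdmit();
                break;
            case Policy::AdmitFirst:         // reduce waiting time first
                if (!tryAdmit()) trySteal();
                break;
            case Policy::TailControl:        // steal-first, plus serialization
                serializeLargeRequests();    // of detected large requests
                if (!trySteal()) tryAdmit();
                break;
            }
        }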

  • Evaluation

    Ø  Various request work distributions:
    q  Bing search
    q  Finance server
    q  Log-normal

    Ø  Different request arrival processes:
    q  Poisson
    q  Log-normal

    Ø  Each setting: 100,000 requests; plot the target-latency miss ratio
    Ø  Two baselines (generalized from work stealing for a single job):
    q  Steal-first: tries to parallelize requests and reduce processing time
    q  Admit-first: tries to admit requests and reduce waiting time

  • Improvement in target-latency miss ratio

    [Figure: miss ratio of tail-control vs. the two baselines, swept from hard → easy target latencies and high → low relative load; lower is better. Among the baselines, admit-first wins at high load and steal-first wins at low load; tail-control improves on both.]

  • The inner workings of tail-control

    Tail-control sacrifices a few large requests and reduces the latency of many more small requests to meet the target latency.

    [Figure: per-request latencies with the target latency marked; serialized large requests exceed the target while many more small requests move below it.]

  • Tail-control performs well with inaccurate input

    A slightly inaccurate input work distribution is still useful.

    [Figure: miss ratio as the input work distribution goes from less → more inaccurate.]

  • Related work

    Parallelizing a single job to reduce latency:
    q  [Blumofe et al. 1995], [Arora et al. 2001], [Jung et al. 2005], [Ko et al. 2002], [Wang and O'Boyle 2009], …

    Interactive server parallelism optimizing for mean response time and tail latency:
    q  [Raman et al. 2011], [Jeon et al. 2013], [Kim et al. 2015], [Haque et al. 2015]

    Theoretical results on server scheduling for sequential and parallel jobs, optimizing for mean and maximum response time:
    q  [Chekuri et al. 2004], [Torng and McCullough 2008], [Fox and Moseley 2011], [Becchetti et al. 2006], [Kalyanasundaram and Pruhs 1995], [Edmonds and Pruhs 2012], [Agrawal et al. 2016]

  • Take home

    q  For non-clairvoyant interactive services, the request work distribution is helpful for designing schedulers.
    q  Given the work distribution, an offline threshold-calculation algorithm can compute a large-request threshold for every value of instantaneous load.
    q  We developed an adaptive scheduler that serializes requests according to the threshold table and demonstrated that it works well in practice.

    [Diagram: input → tail-control offline threshold calculation → threshold table → tail-control online runtime.]