PRETZEL:Openingthe BlackBoxof Machine Learning Prediction ...pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-pretzel-qinyuan.pdf · Machine Learning Prediction Serving 1.Models are

PRETZEL:Opening the Black Box of Machine Learning Prediction ServingSystems

Presented by Qinyuan Sun

Slides are modified from first author Yunseong Lee’s slides

2

Outline

•Prediction Serving Systems

•Limitations of Black Box Approaches•PRETZEL: White-box Prediction ServingSystem•Evaluation•Conclusion

Machine Learning PredictionServing

1. Models are learned fromdata2. Modelsare deployed and served together

Predictionserving

Server UsersData

Training

ModelLearn Deploy

Performance goal:1) Low latency2) High throughput3) Minimal resourceusage

2

• Assumption: models are blackbox

• Re-use the same code in training phase

• Encapsulate alloperations

into a function call (e.g.,pred ic t ( ) )

• Apply externaloptimizations

ML Prediction ServingSystems:

State-of-the-art

Clipper TFServing ML.Net

PredictionServing

System

“Pretzel is tasty”

cat

car

Text

Analysis

Image

Recognition

Result

caching ensemble

4

Replication

Request

Batching

How does ModelsLook Like inside Boxes?

<Example: SentimentAnalysis>

Pretzel is tasty(text)

J vs. L(positive vs.negative)

5

Model

How do ModelsLook inside Boxes?

Pretzel is tasty

Ngram

Word Ngram


Tokenizer Concat J vs. L

DAG ofOperatorsFeaturizers

Char Predictor

Logistic Regression

6

Tokenizer

How do ModelsLook inside Boxes?


Pretzel is tasty

Char Ngram

Word Ngram

ConcatLogistic

Regression

DAG ofOperators

J vs. L

Split text into tokens

Extract N-grams

Merge two vectors

Compute final score

7

Many ModelsHave Similar Structures

• Many part of a model can be re-used in other models

• Customer personalization, Templates, TransferLearning

• Identical set of operators with differentparameters

8

9

Outline

•Prediction Serving Systems•Limitations of Black BoxApproaches

•PRETZEL: White-box Prediction ServingSystem•Evaluation•Conclusion

Limitation 1: ResourceWaste

•Resources are isolated across Blackboxes

1. Unable to share memoryspace

è Waste memory to maintain duplicate

objects (despite similarities between

models)

2. Nocoordination for CPU resources between boxes

è Serving many models can use too many threads

machine

10

Tokenizer

Char Ngram

Word Ngram

Concat Log Reg

Limitation 2: Inconsideration for Ops’Characteristics

1. Operators have different performancecharacteristics• Concat materializes avector• LogReg takes only 0.3% (contrary to the training phase)

2. There can be a better plan if such characteristics are considered• Re-use the existingvectors• Apply in-place update inLogReg

0%20

%

100%

40% 60%80%

Latency breakdown

23.1 34.2 32.7 9.6

CharNgram WordNgram Concat LogReg Others 0.3

11

Limitation 3: Lazy Initialization

• ML.Net initializes code and memory lazily (efficient in training phase)

• Run 250 SentimentAnalysis models 100 times

è cold: first execution / hot: average of the rest99

• Long-tail latency in the coldcase

• Code analysis, Just–in-time (JIT) compilation, memory allocation,etc

• Difficult to provide strong Service-Level-Agreement(SLA)

Tokenizer

Char

Ngram

Word

Ngram

ConcatLog

Reg 13x

444x

12

13

Outline

• (Black-box) Prediction ServingSystems•Limitations of Black BoxApproaches•PRETZEL: White-box Prediction ServingSystem

•Evaluation•Conclusion

14

PRETZEL: White-box PredictionServing

•We analyze models to optimize the internal execution

•We let models co-exist on the same runtime,

sharing computation and memoryresources

•We optimize models in twodirections:

1. End-to-end optimizations

2. Multi-model optimizations

15

End-to-End Optimizations

Optimize the execution of individualmodels from start to end

1. [Ahead-of-time Compilation]Compile operators’ code inadvanceà No JIToverhead

2. [Vector pooling]Pre-allocate data structuresà No memory allocation on the data path

16

Multi-model Optimizations

Share computation and memory acrossmodels

1. [Object Store]Share Operators parameters/weightsà Maintain only onecopy

2. [Sub-plan Materialization]

Reuse intermediate results computed by othermodelsà Save computation

System Components

1. Flour: IntermediateRepresentation

2. Oven: Compiler/Optimizer

var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();

3. Runtime: Execute inferencequeries

Runtime

Object Store

Scheduler

…

4. FrontEnd: Handle userrequests

FrontEnd

17

Prediction Serving withPRETZEL

1. Offline• Analyze structural information ofmodels• Build ModelPlan for optimalexecution• Register ModelPlan toRuntime

2. Online• Handle prediction requests• Coordinate CPU & memoryresources

RuntimeFrontEnd

RuntimeRegister

Model

Analyze

18

Model

Plan

System Design: OfflinePhase

Char Ngram

Tokenizer

Word Ngram

Concat Log Reg

1. Translate Model into FlourProgram

<Model>var fContext = new FlourContext(...) var tTokenizer = fContext.CSV

.FromText(fields, fieldsType, sep)

.Tokenize();

var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);var fPrgrm = tCNgram

.Concat(tWNgram)

.ClassifierBinaryLinear(cParams);

<Flour Program>

return fPrgrm.Plan();18


var tCNgram = tTokenizer.CharNgram(numCNgrms,...); var tWNgram =tTokenizer.WordNgram(numWNgrms, ...); var fPrgrm= tCNgram

.Concat(tWNgram)

.ClassifierBinaryLinear(cParams);return fPrgrm.Plan();

2. Oven optimizer/compiler build ModelPlan

<Flour Program>var fContext = new FlourContext(...) var tTokenizer = fContext.CSV


.Tokenize();

Push linearpredictor & RemoveConcat

Stage1

Stage2

Group ops intostages

Rule-based optimizer

S1

S2

<Model Plan>

Logical DAG

19

2. Oven optimizer/compiler build ModelPlan


var fContext = new FlourContext(...)var tTokenizer = fContext.CSV


.Tokenize();


.Concat(tWNgram)


return fPrgrm.Plan();

<Flour Program>

Push linearpredictor & RemoveConcat

Stage1

Stage2

Group ops intostages

Rule-based optimizer

S1

S2

var fContext = new FlourContext(...)var tTokenizer = fContext.CSV


.Tokenize();


.Concat(tWNgram)



var fContext = new FlourContext(...) var tTokenizer = fContext.CSV


.Tokenize();


.Concat(tWNgram)


e.g., Dictionary, N-gramLength

<Model Plan>

Logical DAG

Parameters

Statistics


e.g., dense vs.sparse,maximum vectorsize 20

3. ModelPlan is registered to Runtime


<Model Plan>

Logical DAG

Parameters

Statistics

Physical StagesS1 S2

LogicalStagesModel1

S1

S2

2. Find the most efficient physical impl. using params &stats

1. Store parameters& mapping between

logical stages

22

ObjectStore

3. ModelPlan is registered to Runtime


<Model Plan>

Logical DAG

Parameters

Statistics

Physical StagesS1 S2

Catalog

3. Register selected physical stages to

Catalog

LogicalStagesModel1

S1

S2

ObjectStoreN-gramlength

1vs. 3Sparse vs.Dense

2. Find the most efficient physical impl. using params &stats

1. Store parameters& mapping between

logical stages

23

System Design: OnlinePhase

Runtime

LogicalStagesModel1

S1

S2

Model2

S1’

S2’

4. Send resultback to Client

1. When aprediction request arrives

<Model1, “Pretzel istasty”>

3. Execute stagesusingthread-pools,

managed byScheduler

Physical Stages

2. Instantiate physical stages

along with parameters

ObjectStore

24

25

Outline

• (Black-box) Prediction ServingSystems•Limitations of Black boxapproaches•PRETZEL: White-box Prediction ServingSystem•Evaluation•Conclusion

26

Evaluation

• Q. How PRETZEL improves performance overblack-box approaches?• in terms of latency, memory andthroughput

• 500 Models from Microsoft Machine Learning Team• 250 Sentiment Analysis(Memory-bound)• 250 Attendee Count(Compute-bound)

• System configuration• 16 Cores CPU, 32GBRAM• Windows 10, .Net core2.0

Evaluation: Latency

• Micro-benchmark (No server-clientcommunication)

• Score 250 SentimentAnalysis models 100 times for each

• Compare ML.Net vs.PRETZEL

100 0.6 8.180

(%) 60

CDF 40

20 ML.Net ML.Net

(hot) (cold)010 2 10 1 100 101

Latency (ms, log-scaled)

ML.Net PRETZELP99 (hot)

P99 (cold)

Worst (cold)

ML.Net PRETZELP99 (hot) 0.6

P99 (cold) 8.1Worst (cold)

ML.Net PRETZELP99 (hot) 0.

6

0.2P99 (cold) 8.

1

0.8Worst (cold)

ML.Net PRETZEL

P99 (hot) 0.6 0.2

P99 (cold) 8.1 0.8

Worst (cold) 280.2 6.2

10 2 110 100 101

Latency (ms, log-scaled)

0

20

40

60

80

100

CD

F(%

)

PRETZEL

(hot)

PRETZEL

(cold

) 8.1

0.2 0.80.6

3x

10x

45xbetter

27

• Measure Cumulative Memory Usage after loading 250 models• Attendee Count models (smaller size than SentimentAnalysis)• 4 settings forComparison

Evaluation: Memory

better

Settings Shared

Objects

Shared

Runtime

ML.Net +Clipper

ML.Net ✓PRETZELwithout

ObjectStore✓

PRETZEL ✓ ✓

eg 32GBasU 10GBryed)emoc 1GBlasM-geovti(la 0.1GBlmuCu 10MB

0 50 100 150 200 250Number of pipelines

0Number of pipelines

10MB

0.1GB

1GBCu

mul

ativ

e M

emor

y U

sage

(log

-sca

led)

32GB

10GB

164MB

9.7GB3.7GB

2.9GB 25x 62x

28

PRETZEL

50 100 150 200 250

ML.Net

(w.o.ObjStore)

ML.Net +Clipper

Evaluation:Throughput

131 2 4 8Num. CPU Cores

0

5

10

15(ideal)

(ideal)

Thro

ughp

ut (

KQ

PS)

better

• Micro-benchmark

• Score 250Attendee Count models 1000 times for each

• Request 1000queries in a batch

• Compare ML.Net vs.PRETZEL

10x

Close toideal

scalability

More results

in thepaper!

29

30

Conclusion

• PRETZEL is the first white-box prediction serving system for ML pipelines

• By using models’ structural info, we enable two types of optimizations:• End-to-end optimizations generate efficient execution plans for a model• Multi-model optimizations let models share computation and memory resources

• Our evaluation shows that PRETZELcan improve performance compared to Black-box systems (e.g.,ML.Net)• Decrease latency and memoryfootprint• Increase resource utilization andthroughput

• Alot of external optimizations used by Cipper are orthogonal to PRETZEL

Thank you!

Questions?

Documents

PRETZEL:Openingthe BlackBoxof Machine Learning Prediction ...pages.cs.wisc.edu/~shivaram/cs744-slides/cs744-pretzel-qinyuan.pdf · Machine Learning Prediction Serving 1.Models are