PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Presented by Qinyuan Sun
Slides are modified from first author Yunseong Lee’s slides
Outline
• Prediction Serving Systems
• Limitations of Black Box Approaches
• PRETZEL: White-box Prediction Serving System
• Evaluation
• Conclusion
Machine Learning Prediction Serving
1. Models are learned from data
2. Models are deployed and served together
[Figure: Data → Training (learn) → Model → (deploy) → Prediction serving on a server → Users]
Performance goals:
1) Low latency
2) High throughput
3) Minimal resource usage
ML Prediction Serving Systems: State-of-the-art
(Clipper, TF Serving, ML.Net)
• Assumption: models are black boxes
• Re-use the same code as in the training phase
• Encapsulate all operations into a function call (e.g., predict())
• Apply external optimizations: result caching, ensembles, replication, request batching
[Figure: a prediction serving system hosting black-box models, e.g., text analysis ("Pretzel is tasty") and image recognition (cat, car)]
How Do Models Look inside Boxes?
<Example: Sentiment Analysis>
Input: "Pretzel is tasty" (text)
Output: positive vs. negative
A model is a DAG of operators: featurizers followed by a predictor.
[Figure: Tokenizer → (Char Ngram, Word Ngram) → Concat → Logistic Regression → positive vs. negative]
• Tokenizer: split text into tokens
• Char Ngram, Word Ngram: extract N-grams
• Concat: merge two vectors
• Logistic Regression: compute final score
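The operator DAG described above can be sketched in a few lines of Python. This is only an illustration of the structure (tokenizer, two N-gram featurizers, concat, logistic regression); the function names, the N-gram sizes, and the weights are assumptions chosen for the example, not ML.Net or PRETZEL code.

```python
import math
from collections import Counter

def tokenize(text):
    # Split text into lowercase tokens on whitespace.
    return text.lower().split()

def char_ngrams(tokens, n=3):
    # Count character n-grams inside each token.
    grams = Counter()
    for tok in tokens:
        for i in range(len(tok) - n + 1):
            grams["c:" + tok[i:i + n]] += 1
    return grams

def word_ngrams(tokens, n=2):
    # Count word n-grams over the token sequence.
    grams = Counter()
    for i in range(len(tokens) - n + 1):
        grams["w:" + " ".join(tokens[i:i + n])] += 1
    return grams

def concat(a, b):
    # Merge two sparse vectors; the "c:"/"w:" prefixes keep keys disjoint.
    merged = Counter(a)
    merged.update(b)
    return merged

def logistic_regression(features, weights, bias=0.0):
    # Final score: sigmoid of the weighted feature sum.
    z = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def predict(text, weights):
    tokens = tokenize(text)
    features = concat(char_ngrams(tokens), word_ngrams(tokens))
    return logistic_regression(features, weights)

# Hypothetical weights, for illustration only.
WEIGHTS = {"c:tas": 1.5, "w:is tasty": 2.0, "c:bad": -2.0}
score = predict("Pretzel is tasty", WEIGHTS)
print("positive" if score > 0.5 else "negative")
```

A black-box system would only see `predict`; a white-box system sees each of the five operators and the edges between them.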
Many Models Have Similar Structures
• Many parts of a model can be re-used in other models
• Customer personalization, templates, transfer learning
• Identical set of operators with different parameters
Limitation 1: Resource Waste
• Resources are isolated across black boxes
1. Unable to share memory space
→ Memory is wasted maintaining duplicate objects (despite similarities between models)
2. No coordination for CPU resources between boxes
→ Serving many models can use too many threads per machine
Limitation 2: No Consideration of Operators' Characteristics
1. Operators have different performance characteristics
• Concat materializes a vector
• LogReg takes only 0.3% of the latency (contrary to the training phase)
2. There can be a better plan if such characteristics are considered
• Re-use the existing vectors
• Apply in-place updates in LogReg
Latency breakdown (pipeline: Tokenizer → Char Ngram / Word Ngram → Concat → LogReg):
  Operator     Share of latency
  CharNgram    23.1%
  WordNgram    34.2%
  Concat       32.7%
  LogReg        0.3%
  Others        9.6%
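To make the "better plan" concrete: a linear predictor applied to a concatenated vector gives the same score as the sum of the partial dot products over the two input vectors, so the concatenated vector never has to be materialized. The sketch below illustrates that equivalence under the assumption that the two feature vectors use disjoint keys; all names are hypothetical.

```python
def dot(weights, features):
    # Sparse dot product between a weight table and a feature vector.
    return sum(weights.get(k, 0.0) * v for k, v in features.items())

def score_with_concat(char_feats, word_feats, weights, bias):
    # Baseline plan: materialize the concatenated vector, then score it.
    merged = dict(char_feats)
    merged.update(word_feats)  # allocates and fills a new vector
    return bias + dot(weights, merged)

def score_without_concat(char_feats, word_feats, weights, bias):
    # Alternative plan: accumulate partial scores in place, no Concat.
    acc = bias
    acc += dot(weights, char_feats)
    acc += dot(weights, word_feats)
    return acc

# Hypothetical inputs, for illustration only.
char_feats = {"c:tas": 1.0, "c:sty": 1.0}
word_feats = {"w:is tasty": 1.0}
weights = {"c:tas": 0.5, "w:is tasty": 2.0}

a = score_with_concat(char_feats, word_feats, weights, bias=0.25)
b = score_without_concat(char_feats, word_feats, weights, bias=0.25)
print(a, b)
```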
Limitation 3: Lazy Initialization
• ML.Net initializes code and memory lazily (efficient in the training phase)
• Experiment: run 250 Sentiment Analysis models 100 times each
→ cold: first execution / hot: average of the remaining 99
• Long-tail latency in the cold case
• Caused by code analysis, just-in-time (JIT) compilation, memory allocation, etc.
• Difficult to provide a strong Service-Level Agreement (SLA)
[Figure: cold vs. hot latency for the pipeline Tokenizer → Char Ngram / Word Ngram → Concat → LogReg, annotated 13x and 444x]
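The cold/hot methodology above (first execution vs. the average of the remaining runs) can be reproduced with a small timing harness. The sketch below uses a stand-in `LazyModel` that builds a lookup table on first use; the class and its table size are assumptions made for the example and do not model ML.Net's actual initialization.

```python
import time

class LazyModel:
    """Stand-in model that initializes its state on the first prediction."""

    def __init__(self):
        self._table = None  # built lazily on first use

    def predict(self, x):
        if self._table is None:
            # One-time initialization cost paid by the first request.
            self._table = {i: i * i for i in range(200_000)}
        return self._table.get(x, -1)

def cold_and_hot_latency(model, runs=100):
    # cold = first execution, hot = average of the remaining runs.
    samples = []
    for i in range(runs):
        start = time.perf_counter()
        model.predict(i)
        samples.append(time.perf_counter() - start)
    cold = samples[0]
    hot = sum(samples[1:]) / (runs - 1)
    return cold, hot

model = LazyModel()
cold, hot = cold_and_hot_latency(model)
print(f"cold: {cold * 1e3:.3f} ms, hot: {hot * 1e3:.6f} ms")
```

The exact numbers depend on the machine; the point is only that the first request pays for the initialization.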
PRETZEL: White-box Prediction Serving
• We analyze models to optimize the internal execution
• We let models co-exist on the same runtime, sharing computation and memory resources
• We optimize models in two directions:
1. End-to-end optimizations
2. Multi-model optimizations
End-to-End Optimizations
Optimize the execution of individual models from start to end
1. [Ahead-of-time Compilation] Compile operators' code in advance → No JIT overhead
2. [Vector Pooling] Pre-allocate data structures → No memory allocation on the data path
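Vector pooling can be illustrated with a minimal pool that allocates all buffers up front and recycles them between requests. This is a sketch of the general technique only; the class name and its interface are assumptions, not PRETZEL's implementation.

```python
class VectorPool:
    """Pre-allocates fixed-size vectors so requests never allocate."""

    def __init__(self, count, size):
        # All allocation happens here, before any request is served.
        self._free = [[0.0] * size for _ in range(count)]

    def acquire(self):
        # Hand out a pre-allocated vector (IndexError if the pool is empty).
        return self._free.pop()

    def release(self, vec):
        # Reset in place and return the vector to the pool for reuse.
        for i in range(len(vec)):
            vec[i] = 0.0
        self._free.append(vec)

pool = VectorPool(count=2, size=4)
v = pool.acquire()
v[0] = 1.0          # used as scratch space during a prediction
pool.release(v)
w = pool.acquire()  # the same buffer comes back, already zeroed
print(w is v, w)
```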
Multi-model Optimizations
Share computation and memory across models
1. [Object Store] Share operators' parameters/weights → Maintain only one copy
2. [Sub-plan Materialization] Reuse intermediate results computed by other models → Save computation
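Both multi-model optimizations amount to keyed sharing. The sketch below shows one possible shape: an object store that keeps a single copy of identical parameters (keyed by a content hash), and a cache that reuses the output of a shared sub-plan for the same input. The class names, the hashing scheme, and the cache key are assumptions made for illustration.

```python
import hashlib
import json

class ObjectStore:
    """Keeps one copy of each distinct parameter object."""

    def __init__(self):
        self._objects = {}

    def intern(self, params):
        # Identical content maps to the same key, so only one copy is kept.
        blob = json.dumps(params, sort_keys=True).encode()
        key = hashlib.sha256(blob).hexdigest()
        return self._objects.setdefault(key, params)

    def __len__(self):
        return len(self._objects)

class SubPlanCache:
    """Reuses results of a shared sub-plan for identical inputs."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def run(self, stage_key, fn, value):
        key = (stage_key, value)
        if key in self._results:
            self.hits += 1
        else:
            self._results[key] = fn(value)
        return self._results[key]

store = ObjectStore()
p1 = store.intern({"ngram_length": 3, "dictionary": ["pre", "tas"]})
p2 = store.intern({"dictionary": ["pre", "tas"], "ngram_length": 3})

cache = SubPlanCache()
tokenizer = lambda s: s.lower().split()
r1 = cache.run("Tokenize", tokenizer, "Pretzel is tasty")  # computed
r2 = cache.run("Tokenize", tokenizer, "Pretzel is tasty")  # reused
print(p1 is p2, len(store), cache.hits)
```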
System Components
1. Flour: Intermediate Representation
   var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();
2. Oven: Compiler/Optimizer
3. Runtime: Execute inference queries (contains the Object Store and the Scheduler)
4. FrontEnd: Handle user requests
Prediction Serving with PRETZEL
1. Offline
• Analyze structural information of models
• Build ModelPlan for optimal execution
• Register ModelPlan to Runtime
2. Online
• Handle prediction requests
• Coordinate CPU & memory resources
[Figure: Model → Analyze → Model Plan → Register → Runtime; FrontEnd in front of the Runtime]
System Design: Offline Phase
1. Translate Model into Flour Program
<Model>
Tokenizer → (Char Ngram, Word Ngram) → Concat → Log Reg
<Flour Program>
    // Read the input text and split it into tokens.
    var fContext = new FlourContext(...);
    var tTokenizer = fContext.CSV
                             .FromText(fields, fieldsType, sep)
                             .Tokenize();
    // Both featurizers consume the same tokenizer output.
    var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
    var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
    // Merge the two feature vectors and apply the linear classifier.
    var fPrgrm = tCNgram
                 .Concat(tWNgram)
                 .ClassifierBinaryLinear(cParams);
    return fPrgrm.Plan();
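The fluent style of the Flour program builds a DAG rather than executing anything: each call returns a new node that remembers its inputs. The Python sketch below mimics that idea with a hypothetical `Node` class; it is not the Flour API, and `plan()` here simply returns the operators in topological order.

```python
class Node:
    """One operator in a logical DAG; method calls add downstream nodes."""

    def __init__(self, op, inputs=(), **params):
        self.op = op
        self.inputs = tuple(inputs)
        self.params = params

    def tokenize(self):
        return Node("Tokenize", [self])

    def char_ngram(self, n):
        return Node("CharNgram", [self], n=n)

    def word_ngram(self, n):
        return Node("WordNgram", [self], n=n)

    def concat(self, other):
        return Node("Concat", [self, other])

    def classifier_binary_linear(self, weights):
        return Node("ClassifierBinaryLinear", [self], weights=weights)

    def plan(self):
        # Depth-first walk that lists every operator once, inputs first.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node.op)

        visit(self)
        return order

tokens = Node("FromText").tokenize()
c_ngram = tokens.char_ngram(3)
w_ngram = tokens.word_ngram(2)
program = c_ngram.concat(w_ngram).classifier_binary_linear({})
print(program.plan())
```

Because the tokenizer node is shared by both N-gram nodes, it appears only once in the resulting order.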
2. Oven optimizer/compiler builds ModelPlan
<Flour Program> → Rule-based optimizer → <Model Plan>
Rule-based optimizer:
• Push linear predictor & remove Concat
• Group ops into stages (Stage 1, Stage 2)
<Model Plan>
• Logical DAG: stages S1 and S2
• Parameters: e.g., dictionary, N-gram length
• Statistics: e.g., dense vs. sparse, maximum vector size
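A model plan of this shape (stages plus parameters and statistics) can be represented with a small record type. The sketch below also shows one simplified way to group operators into stages: cut after any operator flagged as a stage boundary. The boundary rule, the operator names after rewriting, and the field values are assumptions for illustration and are not Oven's actual rule set.

```python
from dataclasses import dataclass, field

@dataclass
class ModelPlan:
    stages: list                                     # logical DAG, grouped into stages
    parameters: dict = field(default_factory=dict)   # e.g. dictionary, N-gram length
    statistics: dict = field(default_factory=dict)   # e.g. sparsity, max vector size

def group_into_stages(ops, boundaries):
    # Walk the operators in order and close a stage after each boundary op.
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op in boundaries:
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

# Hypothetical operator sequence after Concat has been removed.
ops = ["Tokenize", "CharNgram", "WordNgram", "LinearSum", "Sigmoid"]
plan = ModelPlan(
    stages=group_into_stages(ops, boundaries={"WordNgram"}),
    parameters={"ngram_length": 3},
    statistics={"sparse": True, "max_vector_size": 1024},
)
print(plan.stages)
```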
3. ModelPlan is registered to Runtime
Input: <Model Plan> (Logical DAG, Parameters, Statistics)
1. Store parameters & mapping between logical stages in the Object Store
2. Find the most efficient physical implementation using params & stats (e.g., N-gram length 1 vs. 3, sparse vs. dense)
3. Register selected physical stages to the Catalog
[Figure: Model1's logical stages S1 and S2 mapped to physical stages S1 and S2 inside the Runtime, next to the Object Store and the Catalog]
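Steps 2 and 3 can be pictured as a lookup: statistics pick a physical implementation, and a catalog makes sure models that pick the same implementation share it. The sketch below assumes a single hypothetical statistic (`density`), a 0.5 threshold, and two dot-product implementations; none of these come from the slides.

```python
def dot_dense(weights, vec):
    # vec is a plain list aligned with weights.
    return sum(w * x for w, x in zip(weights, vec))

def dot_sparse(weights, vec):
    # vec maps index -> value; only non-zero entries are stored.
    return sum(weights[i] * x for i, x in vec.items())

IMPLS = {("Linear", "dense"): dot_dense, ("Linear", "sparse"): dot_sparse}

CATALOG = {}        # physical-stage signature -> shared implementation
MODEL_STAGES = {}   # (model id, logical stage id) -> signature

def register(model_id, stage_id, op, stats, threshold=0.5):
    # Choose a physical implementation from the statistics, then share it.
    layout = "dense" if stats["density"] >= threshold else "sparse"
    signature = (op, layout)
    CATALOG.setdefault(signature, IMPLS[signature])
    MODEL_STAGES[(model_id, stage_id)] = signature
    return signature

sig1 = register("Model1", "S2", "Linear", {"density": 0.01})
sig2 = register("Model2", "S2'", "Linear", {"density": 0.02})
result = CATALOG[sig1]([0.5, 2.0, 1.0], {0: 2.0, 2: 1.0})
print(sig1, len(CATALOG), result)
```

Two models register here, but the catalog holds a single physical stage because both resolve to the same signature.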
System Design: Online Phase
1. A prediction request arrives: <Model1, "Pretzel is tasty">
2. Instantiate physical stages along with parameters (from the Object Store)
3. Execute stages using thread pools, managed by the Scheduler
4. Send the result back to the client
[Figure: Runtime holding the logical stages of Model1 (S1, S2) and Model2 (S1', S2'), the physical stages, and the Object Store]
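The four online steps map naturally onto a shared thread pool: a request names a model, the runtime looks up its stages, runs them on a pooled worker, and returns the result. The sketch below uses Python's standard `ThreadPoolExecutor`; the `Runtime` class, its methods, and the toy stages are assumptions and stand in for PRETZEL's Scheduler only loosely.

```python
from concurrent.futures import ThreadPoolExecutor

class Runtime:
    """Runs registered model stages on one shared thread pool."""

    def __init__(self, workers=4):
        self._plans = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def register(self, model_id, stages):
        # Offline step: remember the ordered stages for this model.
        self._plans[model_id] = list(stages)

    def _execute(self, model_id, payload):
        # Run each stage in order, feeding the output to the next one.
        value = payload
        for stage in self._plans[model_id]:
            value = stage(value)
        return value

    def predict(self, model_id, payload):
        # Online step: schedule the request and return a future.
        return self._pool.submit(self._execute, model_id, payload)

    def shutdown(self):
        self._pool.shutdown()

runtime = Runtime()
# Toy stages: lowercase, tokenize, count tokens.
runtime.register("Model1", [str.lower, str.split, len])
answer = runtime.predict("Model1", "Pretzel is tasty").result()
runtime.shutdown()
print(answer)
```

All models share the same pool, so the number of threads is bounded by `workers` no matter how many models are registered.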
Evaluation
• Q. How does PRETZEL improve performance over black-box approaches?
  • In terms of latency, memory, and throughput
• 500 models from the Microsoft Machine Learning Team
  • 250 Sentiment Analysis (memory-bound)
  • 250 Attendee Count (compute-bound)
• System configuration
  • 16-core CPU, 32GB RAM
  • Windows 10, .NET Core 2.0
Evaluation: Latency
• Micro-benchmark (no server-client communication)
• Score 250 Sentiment Analysis models 100 times each
• Compare ML.Net vs. PRETZEL
[Figure: CDF (%) of latency (ms, log-scaled) for ML.Net (hot), ML.Net (cold), PRETZEL (hot), and PRETZEL (cold)]
  Latency (ms)    ML.Net    PRETZEL    Improvement
  P99 (hot)       0.6       0.2        3x
  P99 (cold)      8.1       0.8        10x
  Worst (cold)    280.2     6.2        45x
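The improvement factors follow directly from the reported latencies. The short sketch below recomputes them and also shows a simple nearest-rank percentile, which is one common way to obtain a P99 value from raw samples; whether the original evaluation used this exact percentile definition is an assumption.

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: smallest value covering p% of the samples.
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Latencies in ms as reported on the slide: (ML.Net, PRETZEL).
LATENCY_MS = {
    "P99 (hot)": (0.6, 0.2),
    "P99 (cold)": (8.1, 0.8),
    "Worst (cold)": (280.2, 6.2),
}

speedups = {name: ml / pz for name, (ml, pz) in LATENCY_MS.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```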
Evaluation: Memory
• Measure cumulative memory usage after loading 250 models
• Attendee Count models (smaller than Sentiment Analysis)
• 4 settings for comparison
  Settings                       Shared Objects    Shared Runtime
  ML.Net + Clipper
  ML.Net                                           ✓
  PRETZEL without ObjectStore                      ✓
  PRETZEL                        ✓                 ✓
[Figure: cumulative memory usage (log-scaled, 10MB to 32GB) vs. number of pipelines (0 to 250) for the four settings; annotated values 9.7GB, 3.7GB, 2.9GB, and 164MB, with 25x and 62x improvements]
Evaluation: Throughput
• Micro-benchmark
• Score 250 Attendee Count models 1000 times each
• Request 1000 queries in a batch
• Compare ML.Net vs. PRETZEL
[Figure: throughput (K QPS) vs. number of CPU cores (1, 2, 4, 8, 13), with ideal-scaling lines; annotated 10x and "close to ideal scalability"]
More results in the paper!
Conclusion
• PRETZEL is the first white-box prediction serving system for ML pipelines
• By using models' structural info, we enable two types of optimizations:
  • End-to-end optimizations generate efficient execution plans for a model
  • Multi-model optimizations let models share computation and memory resources
• Our evaluation shows that PRETZEL can improve performance compared to black-box systems (e.g., ML.Net)
  • Decrease latency and memory footprint
  • Increase resource utilization and throughput
• Many of the external optimizations used by Clipper are orthogonal to PRETZEL
Thank you!
Questions?