DAWN: Infrastructure for Usable Machine Learning
Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia
It’s the Golden Age of Data
Incredible advances in image recognition, natural language processing, planning, and information retrieval
Society-scale impact: autonomous vehicles, personalized medicine, combating human trafficking
No end in sight for advances in ML*
*for the best-funded, best-trained engineering teams
Building ML Products is Too Hard
Major successes (e.g., AlphaGo, ImageNet) require hundreds to thousands of engineers
Huge effort in data preparation, model tuning, experimentation, and productionizing
Domain experts cannot easily or cheaply build ML products
“Only a fraction of real-world ML systems is composed of ML code” (Sculley et al., 2015)
The DAWN Question
What if anyone with domain expertise could build their own production-quality ML products?
• Without a PhD in machine learning
• Without being an expert in systems
• Without understanding the latest hardware
It’s happened before
It’s happened before: Search
Before: decades of research on information retrieval, indexes, ranking, etc.
After: any developer can add search to an application by linking a library (e.g., Solr, Lucene); everyone (i.e., non-expert users) uses search
It’s happened before: SQL
Before: raw access to disk, manual layout of records, network databases (CODASYL)
After: SQL forms the basis for transactional engines, data warehousing, and business intelligence tools
Key idea: end-to-end systems that tackle the barriers to access & production use
The DAWN Stack
Stages: Data Acquisition, Feature Engineering, Model Training, Productionizing
Layers: Interfaces, Algorithms, Systems, Hardware
Projects: Snorkel, DeepDive, MacroBase (Streaming Data), NoScope (Video), AutoRec, SimDex (Recommendation), Data Fusion, Mulligan (SQL+graph+ML), ModelQA, ModelSnap, …
End-to-End Compilers: Weld, Delite
Hardware targets: CPU, GPU, FPGA, Cluster, Mobile
New Hardware: FuzzyBit, Plasticine CGRA
Example: MacroBase for Continuous Analytics
End-to-end system to prioritize user attention
[Diagram: multi-dimensional data streams → MacroBase → anomalies & explanations]
Too much data for manual inspection; even harder when data is streaming
MacroBase Summary
End-to-end system to prioritize user attention
• No ML expertise needed: MacroBase uses general models and tunes them automatically
• No separate step for production use
• Co-design from algorithms to HW
Open source: github.com/stanford-futuredata/macrobase
Early users: automotive, cloud, mobile apps, manufacturing
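The slides above describe MacroBase's pipeline only at a high level. As a rough illustration (not MacroBase's actual code, which uses more sophisticated classifiers and explanation algorithms), the core loop can be sketched as: flag outlying points in a metric stream with a robust median/MAD score, then "explain" them by finding attribute values over-represented among outliers relative to inliers. All names below (`mad_outliers`, `explain`, the toy stream) are hypothetical.

```python
# Sketch of a MacroBase-style classify-then-explain loop, assuming
# each point is {"metric": float, "attrs": {key: value}}.
from statistics import median
from collections import Counter

def mad_outliers(points, threshold=3.0):
    """Indices of points whose |x - median| / MAD exceeds threshold."""
    values = [p["metric"] for p in points]
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

def explain(points, outlier_idx, min_ratio=3.0):
    """Attribute (key, value) pairs whose prevalence among outliers
    vs. inliers exceeds min_ratio (a simplified risk ratio)."""
    out = set(outlier_idx)
    out_counts, in_counts = Counter(), Counter()
    for i, p in enumerate(points):
        target = out_counts if i in out else in_counts
        for kv in p["attrs"].items():
            target[kv] += 1
    n_out, n_in = len(out), len(points) - len(out)
    results = []
    for kv, c in out_counts.items():
        out_rate = c / n_out
        in_rate = in_counts[kv] / n_in if n_in else 0.0
        ratio = out_rate / in_rate if in_rate else float("inf")
        if ratio >= min_ratio:
            results.append((kv, ratio))
    return results

# Toy stream: one app version produces anomalous latencies.
stream = [{"metric": 10 + (i % 3), "attrs": {"version": "v1"}} for i in range(50)]
stream += [{"metric": 500 + i, "attrs": {"version": "v2"}} for i in range(5)]
print(explain(stream, mad_outliers(stream)))  # flags version=v2
```

The point of the end-to-end design is that the user supplies only the stream and attributes; thresholds and models are tuned automatically.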
Weld: Rethinking the Interface to Data Analytics Libraries
Standard approach: users combine libraries using function calls that pass data via memory
Problem: for data-intensive apps, data movement cost dominates on modern hardware!
5-30x slowdowns in NumPy, Spark, TensorFlow, …
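To make the data-movement problem concrete, here is an illustrative sketch (not Weld's actual API): composing library calls eagerly materializes every intermediate collection, so the data makes several round trips through memory, while a lazy IR lets a runtime fuse the pipeline into a single pass. The function names are hypothetical.

```python
# Why fusing library calls helps: eager composition does three passes
# over the data with two materialized intermediates; the fused version
# does one pass with none. Same result, very different memory traffic.

def eager_pipeline(xs):
    # Each "library call" scans the data and writes a full intermediate.
    squared = [x * x for x in xs]        # pass 1: materialize len(xs) values
    shifted = [x + 1 for x in squared]   # pass 2: read and write again
    return sum(shifted)                  # pass 3: read again

def fused_pipeline(xs):
    # What a Weld-style fused program computes: one pass, no intermediates.
    total = 0
    for x in xs:
        total += x * x + 1
    return total

data = list(range(100_000))
assert eager_pipeline(data) == fused_pipeline(data)
```

On real hardware the gap comes from memory bandwidth and cache behavior rather than Python overhead, but the structure of the win is the same.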
[Diagram: today, calls like func1 → func2 → … pass data through memory]
Weld’s Approach
[Diagram: diverse analytics tasks (SQL, machine learning, graph algorithms) target a common Weld IR and runtime, which compiles to diverse hardware platforms (CPUs, GPUs, FPGAs)]
Open source: weld.stanford.edu
Workloads: TPC-H, Logistic Regression, Vector Sum
Results: Existing Frameworks
[Charts: runtime comparisons]
• TPC-H Q1 and Q6: Spark SQL vs. Weld (runtime in seconds)
• Vector sum: NumPy/NumExpr vs. Weld (runtime in seconds)
• Logistic regression, 1 core and 12 cores: TensorFlow vs. hand-optimized vs. Weld (runtime in seconds, log scale)
Integration effort: 500 lines of glue code, 30 lines per operator
Results: Cross-Library Optimization
0.01
0.1
1
10
100
Runt
ime
(sec,
log1
0)
CurrentWeld, no CLOWeld, CLOWeld, 12 core
Pandas + NumPy
290x
31x
0.0
0.5
1.0
1.5
2.0
Runt
ime
(sec)
Scala UDFWeld
Spark SQL UDF
14x
CLO = cross-library optimizationOpen source: weld.stanford.edu
NoScope: Fast CNN-Based Video Queries
Opportunity: CNNs allow more accurate queries on visual data than ever
Challenge: processing 1 video in real time requires a $1000 GPU
Result: same accuracy but 100-3000x faster through:
• Scene-specific distillation
• Temporal + spatial locality
bit.ly/NoScopeArxiv
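The two bullets above describe a cascade. A minimal sketch of the idea (not NoScope's real code; the models here are trivial stand-ins and every name is hypothetical): a cheap frame-difference filter exploits temporal locality by reusing the previous label for near-identical frames, a cheap scene-specific model handles confident cases, and the expensive reference CNN runs only when the cheap model is unsure.

```python
# NoScope-style inference cascade over a "video" of toy frames (numbers).
def difference_filter(prev, frame, tau=5):
    """Temporal locality: skip frames nearly identical to the last one."""
    return abs(frame - prev) < tau  # True = reuse previous label

def cheap_model(frame):
    """Stand-in for a small distilled, scene-specific CNN.
    Returns (label, confidence in [0, 1])."""
    score = 1.0 if frame > 100 else 0.0
    return score > 0.5, abs(score - 0.5) * 2

def expensive_model(frame):
    """Stand-in for the full reference CNN (e.g., an object detector)."""
    return frame > 100

def query(frames, conf_threshold=0.9):
    labels, expensive_calls = [], 0
    prev_frame, prev_label = None, False
    for f in frames:
        if prev_frame is not None and difference_filter(prev_frame, f):
            labels.append(prev_label)          # reuse: no model runs at all
        else:
            label, conf = cheap_model(f)
            if conf < conf_threshold:          # fall back to reference CNN
                label = expensive_model(f)
                expensive_calls += 1
            labels.append(label)
        prev_frame, prev_label = f, labels[-1]
    return labels, expensive_calls

frames = [10] * 100 + [200] * 100   # mostly-static toy video
labels, calls = query(frames)
print(sum(labels), calls)            # prints "100 0"
```

Because real surveillance-style video is mostly static, the reference CNN ends up running on a small fraction of frames, which is where the 100-3000x speedups come from.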
Training data is a key enabler, and a barrier to entry
How can we leverage data that’s expensive to label at scale?
github.com/HazyResearch/snorkel
Snorkel’s Approach: Weak Supervision
1) User writes labeling functions: short programs that may not always give the right label
• E.g., a regex to search in text
2) Snorkel simultaneously learns the noise in the LFs and a noise-aware target model (e.g., an LSTM)
4 hours of LF coding with bio experts matched months of hand-labeling
High-quality models from low-quality, scalable labeling functions
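As a concrete illustration of step 1, here is a sketch of weak supervision with regex labeling functions (not Snorkel's real API; in Snorkel a generative model learns each LF's accuracy and correlations, while a plain majority vote stands in for that here). The LF names and toy sentences are hypothetical.

```python
# Noisy labeling functions for a toy chemical-disease relation task,
# combined by majority vote into training labels.
import re

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_causes(text):
    # "X causes Y" suggests a positive chemical-disease relation.
    return POSITIVE if re.search(r"\bcauses?\b", text) else ABSTAIN

def lf_treats(text):
    # "X treats Y" suggests the opposite.
    return NEGATIVE if re.search(r"\btreats?\b", text) else ABSTAIN

def lf_negation(text):
    return NEGATIVE if re.search(r"\bno evidence\b", text) else ABSTAIN

LFS = [lf_causes, lf_treats, lf_negation]

def weak_label(text):
    """Majority vote over non-abstaining LFs; ABSTAIN on no votes or tie.
    (Snorkel instead learns LF accuracies and weights votes accordingly.)"""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if not votes or pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE

print(weak_label("Aspirin causes gastric ulcers"))  # 1 (POSITIVE)
print(weak_label("Aspirin treats headache"))        # 0 (NEGATIVE)
```

The resulting noisy labels then train the noise-aware end model (e.g., the LSTM in the table below), which can generalize beyond what the regexes cover.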
System                         NCBI Disease (F1)   CDR Disease (F1)   CDR Chem. (F1)
TaggerOne (Dogan, 2012)*       81.5                79.6               88.4
Snorkel: Logistic Regression   79.1                79.6               88.4
Snorkel: LSTM + Embeddings     79.2                80.4               88.2
DAWN: machine learning for everyone via novel techniques and interfaces that span hardware, systems, and algorithms
Find out more at dawn.cs.stanford.edu
Peter Bailis Chris Ré Kunle Olukotun Matei Zaharia