
Page 1

Hazy: Making Statistical Applications Easier to Build and Maintain

Christopher Ré, BigLearn

Collaborators listed throughout

Page 2

Big data is the future.

Big data is great for vendors and consulting $$$, but is ‘Big’ the heart of the problem?

Page 3

How big is `big'?

Is it a GB? That fits on your phone… but LeCun & Farabet's work has the same spirit.

Is it a TB? A TB is big on Hadoop… but a TB fits in main memory in many apps.

Is it a PB? Do you have that problem?

NB: I love Hadoop, Y! and Cloudera.

“Let’s end Peta-philia” – John Doyle

Maybe something other than size is the common thread?

Page 4

Big Shifts Underlying Big Data

(1) More signal (data) beats a more complex model nine times out of ten -- this is why we acquire big data.

(2) Move computation to the data – not data to the computation.

My bet: The key is the ability to quickly sling signal together – where the data live.

Page 5

Hazy: Making Statistical Applications Easier to Build and Maintain

Page 6

Two Trends that Drive Hazy

1. Data in an unprecedented number of formats

2. An arms race for deeper understanding of data

Statistical tools attack both 1 and 2.

Hazy = statistical + data management techniques

Hazy's Goal: Understand common patterns in deploying & maintaining statistical tools on data.

Page 7

Hazy’s Thesis

The next breakthrough in data analysis may not be a new data analysis algorithm…

…but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.

Page 8

Outline

Three Application Areas for Hazy

Victor/Bismarck (in Parallel RDBMS)

Hogwild! (Shared Memory)

Other Hazy Projects

Page 9

Data is constantly generated on the Web: Twitter, blogs, and Facebook.

Extract and classify sentiment about products, ad campaigns, and customer-facing entities.

Often-deployed statistical tools: extraction (CRFs) & classification (SVMs).

DARPA Machine Reading: “List members of the Brazilian Olympic Team in this corpus with years of membership”

Page 10

Simplified View of MR Data Flow

[Diagram: ClueWeb, Freebase, and a news corpus (500M docs, or 15TB) feed the Felix infrastructure, which performs entity linking (producing mention and entity features) and slot filling (producing pairs of entity features).]

Goals: High quality. Low effort. Easy to test.

Page 11

Two Key Features of Our System, Felix

1. Felix allows many best-in-breed algorithms to work jointly together – extraction (CRF), classification (SVM).

2. A simple SQL-like rule language (Markov Logic) – Felix allows us to leverage Wikipedia, Freebase, and ClueWeb in one high-level language (TBs of data).

Key: Marry simple robust statistical tools with flexible rules – and be able to scale to TBs.

TAC-KBP: F1 = 39 vs. state-of-the-art systems at F1 = 20.

Feng Niu, Ce Zhang, Josh Slauson: www.cs.wisc.edu/hazy/felix

Page 12

Rough Demo!

Page 13

A physicist interpolates sensor readings and uses regression to understand the data more deeply.

Page 14

Digital Optical Module (DOM)

Page 15

Workflow of IceCube

In ice: Detection occurs.

At the Pole: An algorithm says "Interesting!"

Via satellite: Interesting DOM readings are sent to Madison.

In Madison: Lots of data analysis.

Page 16

A Key Step: Detecting Tracks

The mathematical structure used to track neutrinos is similar to that used for labeling text.

Mark's code goes to the South Pole in 2012.

Mark Wellons, Ben Recht, Francis Halzen, and Gary Hill

Rules + Simple Regression

NB: Many data analysts are ex-physicists

(Experts: Both are Convex Programs)

Page 17

OCR and Speech

A social scientist wants to extract the frequency of synonyms of English words in 18th-century texts.

Getting the text is challenging! (statistical model errors)

The output of speech and OCR models is similar to text-labeling models (transducers).

Social scientists @ UW want to unify content management systems with an RDBMS.

Page 18

Application Takeaways

Statistical processing on large data enables a wide variety of new applications.

An algorithm (iGMs) that solves these models close to the data in (1) an RDBMS & (2) shared memory.

Goal: Develop the ability to rapidly combine, deploy, and maintain existing algorithms.

Page 19

Outline

Three Application Areas for Hazy

Victor/Bismarck (in Parallel RDBMS)

Hogwild! (Shared Memory)

Other Hazy Projects

Page 20

Classify publications by subject area

Page 21

Example: Linear Models

Label papers as DB papers or non-DB papers.

1. Map each paper to a vector in R^d.

2. Classify via a plane x.

[Figure: papers 1–5 plotted as points, with a plane x separating the DB papers from the non-DB papers.]

Question: How do we pick a good plane (x)?
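As a minimal sketch of step 2 (illustrative code, not from the talk): once papers are mapped to vectors, classification is just checking which side of the plane a paper falls on. The feature vectors and the plane x below are invented for illustration.

import numpy as np

# Hypothetical feature vectors, one row per paper (step 1: map each paper to R^d).
papers = np.array([
    [0.9, 0.1, 0.3],   # paper 1
    [0.8, 0.2, 0.1],   # paper 2
    [0.1, 0.9, 0.7],   # paper 3
])

# A candidate plane x (its normal vector); in practice x is learned from labeled data.
x = np.array([1.0, -1.0, 0.2])

# Step 2: classify by which side of the plane each paper falls on.
predictions = np.sign(papers @ x)   # +1 => "DB paper", -1 => "non-DB paper"
print(predictions)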

Page 22

Example: Linear Models

Input: Labeled papers. Each point is labeled as DB (+) or not (-).

[Figure: labeled + and - points around a plane x; one side of x says "DB papers", the other "non-DB".]

Idea: Let's score each plane x.

Idea: Minimize misclassification, where yi is a paper vector together with its label.

Instead: change f to a smooth function of the distance to the plane, e.g., the squared distance (least squares), the hinge loss (SVM), or the log loss (logistic regression). Different f = different model.
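To make "different f = different model" concrete, here is a small illustrative sketch (not code from the talk) of the three losses named above, each a function of the margin, i.e., the label times the signed distance of a paper to the plane; the toy paper and plane are invented:

import numpy as np

def squared_loss(margin):      # least squares: (1 - margin)^2
    return (1.0 - margin) ** 2

def hinge_loss(margin):        # SVM: max(0, 1 - margin)
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):          # logistic regression: log(1 + exp(-margin))
    return np.log1p(np.exp(-margin))

x = np.array([1.0, -1.0, 0.2])                  # a candidate plane (illustrative)
features, label = np.array([0.9, 0.1, 0.3]), +1 # one labeled paper (illustrative)
margin = label * (x @ features)                 # positive if x classifies the paper correctly

for f in (squared_loss, hinge_loss, log_loss):
    print(f.__name__, f(margin))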

Page 23

Framework: Inverse Problems

Find the model x that minimizes sum_i f(x, yi) + P(x), where x is the model, each yi is a data item, f scores the error, and P enforces a prior. (Experts: f and P are convex.)

Examples:
1. Paper classification: yi is a paper with its label
2. Neutrino tracking: yi is a DOM (sensor) reading
3. CRFs: yi is a (document, labeling) pair
4. Netflix: yi is a (user, movie, rating) triple

Claim: This is a general data analysis technique that is no more difficult to compute than a SQL AVG.
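Written out in LaTeX (my reconstruction from the slide's annotations, together with one instantiation using the losses from Page 22; lambda is an assumed regularization parameter):

% Generic inverse problem: fit the model x to the data items y_1, ..., y_N
\min_{x \in \mathbb{R}^d} \; \sum_{i=1}^{N} f(x, y_i) \; + \; P(x)

% SVM instantiation: y_i = (a_i, b_i) with features a_i and label b_i in {-1,+1},
% hinge loss for f and an l2 prior for P
f(x, y_i) = \max\{0,\; 1 - b_i\, a_i^{\top} x\}, \qquad P(x) = \frac{\lambda}{2}\,\lVert x \rVert_2^2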

Page 24

Background: Gradient Methods

Gradient methods are iterative: 1. Start at the current x. 2. Take the gradient at x. 3. Move in the opposite direction.

[Figure: the objective F(x), with a gradient step moving downhill.]

Page 25

Incremental Gradient Methods

Select a single data item to approximate the gradient.

Incremental gradient methods are iterative: 1. Start at the current x. 2. Approximate the gradient at x by selecting a single j in {1…N}. 3. Move in the opposite direction.
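A minimal sketch of the contrast (illustrative code, not from the talk), using the least-squares f from Page 22; the step size alpha and the synthetic data are assumptions:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 3))                   # one feature vector per data item
b = np.sign(A @ np.array([1.0, -1.0, 0.2]))      # labels in {-1, +1}
alpha = 0.01                                     # step size (an assumption)
x = np.zeros(3)                                  # current model

def grad_f(x, a_i, b_i):
    # gradient of the least-squares loss (a_i . x - b_i)^2 for a single data item
    return 2.0 * (a_i @ x - b_i) * a_i

# Full gradient method: one step touches every data item.
full_grad = sum(grad_f(x, A[i], b[i]) for i in range(len(b))) / len(b)
x_full = x - alpha * full_grad

# Incremental gradient method (iGM): one step touches a single item j.
j = rng.integers(len(b))
x_igm = x - alpha * grad_f(x, A[j], b[j])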

Page 26

Incremental Gradient Methods (iGMs)

Why use iGMs? iGMs converge to an optimum for many problems, but the real reason is:

iGMs are fast.

Technical connection: iGM processing is isomorphic to computing a SQL AVG

(x is the accumulator, G is an expression on a single tuple)

Solve statistical models in an RDBMS for free

Page 27

iGMs ≈ User-Defined Aggregates

Most RDBMSs: 3 functions in a UDA
1. Initialize(State)
2. Transition(State, data)
3. Terminate(State)

AVG: State in R^2 is (# terms, running total)
Transition((n, T), d) = (n, T) + (1, d)
Terminate((n, T)) = T/n

Gradient: State in R^d is the model x
Transition(x, yj) = x - G(x, yj)
Terminate(x) = x
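A small Python sketch of this correspondence (illustrative only, not the actual in-RDBMS implementation); the logistic-regression gradient and the step size ALPHA are assumptions, since the slide folds the step size into G:

import numpy as np

ALPHA = 0.1   # step size (an assumption)

# --- AVG as a UDA ---
def avg_init():            return (0, 0.0)              # (# terms, running total)
def avg_transition(s, d):  return (s[0] + 1, s[1] + d)
def avg_terminate(s):      return s[1] / s[0]

# --- an iGM (here: logistic regression) as a UDA; the state is the model x ---
def grad_init(d=3):        return np.zeros(d)
def grad_transition(x, y):                               # y = (features a, label b in {-1,+1})
    a, b = y
    g = -b * a / (1.0 + np.exp(b * (a @ x)))             # gradient of the log loss
    return x - ALPHA * g                                 # x - G(x, yj)
def grad_terminate(x):     return x

# Both "aggregates" are the same fold over the tuples of a table:
def run_uda(init, transition, terminate, tuples):
    state = init()
    for t in tuples:
        state = transition(state, t)
    return terminate(state)

print(run_uda(avg_init, avg_transition, avg_terminate, [1.0, 2.0, 3.0]))
data = [(np.array([0.9, 0.1, 0.3]), +1), (np.array([0.1, 0.9, 0.7]), -1)]
print(run_uda(grad_init, grad_transition, grad_terminate, data))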

Page 28

Some subtle differences…

UDAs are typically commutative and algebraic.

iGMs are formally neither, but are morally both:

Morally algebraic: one can average models [Zinkevich et al 10].

Morally commutative: different orders give different exact results, but any order converges to the same answer.

Key: IGMs work (in parallel) off the shelf in a UDA. (This is how we do Logistic Regression on the Web)

MADLib & Oracle versions coming. Terminology from Gray et al
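A sketch of what "morally algebraic" buys in parallel (illustrative; the partitioning, helper names, and step size are mine, in the spirit of [Zinkevich et al 10]): run the gradient UDA independently over each partition of the data, then merge the partial states by averaging the models.

import numpy as np

ALPHA = 0.1  # step size (an assumption)

def igm_on_partition(partition, d=3):
    # Run the incremental gradient method (log loss) over one partition of the data.
    x = np.zeros(d)
    for a, b in partition:                           # a: features, b: label in {-1, +1}
        x -= ALPHA * (-b * a / (1.0 + np.exp(b * (a @ x))))
    return x

def parallel_igm(partitions):
    # "Morally algebraic": merge the per-partition models by averaging them.
    return np.mean([igm_on_partition(p) for p in partitions], axis=0)

parts = [[(np.array([0.9, 0.1, 0.3]), +1)],
         [(np.array([0.1, 0.9, 0.7]), -1)]]
print(parallel_igm(parts))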

Page 29

Hogwild!

[Niu, Recht, Ré, & Wright NIPS 11] Code on Web.
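The slide only names Hogwild!; as a rough sketch of the idea (illustrative Python, not the authors' code): several threads run incremental gradient updates against one shared model with no locks at all. The step size, data, and thread count are assumptions, and Python's GIL hides the contention a real multicore C implementation would see.

import threading
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 3))                    # features
b = np.sign(A @ np.array([1.0, -1.0, 0.2]))       # labels in {-1, +1}
x = np.zeros(3)                                   # shared model; no lock protects it
ALPHA = 0.01                                      # step size (an assumption)

def worker(seed, n_steps=5000):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        j = local_rng.integers(len(b))            # pick a random data item
        a_j, b_j = A[j], b[j]
        g = -b_j * a_j / (1.0 + np.exp(b_j * (a_j @ x)))
        for k in range(len(x)):                   # Hogwild!-style: update the shared
            x[k] -= ALPHA * g[k]                  # components in place, lock-free

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(x)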

Page 30

Other Hazy Work

• Dealing with data in different formats
  – Staccato: Arun Kumar. Query OCR documents [PODS 10 & VLDB 12]
  – Hazy goes to the South Pole: Mark Wellons. Trigger software for IceCube
• Machine learning algorithms
  – Hogwild!: run SGD on convex problems with no locking [NIPS 11]
  – Jellyfish: faster than Hogwild! on matrix completion [Opt Online 11]
• Maintain supervised techniques on evolving corpora
  – iClassify: M. Levent Koc. Maintain classifiers on evolving examples [VLDB 11]
  – CRFLex: Aaron Feng. Maintain CRFs on evolving corpora [ICDE 12]
• Populate knowledge bases from text
  – Tuffy: SQL + weights to combine multiple models [VLDB 11]
  – Felix: running Markov Logic on millions of documents [Preprint arXiv 11]
  – Feng Niu, Ce Zhang, Josh Slauson, and Adel Ardalan
• Infrastructure with Mike Cafarella (Michigan)
  – Manimal: optimizing MapReduce with relational operations [VLDB 11]

Students who did the real work are named with each project.

Page 31

Conclusion

Many statistical data analyses are no more complicated to compute than a SQL AVG.

The Hogwild! approach to multicore parallelism uses no locking, yet achieves good speedups.

Code, data, and papers: www.cs.wisc.edu/hazy