
VSSML16 L8. Advanced Workflows: Feature Selection, Boosting, Gradient Descent, and Stacking

Page 1

Automating Machine Learning
Advanced WhizzML Workflows

#VSSML16

September 2016

Page 2

Outline

1 Introduction

2 Advanced Workflows

3 A WhizzML Implementation of Best-first Feature Selection

4 Even More Workflows!

5 Stacked Generalization in WhizzML

6 A Brief Look at Gradient Boosting in WhizzML

7 Wrapping Up

Page 3

Section 1: Introduction

Page 4

What Do We Know About WhizzML?

• It's a complete programming language
• Machine learning "operations" are first-class
• Those operations are performed in BigML's backend
  - One line of code to perform API requests
  - We get scale "for free"
• Everything is composable
  - Functions
  - Libraries
  - The web interface

Page 5

What Can We Do With It?

• Non-trivial model selection
  - n-fold cross-validation
  - Comparison of model types (tree, ensemble, logistic regression)
• Automation of drudgery
  - One-click retraining/validation
  - Standardized dataset transformations / cleaning
• Sure, but what else?

Page 6

Section 2: Advanced Workflows

Page 7

Algorithms as Workflows

• Many ML algorithms can be thought of as workflows
• In these algorithms, machine learning operations are the primitives
  - Make a model
  - Make a prediction
  - Evaluate a model
• Many such algorithms can be implemented in WhizzML
  - Reap the advantages of BigML's infrastructure
  - Once implemented, they are language-agnostic

Page 8

Examples: Best-first Feature Selection

Objective: Select the n best features for modeling your data

• Initialize a set S of used features as the empty set
• Split your dataset into training and test sets
• For i in 1 ... n:
  - For each feature f not in S, model and evaluate with feature set S + f
  - Greedily select f̂, the feature with the best performance, and set S ← S + f̂

https://github.com/whizzml/examples/tree/master/best-first

Page 9

Section 3: A WhizzML Implementation of Best-first Feature Selection

Page 10

Modeling

First, construct a bunch of models. selected is the list of features that have already been selected, and potentials is the list of candidates we might select on this iteration.

(define (make-models dataset-id obj-field selected potentials)
  (let (model-req {"dataset" dataset-id "objective_field" obj-field}
        make-req (lambda (fid)
                   (assoc model-req "input_fields" (cons fid selected)))
        all-reqs (map make-req potentials))
    (create-and-wait* "model" all-reqs)))
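A hypothetical call, with made-up resource and field ids: with nothing selected yet, this would create one model per candidate field, each using a single input field.

(make-models "dataset/abc123" "000004"   ;; hypothetical dataset and objective ids
             []                          ;; no features selected yet
             ["000000" "000001" "000002"])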

Page 11

Evaluation

Now, conduct the evaluations. potentials is again the list of potential features to add, and model-ids is the list of corresponding models created in the last step.

(define (select-feature test-dataset-id potentials model-ids)
  (let (eval-req {"dataset" test-dataset-id}
        make-req (lambda (mid) (assoc eval-req "model" mid))
        all-reqs (map make-req model-ids)
        evs (map fetch (create-and-wait* "evaluation" all-reqs))
        vs (map (lambda (ev) (get-in ev ["result" "model" "average_phi"])) evs)
        value-map (make-map potentials vs) ;; e.g., {"000000" 0.8 "000001" 0.7}
        max-val (get-max vs)
        choose-best (lambda (id) (if (= max-val (get value-map id)) id false)))
    ;; some returns the first non-false result: the field id whose
    ;; model evaluated to the maximum average phi
    (some choose-best potentials)))

Page 12

Main Loop

The main loop of the algorithm. Set up your objective id, inputs, and training and test datasets. Initialize the selected features to the empty set and iteratively call the previous two functions.

(define (select-features dataset-id nfeatures)
  (let (obj-id (dataset-get-objective-id dataset-id)
        input-ids (default-inputs dataset-id obj-id)
        splits (split-dataset dataset-id 0.5)
        train-id (nth splits 0)
        test-id (nth splits 1))
    (loop (selected []
           potentials input-ids)
      (if (or (>= (count selected) nfeatures) (empty? potentials))
          (feature-names dataset-id selected)
          ;; build candidate models on the training half, score them
          ;; on the test half
          (let (model-ids (make-models train-id obj-id selected potentials)
                next-feat (select-feature test-id potentials model-ids))
            (recur (cons next-feat selected)
                   (filter (lambda (id) (not (= id next-feat))) potentials)))))))
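A hypothetical top-level call, with a made-up dataset id, asking for the five best features:

(select-features "dataset/abc123" 5)   ;; hypothetical id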

Page 13

Section 4: Even More Workflows!

Page 14

Examples: Stacked Generalization

Objective: Improve predictions by modeling the output scores of multiple trained models.

• Create a training and a holdout set
• Create n different models on the training set (with some difference among them; e.g., single-tree vs. ensemble vs. logistic regression)
• Make predictions from those models on the holdout set
• Train a model to predict the class based on the other models' predictions

Page 15

Examples: Randomized Parameter Optimization

Objective: Find the best set of parameters for a machine learning algorithm

• Do:
  - Generate a random set of parameters for an ML algorithm
  - Do 10-fold cross-validation with those parameters
• Until you get a set of parameters that performs "well" or you get bored
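In WhizzML, this whole search is a short loop around model creation and evaluation. A minimal sketch, not a gallery script: cross-validate is an assumed helper that runs 10-fold cross-validation for one parameter map and returns an average score, and the ensemble parameter ranges are invented for illustration.

;; Minimal sketch of random parameter search. `cross-validate` is an
;; assumed helper that 10-fold cross-validates one parameter map and
;; returns an average score; the parameter ranges are invented.
(define (random-search dataset-id nrounds)
  (loop (round 0
         best {"score" -1 "params" {}})
    (if (>= round nrounds)
        best
        (let (params {"number_of_models" (+ 2 (rand-int 63))
                      "sample_rate" (+ 0.5 (* 0.5 (rand)))}
              score (cross-validate dataset-id params))
          (recur (+ round 1)
                 (if (> score (get best "score"))
                     {"score" score "params" params}
                     best))))))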

Page 16

Examples: SMACdown

Objective: Find the best set of parameters even more quickly!

• Do:
  - Generate several random sets of parameters for an ML algorithm
  - Do 10-fold cross-validation with those parameters
  - Learn a predictive model to predict performance from parameter values
  - Use the model to help you select the next set of parameters to evaluate
• Until you get a set of parameters that performs "well" or you get bored
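The model-guided step reuses the same primitives. A rough sketch, not the gallery script: random-params is an assumed generator of candidate parameter maps, and perf-model is a regression model trained on previously evaluated parameter/score pairs.

;; Score a batch of random candidates with the performance model and
;; return the most promising one. `random-params` and `perf-model`
;; are assumptions, not code from this deck.
(define (next-params perf-model ncandidates)
  (let (score (lambda (ps)
                (head (values (get (fetch (create-prediction
                                           {"model" perf-model
                                            "input_data" ps}))
                              "prediction"))))
        candidates (map (lambda (i) (random-params)) (range ncandidates)))
    (loop (cs (tail candidates)
           best (head candidates)
           best-score (score (head candidates)))
      (if (empty? cs)
          best
          (let (s (score (head cs))
                better? (> s best-score))
            (recur (tail cs)
                   (if better? (head cs) best)
                   (if better? s best-score)))))))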

Coming soon to a WhizzML gallery near you!

Page 17

Examples: Boosting

• General idea: iteratively model the dataset
  - Each iteration is trained on the mistakes of previous iterations
  - Said another way, the objective changes each iteration
  - The final model is a summation of all iterations
• Lots of variations on this theme
  - Adaboost
  - Logitboost
  - Martingale boosting
  - Gradient boosting
• Let's take a look at a WhizzML implementation of the last of these

Page 18

Section 5: Stacked Generalization in WhizzML

Page 19

A stacked generalization library: creating the stack

;; Splits the given dataset, using half of it to create
;; a heterogeneous collection of models and the other
;; half to train a tree that predicts based on those other
;; models' predictions. Returns a map with the collection
;; of models (under the key "models") and the meta-model
;; as the value of the key "metamodel". The key "result"
;; has as value a boolean flag indicating whether the
;; process was successful.
(define (make-stack dataset-id)
  (let (ids (split-dataset-and-wait dataset-id 0.5)
        train-id (nth ids 0)
        hold-id (nth ids 1)
        models (create-stack-models train-id)
        id (create-stack-predictions models hold-id)
        orig-fields (model-inputs (head models))
        obj-id (dataset-get-objective-id train-id)
        meta-id (create-and-wait-model {"dataset" id
                                        "excluded_fields" orig-fields
                                        "objective_field" obj-id})
        success? (resource-done? (fetch meta-id)))
    {"models" models "metamodel" meta-id "result" success?}))

Page 20

A stacked generalization library: using the stack

;; Use the models and metamodel computed by make-stack
;; to make a prediction on the input-data map. Returns
;; the identifier of the prediction object.
(define (make-stack-prediction models meta-model input-data)
  (let (preds (map (lambda (m) (create-prediction {"model" m
                                                   "input_data" input-data}))
                   models)
        preds (map (lambda (p)
                     (head (values (get (fetch p) "prediction"))))
                   preds)
        meta-input (make-map (model-inputs meta-model) preds))
    (create-prediction {"model" meta-model "input_data" meta-input})))

Page 21

A stacked generalization library: auxiliary functions

;; Extract for a batchprediction its associated dataset of results
(define (batch-dataset id)
  (wait-forever (get (fetch id) "output_dataset_resource")))

;; Create a batchprediction for the given model and dataset,
;; with a map of additional options and using defaults appropriate
;; for model stacking
(define (make-batch ds-id mod-id opts)
  (create-batchprediction (merge {"all_fields" true
                                  "output_dataset" true
                                  "dataset" ds-id
                                  "model" (wait-forever mod-id)}
                                 opts)))

;; Auxiliary function extracting the input_fields of a model
(define (model-inputs mod-id)
  (get (fetch mod-id) "input_fields"))
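A hypothetical usage, chaining the two helpers: run a batch prediction of one of the stack's models over the holdout set and recover the dataset of its results.

;; Hypothetical usage: batch-predict one model over the holdout set,
;; wait for it to finish, then extract the resulting dataset id.
(define scored-ds-id
  (batch-dataset (wait-forever (make-batch hold-id mod-id {}))))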

Page 23

Library-based scripts

Script for creating the models:

(define stack (make-stack dataset-id))

Script for predictions using the stack:

(define (make-prediction exec-id input-data)
  (let (exec (fetch exec-id)
        stack (nth (head (get-in exec ["execution" "outputs"])) 1)
        models (get stack "models")
        metamodel (get stack "metamodel"))
    (when (get stack "result")
      (try (make-stack-prediction models metamodel input-data)
           (catch e (log-info "Error: " e) false)))))

(define prediction-id (make-prediction exec-id input-data))
(define prediction (when prediction-id (fetch prediction-id)))

https://github.com/whizzml/examples/tree/master/stacked-generalization

Page 24

Section 6: A Brief Look at Gradient Boosting in WhizzML

Page 25

The Main Loop

• Given the currently predicted class probabilities, compute a gradient step that will push those probabilities in the right direction
• Learn regression trees to represent this step over the training set
• Make a prediction with each tree
• Sum this prediction with all gradient steps so far to get a set of scores for each point in the training data (one score for each class)
• Apply the softmax function to these sums to get a set of class probabilities for each point (see the formula below)
• Iterate!
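For reference (not on the original slide), the softmax that turns a point's summed scores s_1, ..., s_K into class probabilities is

  p_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}, \qquad k = 1, \ldots, K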

Clone it here: https://github.com/whizzml/examples/tree/master/gradient-boosting

Page 26

What will this look like in WhizzML?

• Several things here are machine learning operations
  - Constructing gradient models
  - Making predictions
• But several are not
  - Summing the gradient steps
  - Computing softmax probabilities
  - Computing gradients
• We don't want to do those things locally (data size, resource concerns)
• Can we do these things on BigML's infrastructure?

Page 27

Compute Gradients From Probabilities

• Let's just focus on computing the gradients for a moment
• Get the predictions from the previous iteration
  - The sum of all of the previous gradient steps is stored in a column
  - If this is the first iteration, assume the uniform distribution
• The gradient for class k is just y − p(k), where y is 1 if the point's class is k and 0 otherwise
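For reference (not on the original slide), this is the negative gradient of the multinomial log-loss. With one-hot labels y_k and softmax probabilities p_k,

  L = -\sum_k y_k \log p_k \quad\Rightarrow\quad \frac{\partial L}{\partial s_k} = p_k - y_k

so stepping against the gradient in score space gives the per-class target y_k − p_k, i.e. the y − p(k) above.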

Page 28

Computing Gradients

Features     Class Matrix   Current Probs
0.2    10    1  0  0        0.6  0.3  0.1
0.3    12    0  1  0        0.4  0.4  0.2
0.15   10    1  0  0        0.8  0.1  0.1
0.3    -5    0  0  1        0.2  0.3  0.5

Page 29

Computing Gradients

Features     Class Matrix   Current Probs    Gradients
0.2    10    1  0  0        0.6  0.3  0.1     0.4  -0.3  -0.1
0.3    12    0  1  0        0.4  0.4  0.2    -0.4   0.6  -0.2
0.15   10    1  0  0        0.8  0.1  0.1     0.2  -0.1  -0.1
0.3    -5    0  0  1        0.2  0.3  0.5    -0.2  -0.3   0.5

Page 30

Aside: WhizzML + Flatline

• How can we do computations on the data?
  - Use Flatline: a language for data manipulation
  - Executed in BigML as a dataset transformation
  - https://github.com/bigmlcom/flatline/blob/master/user-manual.md
• Benefits
  - Arbitrary operations on the data are now API calls
  - Computational details are taken care of
  - Upload your data once, do anything to it
• Flatline is a first-class citizen of WhizzML

Page 31

Creating a new feature in Flatline

• We need to subtract one column value from another
• Flatline provides the f operator to get a named field value from any row:

  (- (f "actual") (f "predicted"))

• But remember, if we have n classes, we also have n gradients to construct!
• Enter WhizzML!
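One note on the flatline macro used in the code below: double braces interpolate a WhizzML value as a quoted Flatline literal, while single braces splice it in verbatim. A small illustration with hypothetical bindings:

;; With fname bound to "000001" and predicted bound to "0.5":
(flatline "(f {{fname}})")                   ;; => (f "000001")
(flatline "(- (f {{fname}}) {predicted})")   ;; => (- (f "000001") 0.5)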

Page 32

Compute Gradients: Code

(define (compute-gradient dataset nclasses iteration)
  (let (next-names (grad-names nclasses iteration)
        preds (if (> iteration 0)
                  (map (lambda (n) (flatline "(f {{n}})"))
                       (softmax-names nclasses iteration))
                  (repeat nclasses (str (/ 1 nclasses))))
        tns (truth-names nclasses)
        fexp (lambda (idx)
               (let (actual (nth tns idx)
                     predicted (nth preds idx))
                 (flatline "(- (f {{actual}}) {predicted})")))
        new-fields (make-fields next-names (map fexp (range nclasses))))
    (add-fields dataset new-fields [])))

Page 33

Section 7: Wrapping Up

Page 34

What Have We Learned?

• You can implement workflows of arbitrary complexity with WhizzML
• The power of WhizzML combined with Flatline
• Editorial: the commodification of machine learning algorithms
  - Every language has its own ML algorithms now
  - With WhizzML, implement once and use anywhere
  - Never worry about architecture again
