Machine Learning Valencia
Morning class summary
Mercè Martín
BigML
Day 2
The Future of ML
José David Martín-Guerrero (IDAL, UV)
Machine learning project
All steps are connected and feedback is essential to succeed
Society has drifted toward the Machine Learning way: social networks, data acquisition, technologies...
Feature engineering challenges
High space dimensionality (#features >>> #samples)
Input preparation: selection, transformation, or attacking the model directly
Modelling strategies: paradox of choice
Too many algorithms and structures, no general purpose one?
Too many configuration options, no automatic choice?
Select your model by its structure, parameters (tuning) or search algorithm (e.g. deep learning: no feature engineering but hectic tuning; Azure: many choices)
Wish list: more automation
Workflows, model selection, tuning, representation, prediction strategies
The Future of ML
Existing techniques: Reinforcement learning
Is the environment definable as a state-space?
Is the evolution of this space driven by a set of actors?
Is there a goal to be maximized in the long term?
→ Then the problem is suitable for RL.
Prior experience + interaction → environment adaptation → policy
So far applied to synthetic problems and robotics but also suitable for marketing or medicine, and more to come!
Evaluating ML Algorithms II
GOLDEN RULE: Never use the same example for training the model and evaluating it!!
What if you don't have so much data? Sample and repeat!
José Hernández-Orallo (UPV)
Under-fitting: too general
Over-fitting: too specific
How can we detect them? By evaluating.
[Diagram: the data is split into training and test sets; learning on the training split produces hypotheses h1…hn, each evaluated on the held-out test split; repeated n times over n folds.]
Cross-validation:
o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.
Bootstrapping:
o We draw n samples with replacement for training and evaluate on the rest (the out-of-bag instances).
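The n-fold procedure above can be sketched in a few lines of plain Python (a minimal illustration; `k_fold_indices` is a hypothetical helper, not something from the talk):

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint
    folds; each fold serves once as the test set while the remaining
    n_folds - 1 folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

The metric is then computed on each of the n (train, test) pairs and averaged, and the final model is trained on all the data.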
Cost-sensitive evaluation: not all errors are equally costly.
Resulting cost matrix = Hadamard (element-wise) product of the cost matrix and the confusion matrix.
Cost Matrix (rows = predicted, columns = actual):

            open      close
  OPEN        0€       100€
  CLOSE   2,000€         0€

Confusion Matrices (rows = predicted, columns = actual):

  c1       open    close     c2      open    close     c3      open    close
  OPEN      300      500     OPEN       0        0     OPEN     400    5,400
  CLOSE     200   99,000     CLOSE    500   99,500     CLOSE    100   94,100

Resulting Matrices (Hadamard product):

  c1       open      close     c2        open    close     c3        open      close
  OPEN       0€    50,000€     OPEN        0€       0€     OPEN        0€   540,000€
  CLOSE 400,000€        0€     CLOSE 1,000,000€     0€     CLOSE 200,000€         0€

  TOTAL COST c1: 450,000€    TOTAL COST c2: 1,000,000€    TOTAL COST c3: 740,000€
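The totals above can be reproduced with a few lines (a sketch; rows are predicted OPEN/CLOSE, columns actual open/close):

```python
def total_cost(confusion, cost):
    """Sum of the Hadamard (element-wise) product of a confusion
    matrix and the cost matrix."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))

cost = [[0, 100],    # predicted OPEN:  actual open, actual close (EUR)
        [2000, 0]]   # predicted CLOSE
c1 = [[300, 500], [200, 99000]]
c2 = [[0, 0], [500, 99500]]
c3 = [[400, 5400], [100, 94100]]
```

`total_cost(c1, cost)` gives 450,000€, c2 gives 1,000,000€ and c3 gives 740,000€, matching the tables.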
External context: the set of classes and the cost estimation.
The confusion matrix & cost matrix can be characterized by just one number: the slope.
ROC (Receiver Operating Characteristic) analysis
Dynamic context (class distribution & cost matrix)
[ROC diagram: TPR on the vertical axis vs FPR on the horizontal axis, both ranging from 0 to 1.]
o Given several classifiers: we add the trivial (0,0) and (1,1) classifiers and construct the convex hull of their (FPR, TPR) points. The points on the edges are linear combinations of classifiers (p * Ca + (1−p) * Cb).
o The classifiers below the ROC convex hull are discarded.
o The best classifier (from those remaining) will be selected at application time via the slope.
Probabilistic context: soft ROC analysis. A single classifier with probability-weighted predictions can generate a ROC curve by changing the score threshold (each threshold gives a new classifier on the ROC curve).
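The threshold sweep can be sketched as follows (a hypothetical helper, assuming binary labels 1/0 and higher scores meaning "more positive"):

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct score threshold: predicting
    positive when score >= threshold turns a single probabilistic
    classifier into a whole family of crisp ones."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # the trivial "reject everything" classifier
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```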
Evaluating ML Algorithms II
AUC (Area Under the ROC Curve). For crisp classifiers, AUC is equivalent to the macro-averaged accuracy.
AUC is a good metric for classifiers and rankers:
A classifier with high AUC is a good ranker.
It is also good for a (uniform) range of operating conditions:
A model with very good AUC will have good accuracy for all operating conditions.
A model with very good accuracy for one operating condition can have very bad accuracy for another operating condition.
A classifier with high AUC can still have poor calibration (probability estimation).
Multidimensional classifications? ROC is problematic; AUC has been extended.
Regression? ROC has been extended; AUC is the error variance.
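The ranking interpretation of AUC can be made concrete: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (a sketch; `auc` is a hypothetical helper):

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly,
    counting ties as half; 1.0 means a perfect ranker."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that a classifier can score 1.0 here while its probabilities are badly calibrated: ranking and calibration are separate properties.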
Cluster Analysis
K-means clustering (illustrated with K=3)
Poul Petersen (BigML)
Unsupervised problem (unlabelled data)
Customer segmentation, item discovery (types), association (profiles), recommenders, active learning (group and label)
Distance and centers define the groups: K-means, but...
Things you need to tackle:
• What is the distance to a "missing value"? Replace missing values with defaults.
• What is the distance between categorical values? Map it into [0,1].
• What is the distance between text features? Vectorize and use cosine distance.
• Does it have to be Euclidean distance?
• Unknown "K"?
Problems: convergence (depends on initial conditions), scaling of dimensions.
K-means: starting from a subset of K points, recursively compute the distances of all points in the data to them and associate each with the closest. Define the center of each group as the new set of K points and repeat until there is no improvement.
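The loop just described can be sketched in a few lines (a minimal version with Euclidean distance and random initial centers; real implementations worry about the initialization and scaling issues listed above):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Start from K points of the data, assign every point to its
    closest center, recompute each center as the mean of its group,
    and repeat until there is no improvement."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # no improvement: converged
            break
        centers = new_centers
    return centers
```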
g-means clustering: increment K while testing the clusters for Gaussianity (illustrated with K growing to 5).
Unsupervised Data: Rank by dissimilarity
Why? Unusual instances, intrusion detection, fraud, incorrect data
• Given a group, try to single out the odd: remove outliers from data
Dataset → Anomaly Detector → score → remove outliers
It can be used at different layers and combined with clustering.
• Improve model competence: score test predictions to look for new instances dissimilar to the training instances (where the model is not competent).
• Compare against usual distributions: Gaussian, Benford's Law.
Anomaly Detection
Poul Petersen (BigML)
[Example: instances described by features such as "round", "skinny", "smooth", "corners"; the most unusual instance is "skinny" but not "smooth", has no "corners", and is not "round". What counts as different depends on the grouping features (prior knowledge).]
Grow a random decision tree until each instance is in its own leaf (random features and splits). Anomalous instances are "easy" to isolate and end up at shallow depth; normal instances are "hard" to isolate and end up deep.
Now repeat the process several times and assign an anomaly score (0 = similar, 1 = dissimilar) to any input data by computing how different the average depth for the instance is from the average depth of the training set.
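The isolation idea can be sketched in one dimension (a toy version; real anomaly detectors grow random decision trees over many features):

```python
import random

def isolation_depth(x, data, rng, depth=0, limit=50):
    """Number of random splits needed before x sits alone: points in
    sparse regions (anomalies) get separated after very few splits."""
    if len(data) <= 1 or depth >= limit:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the points on the same side of the split as x
    same_side = [v for v in data if (v <= split) == (x <= split)]
    return isolation_depth(x, same_side, rng, depth + 1, limit)

def avg_depth(x, train, trees=200, seed=0):
    """Average isolation depth over many random trees; a low average
    depth relative to the training instances signals an anomaly."""
    rng = random.Random(seed)
    sample = train + [x]  # isolate x within the training sample
    return sum(isolation_depth(x, sample, rng) for _ in range(trees)) / trees
```

An outlier far from a dense cluster is typically cut off by the very first split, so its average depth is close to 1, while points inside the cluster need several splits.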
Machine Learning Black Art
Charles Parker (BigML)
Even when you follow the yellow brick road...
Different models, feature engineering, evaluation metrics.
The house of horrors awaits you around the corner:
Huge hypothesis space, poorly picked loss function, cross-validation, drifting domain, reliance on research results.
● Huge hypothesis space: the possible classifiers you could build with an algorithm given the data. Choice!
Triple trade-off:
Use non-parametric methods.
As data scales, simpler models are desirable.
Big data often trumps modelling!
● Poorly picked loss function: standard loss functions (entropy, distance in formal space) are mathematically convenient but not always enough for real problems.
No info about the classes or the costs
False positive in disease diagnosis
False positive in face detection
False positive in thumbprint identification
Path dependence
Game playing
Let developers apply their own loss function: SVMlight, plugins in the splitting code, customized gradient descent...
OR
Hack the prediction (cascade classifiers).
Change the problem setting (time-based limits on the classifier, maximum loss).
Keep the error down with a certain probability. More complex: you need more data.
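One common way to "hack the prediction" under an asymmetric loss is to move the decision threshold instead of retraining (a sketch using the cost-matrix idea from the evaluation talk; the helper and costs are illustrative):

```python
def decide(p_positive, fp_cost, fn_cost):
    """Predict positive when the expected cost of predicting negative
    (p * fn_cost) exceeds the expected cost of predicting positive
    ((1 - p) * fp_cost); this gives threshold fp / (fp + fn)."""
    threshold = fp_cost / (fp_cost + fn_cost)
    return p_positive >= threshold
```

With a false positive far cheaper than a false negative, the classifier fires at much lower probabilities than the usual 0.5, and vice versa.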
Machine Learning Black Art
● Cross-validation hold-outs can lead to leakage: features or instances can be correlated across the test and train sets, giving optimistic performance estimates.
Law of averages and being off by one. Features correlated with the prediction target can bias predictions.
Photo dating: colors, borders... Beware of the group the instances belong to.
Aggregates and timestamps: instances close in time are highly correlated.
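A simple defence against this kind of leakage is to hold out whole groups rather than individual rows (a sketch; `group_split` is a hypothetical helper akin to grouped cross-validation):

```python
import random

def group_split(groups, test_fraction=0.3, seed=0):
    """Assign entire groups to either train or test, so correlated
    instances (same photo set, same time window) never straddle the
    train/test boundary."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return train, test
```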
Machine Learning Black Art
● Drifting Domain
Domain changes (document classification, sales prediction)
Adverse selection of training data (market data predictions, spam)
➢ The prior p(input) is changing → covariate shift
➢ The map p(output | input) is changing → concept drift
Symptoms: lots of errors, distribution changes. Compare to old data!
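Comparing new data to old can start with something as crude as a running check on the input mean (a toy covariate-shift alarm; real monitoring would track full distributions):

```python
import statistics

def drift_alert(train_inputs, new_inputs, n_sigmas=3.0):
    """Flag covariate shift when the mean of incoming inputs drifts
    more than n_sigmas standard errors from the training-time mean."""
    mu = statistics.mean(train_inputs)
    sigma = statistics.stdev(train_inputs)
    stderr = sigma / len(new_inputs) ** 0.5
    return abs(statistics.mean(new_inputs) - mu) > n_sigmas * stderr
```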
● Reliance on research results
Reality does not comply with theorems' initial assumptions (error bounds, sample complexity, convergence).
Rule of thumb: use academia as your starting point, but don't think it will solve all your problems. Keep learning.
Useful Things about ML
Charles Parker (BigML)
Advice from Dijkstra
● Killing Ambitious Projects: identify sub-problems you can tackle; hard vs easy, and hacking is all right. Good candidates:
No human expert can predict in complex environments (protein folding).
Humans can't explain how they know f(x) (character recognition).
f(x) is changing all the time (market data).
f(x) must be specialized many times (anything user-specific).
● Ignoring the Lure of Complexity
Look for simplicity (remove spaghetti code, processes, drudgery).
Push the complexity around (clever compression).
Raw data might carry the information; sometimes it is the right way.
● Finding Your Own Humility
Know and embrace your own limits.
Continuously learn.
Do A/B tests: improve from an existing system.
● Avoiding Useless Projects
Look for the best combination of easy and big win.
Define metrics with the experts, but don't rely on them: monitor.
Advice from Dijkstra (continued)
● Creating a Good Story
Explain why; summarize your model and your data.
Stories are more valuable than models.
● Continuing to Learn
Don't get comfortable; work at the verge of your abilities.
Understand your limitations.
Learn from your errors.
Summary:
ML can be of value for every organization: find where.
Locating the right problem, executing, showing the proof.
When you win we all win, so good luck!!!