Machine Learning Valencia
Morning class summary
Mercè Martín
BigML
Day 2
The Future of ML
José David Martín-Guerrero (IDAL, UV)
Machine learning project
All steps are connected and feedback is essential to succeed
Society has drifted toward the Machine Learning way: social networks, data acquisition, technologies...
Feature engineering challenges
High space dimensionality (#features >>> #samples)
Input preparation: selection, transformation, or attacking the model directly
Modelling strategies: paradox of choice
Too many algorithms and structures, no general purpose one?
Too many configuration options, no automatic choice?
Select your model by its structure, parameters (tuning) or search algorithm (e.g. deep learning: no feature engineering but hectic tuning; Azure: many choices)
Wish list: more automation
Workflows, model selection, tuning, representation, prediction strategies
The Future of ML
Existing techniques: Reinforcement learning
Is the environment definable as a state-space?
Is the evolution of this space driven by a set of actors?
Is there a goal to be maximized in the long term?
→ Then the problem is suitable for RL.
Prior experience + interaction → environment adaptation → policy
So far applied to synthetic problems and robotics but also suitable for marketing or medicine, and more to come!
Evaluating ML Algorithms II
GOLDEN RULE: Never use the same example for training the model and evaluating it!!
What if you don't have so much data? Sample and repeat!
José Hernández-Orallo (UPV)
Under-fitting: too general
Over-fitting: too specific
How can we detect them? By evaluating.
[Diagram: the data is split into training and test sets; learning on the training split produces hypotheses h1…hn, each evaluated on the held-out test split; repeated n times over n folds.]
Cross-validation:
o We take all possible combinations with n−1 folds for training and the remaining fold for test.
o The error (or any other metric) is calculated n times and then averaged.
o A final model is trained with all the data.
Bootstrapping:
o We draw n samples with replacement for training and evaluate on the rest (the out-of-bag instances).
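The n-fold procedure above can be sketched in a few lines of plain Python (a minimal illustration; `k_fold_indices` is a hypothetical helper, not something from the talk):

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint
    folds; each fold serves once as the test set while the remaining
    n_folds - 1 folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

The metric is then computed on each of the n (train, test) pairs and averaged, and the final model is trained on all the data.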
Cost-sensitive evaluation: not all errors are equally costly.
Resulting cost matrix = Hadamard (element-wise) product of the cost matrix and the confusion matrix.
Cost Matrix (rows = predicted, columns = actual):

            open      close
  OPEN        0€       100€
  CLOSE   2,000€         0€

Confusion Matrices (rows = predicted, columns = actual):

  c1       open    close     c2      open    close     c3      open    close
  OPEN      300      500     OPEN       0        0     OPEN     400    5,400
  CLOSE     200   99,000     CLOSE    500   99,500     CLOSE    100   94,100

Resulting Matrices (Hadamard product):

  c1       open      close     c2        open    close     c3        open      close
  OPEN       0€    50,000€     OPEN        0€       0€     OPEN        0€   540,000€
  CLOSE 400,000€        0€     CLOSE 1,000,000€     0€     CLOSE 200,000€         0€

  TOTAL COST c1: 450,000€    TOTAL COST c2: 1,000,000€    TOTAL COST c3: 740,000€
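The totals above can be reproduced with a few lines (a sketch; rows are predicted OPEN/CLOSE, columns actual open/close):

```python
def total_cost(confusion, cost):
    """Sum of the Hadamard (element-wise) product of a confusion
    matrix and the cost matrix."""
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(cost)) for j in range(len(cost[0])))

cost = [[0, 100],    # predicted OPEN:  actual open, actual close (EUR)
        [2000, 0]]   # predicted CLOSE
c1 = [[300, 500], [200, 99000]]
c2 = [[0, 0], [500, 99500]]
c3 = [[400, 5400], [100, 94100]]
```

`total_cost(c1, cost)` gives 450,000€, c2 gives 1,000,000€ and c3 gives 740,000€, matching the tables.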
External context: the set of classes and the cost estimation.
The confusion matrix & cost matrix can be characterized by just one number: the slope.
ROC (Receiver Operating Characteristic) analysis
Dynamic context (class distribution & cost matrix)
[ROC diagram: TPR on the vertical axis vs FPR on the horizontal axis, both ranging from 0 to 1.]
o Given several classifiers: we add the trivial (0,0) and (1,1) classifiers and construct the convex hull of their (FPR, TPR) points. The points on the edges are linear combinations of classifiers (p * Ca + (1−p) * Cb).
o The classifiers below the ROC convex hull are discarded.
o The best classifier (from those remaining) will be selected at application time via the slope.
Probabilistic context: soft ROC analysis. A single classifier with probability-weighted predictions can generate a ROC curve by changing the score threshold (each threshold gives a new classifier on the ROC curve).
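The threshold sweep can be sketched as follows (a hypothetical helper, assuming binary labels 1/0 and higher scores meaning "more positive"):

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct score threshold: predicting
    positive when score >= threshold turns a single probabilistic
    classifier into a whole family of crisp ones."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # the trivial "reject everything" classifier
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```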
Evaluating ML Algorithms II
AUC (Area Under the ROC Curve). For crisp classifiers, AUC is equivalent to the macro-averaged accuracy.
AUC is a good metric for classifiers and rankers:
A classifier with high AUC is a good ranker.
It is also good for a (uniform) range of operating conditions:
A model with very good AUC will have good accuracy for all operating conditions.
A model with very good accuracy for one operating condition can have very bad accuracy for another operating condition.
A classifier with high AUC can still have poor calibration (probability estimation).
Multidimensional classifications? ROC is problematic; AUC has been extended.
Regression? ROC has been extended; AUC is the error variance.
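The ranking interpretation of AUC can be made concrete: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (a sketch; `auc` is a hypothetical helper):

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly,
    counting ties as half; 1.0 means a perfect ranker."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that a classifier can score 1.0 here while its probabilities are badly calibrated: ranking and calibration are separate properties.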
Cluster Analysis
K-means clustering (illustrated with K=3)
Poul Petersen (BigML)
Unsupervised problem (unlabelled data)
Customer segmentation, item discovery (types), association (profiles), recommenders, active learning (group and label)
Distance and centers define the groups: K-means, but...
Things you need to tackle:
• What is the distance to a "missing value"? Replace missing values with defaults.
• What is the distance between categorical values? Map it into [0,1].
• What is the distance between text features? Vectorize and use cosine distance.
• Does it have to be Euclidean distance?
• Unknown "K"?
Problems: convergence (depends on initial conditions), scaling of dimensions.
K-means: starting from a subset of K points, recursively compute the distances of all points in the data to them and associate each with the closest. Define the center of each group as the new set of K points and repeat until there is no improvement.
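The loop just described can be sketched in a few lines (a minimal version with Euclidean distance and random initial centers; real implementations worry about the initialization and scaling issues listed above):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Start from K points of the data, assign every point to its
    closest center, recompute each center as the mean of its group,
    and repeat until there is no improvement."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # no improvement: converged
            break
        centers = new_centers
    return centers
```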
g-means clustering: increment K while testing the clusters for Gaussianity (illustrated with K growing to 5).
Unsupervised Data: Rank by dissimilarity
Why? Unusual instances, intrusion detection, fraud, incorrect data
• Given a group, try to single out the odd: remove outliers from data
Dataset → Anomaly Detector → score → remove outliers
It can be used at different layers and combined with clustering.
• Improve model competence: score test predictions to look for new instances dissimilar to the training instances (where the model is not competent).
• Compare against usual distributions: Gaussian, Benford's Law.
Anomaly Detection
Poul Petersen (BigML)
[Example: instances described by features such as "round", "skinny", "smooth", "corners"; the most unusual instance is "skinny" but not "smooth", has no "corners", and is not "round". What counts as different depends on the grouping features (prior knowledge).]
Grow a random decision tree until each instance is in its own leaf (random features and splits). Anomalous instances are "easy" to isolate and end up at shallow depth; normal instances are "hard" to isolate and end up deep.
Now repeat the process several times and assign an anomaly score (0 = similar, 1 = dissimilar) to any input data by computing how different the average depth for the instance is from the average depth of the training set.
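The isolation idea can be sketched in one dimension (a toy version; real anomaly detectors grow random decision trees over many features):

```python
import random

def isolation_depth(x, data, rng, depth=0, limit=50):
    """Number of random splits needed before x sits alone: points in
    sparse regions (anomalies) get separated after very few splits."""
    if len(data) <= 1 or depth >= limit:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the points on the same side of the split as x
    same_side = [v for v in data if (v <= split) == (x <= split)]
    return isolation_depth(x, same_side, rng, depth + 1, limit)

def avg_depth(x, train, trees=200, seed=0):
    """Average isolation depth over many random trees; a low average
    depth relative to the training instances signals an anomaly."""
    rng = random.Random(seed)
    sample = train + [x]  # isolate x within the training sample
    return sum(isolation_depth(x, sample, rng) for _ in range(trees)) / trees
```

An outlier far from a dense cluster is typically cut off by the very first split, so its average depth is close to 1, while points inside the cluster need several splits.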
Machine Learning Black Art
Charles Parker (BigML)
Even when you follow the yellow brick road...
Different models, feature engineering, evaluation metrics.
The house of horrors awaits you around the corner:
Huge hypothesis space, poorly picked loss function, cross-validation, drifting domain, reliance on research results.
● Huge hypothesis space: the possible classifiers you could build with an algorithm given the data. Choice!
Triple trade-off:
Use non-parametric methods.
As data scales, simpler models are desirable.
Big data often trumps modelling!
● Poorly picked loss function: standard loss functions (entropy, distance in formal space) are mathematically convenient but not always enough for real problems.
No info about the classes or the costs
False positive in disease diagnosis
False positive in face detection
False positive in thumbprint identification
Path dependence
Game playing
Let developers apply their own loss function: SVMlight, plugins in the splitting code, customized gradient descent...
OR
Hack the prediction (cascade classifiers).
Change the problem setting (time-based limits on the classifier, maximum loss).
Keep the error down with a certain probability. More complex: you need more data.
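One common way to "hack the prediction" under an asymmetric loss is to move the decision threshold instead of retraining (a sketch using the cost-matrix idea from the evaluation talk; the helper and costs are illustrative):

```python
def decide(p_positive, fp_cost, fn_cost):
    """Predict positive when the expected cost of predicting negative
    (p * fn_cost) exceeds the expected cost of predicting positive
    ((1 - p) * fp_cost); this gives threshold fp / (fp + fn)."""
    threshold = fp_cost / (fp_cost + fn_cost)
    return p_positive >= threshold
```

With a false positive far cheaper than a false negative, the classifier fires at much lower probabilities than the usual 0.5, and vice versa.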
Machine Learning Black Art
● Cross-validation hold-outs can lead to leakage: features or instances can be correlated across the test and train sets, giving optimistic performance estimates.
Law of averages and being off by one. Features correlated with the prediction target can bias predictions.
Photo dating: colors, borders... Beware of the group the instances belong to.
Aggregates and timestamps: instances close in time are highly correlated.
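A simple defence against this kind of leakage is to hold out whole groups rather than individual rows (a sketch; `group_split` is a hypothetical helper akin to grouped cross-validation):

```python
import random

def group_split(groups, test_fraction=0.3, seed=0):
    """Assign entire groups to either train or test, so correlated
    instances (same photo set, same time window) never straddle the
    train/test boundary."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return train, test
```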
Machine Learning Black Art
● Drifting Domain
Domain changes (document classification, sales prediction)
Adverse selection of training data (market data predictions, spam)
➢ The prior p(input) is changing → covariate shift
➢ The map p(output | input) is changing → concept drift
Symptoms: lots of errors, distribution changes. Compare to old data!
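Comparing new data to old can start with something as crude as a running check on the input mean (a toy covariate-shift alarm; real monitoring would track full distributions):

```python
import statistics

def drift_alert(train_inputs, new_inputs, n_sigmas=3.0):
    """Flag covariate shift when the mean of incoming inputs drifts
    more than n_sigmas standard errors from the training-time mean."""
    mu = statistics.mean(train_inputs)
    sigma = statistics.stdev(train_inputs)
    stderr = sigma / len(new_inputs) ** 0.5
    return abs(statistics.mean(new_inputs) - mu) > n_sigmas * stderr
```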
● Reliance on research results
Reality does not comply with theorems' initial assumptions (error bounds, sample complexity, convergence).
Rule of thumb: use academia as your starting point, but don't think it will solve all your problems. Keep learning.
Useful Things about ML
Charles Parker (BigML)
Advice from Dijkstra
● Killing Ambitious Projects: identify sub-problems you can tackle; hard vs easy, and hacking is all right. Good candidates:
No human expert can predict in complex environments (protein folding).
Humans can't explain how they know f(x) (character recognition).
f(x) is changing all the time (market data).
f(x) must be specialized many times (anything user-specific).
● Ignoring the Lure of Complexity
Look for simplicity (remove spaghetti code, processes, drudgery).
Push the complexity around (clever compression).
Raw data might carry the information; sometimes it is the right way.
● Finding Your Own Humility
Know and embrace your own limits.
Continuously learn.
Do A/B tests: improve from an existing system.
● Avoiding Useless Projects
Look for the best combination of easy and big win.
Define metrics with the experts, but don't rely on them: monitor.
Advice from Dijkstra (continued)
● Creating a Good Story
Explain why; summarize your model and your data.
Stories are more valuable than models.
● Continuing to Learn
Don't get comfortable; work at the verge of your abilities.
Understand your limitations.
Learn from your errors.
Summary:
ML can be of value for every organization: find where.
Locating the right problem, executing, showing the proof.
When you win we all win, so good luck!!!