Deriving Knowledge from Data at Scale
Models in Production
Putting an ML Model into Production
• A/B Testing
Controlled Experiments in One Slide
Concept is Trivial
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
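As a rough illustration of the "statistical tests" bullet, a minimal sketch of a two-proportion z-test for a conversion-rate metric follows. The counts are hypothetical, and the test choice (pooled z-test, two-sided) is an assumption; the slides do not name a specific test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: A and B share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null hypothesis.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 10,000 users in control (A) and in treatment (B).
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
significant_at_5_percent = p_value < 0.05
```

With these made-up numbers the observed lift (5.0% to 5.6%) does not clear the conventional 5% threshold, which is the kind of difference that can look real without a test.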
Best Practice: A/A Test
Run A/A tests
before
Best Practice: Ramp-up
Ramp-up
Best Practice: Run Experiments at 50/50
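One way to see why 50/50 is the recommended split is to compare the variance of the treatment-minus-control difference across split ratios. The sketch below assumes equal per-user variance in both groups and a fixed total traffic; the function names are illustrative only.

```python
def relative_variance(treatment_fraction):
    """Variance of the treatment-control difference, up to a constant factor.

    For a fixed total of n users, Var(diff) is proportional to
    1/n_treatment + 1/n_control = 1 / (n * f * (1 - f)).
    """
    f = treatment_fraction
    return 1.0 / (f * (1.0 - f))

def duration_multiplier(treatment_fraction):
    """How many times more traffic a split needs to match a 50/50 split's variance."""
    return relative_variance(treatment_fraction) / relative_variance(0.5)

multiplier_for_1_percent = duration_multiplier(0.01)
multiplier_for_10_percent = duration_multiplier(0.10)
```

Under these assumptions a 99/1 split needs roughly 25 times the traffic (or running time) of a 50/50 split to reach the same sensitivity.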
Cost based learning
Imbalanced Class Distribution & Error Costs
WEKA cost-sensitive learning
weighting method
false negatives (FN): try to avoid false negatives
Imbalanced Class Distribution
WEKA cost-sensitive learning
Preprocess, Classify
meta.CostSensitiveClassifier
set the FN to 100, FP to 10
tries to optimize accuracy or error; can be cost-sensitive: decision trees, rule learner
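To make the effect of the cost settings concrete, here is a small sketch of the minimum-expected-cost decision rule in plain Python. It is not WEKA's API; the costs are taken as written on the slide (FN = 100, FP = 10), and the probabilities are hypothetical classifier outputs.

```python
COST_FN = 100.0  # cost of missing a positive, as written on the slide
COST_FP = 10.0   # cost of a false alarm, as written on the slide

def min_cost_label(p_positive, cost_fn=COST_FN, cost_fp=COST_FP):
    """Pick the label with the lower expected misclassification cost."""
    cost_if_predict_negative = p_positive * cost_fn          # we risk a false negative
    cost_if_predict_positive = (1.0 - p_positive) * cost_fp  # we risk a false positive
    if cost_if_predict_positive <= cost_if_predict_negative:
        return "positive"
    return "negative"

def decision_threshold(cost_fn=COST_FN, cost_fp=COST_FP):
    """Probability above which predicting 'positive' is the cheaper choice."""
    return cost_fp / (cost_fp + cost_fn)

threshold = decision_threshold()
label_at_20_percent = min_cost_label(0.20)
label_at_5_percent = min_cost_label(0.05)
```

Only the ratio of the two costs matters here: with a 10:1 ratio the threshold drops from 0.5 to about 0.09, so the classifier flags far more cases as positive in order to avoid false negatives.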
curated, completely specify a problem, measure progress
paired with a metric, target SLAs, scoreboard
This isn't easy...
• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long-lasting choices, and the rewards are delayed...
enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold
• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features. But only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases, we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
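The grouping question in the checklist can be sketched as a split that assigns whole groups, rather than individual examples, to either training or test. The helper name `group_split` and the toy handwriting data are hypothetical; this is one simple way to do it, not a prescribed procedure.

```python
import random
from collections import defaultdict

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split examples so that no group appears in both train and test."""
    groups = defaultdict(list)
    for example in examples:
        groups[group_of(example)].append(example)
    keys = sorted(groups)                 # fixed order before the seeded shuffle
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [ex for k in keys if k not in test_keys for ex in groups[k]]
    test = [ex for k in keys if k in test_keys for ex in groups[k]]
    return train, test

# Hypothetical data: 10 writers, 5 words each, stored as (writer, word) pairs.
words = [(f"writer{w}", f"word{i}") for w in range(10) for i in range(5)]
train, test = group_split(words, group_of=lambda ex: ex[0])
train_writers = {writer for writer, _ in train}
test_writers = {writer for writer, _ in test}
```

Because entire writers are held out, test performance measures generalization to unseen writers rather than to unseen words from familiar writers.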
Greatest Challenge in Machine Learning
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
The greatest challenge in Machine Learning: Lack of Labelled Training Data...
What to Do
• Controlled Experiments – get feedback from user to serve as labels
• Mechanical Turk – pay people to label data to build training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target/Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model output on Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
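The "cluster, then perform majority vote" recipe can be sketched in a few lines. The version below uses a small k-means with a deterministic farthest-first initialization (an assumption made to keep the example reproducible) and hypothetical 2-D points where only two instances carry labels.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    # Farthest-first initialization: start at the first point, then repeatedly
    # add the point that is farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assign

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class, or None if unlabeled. Every point gets its cluster's majority label."""
    assign = kmeans(points, k)
    votes = {j: Counter() for j in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    cluster_label = {j: (c.most_common(1)[0][0] if c else None) for j, c in votes.items()}
    return [cluster_label[a] for a in assign]

# Hypothetical data: two well-separated groups, one labeled instance in each.
points = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (1.1, 0.7),
          (8.0, 8.2), (7.7, 8.1), (8.3, 7.9), (8.1, 8.4)]
labels = ["+", None, None, None, "-", None, None, None]
predicted = cluster_then_vote(points, labels)
```

The approach only works as well as the clustering assumption holds: if a class spans several clusters, or a cluster contains no labeled instance, the vote has nothing reliable to go on.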
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
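A minimal sketch of Expectation-Maximization for a two-component, one-dimensional mixture of Gaussians is shown below. The data and the crude initialization (smallest and largest value as starting means) are assumptions for illustration; a different initialization can end in a different local optimum.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    overall_mean = sum(xs) / len(xs)
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / len(xs)
    means = [min(xs), max(xs)]            # crude initialization (assumption)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances, resp

# Hypothetical data: one group near 1.0 and one near 5.0.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = em_two_gaussians(xs)
```

In a semi-supervised setting the labeled instances would have their assignment probabilities fixed to their known class, while the unlabeled ones keep the soft assignments computed in the E-step.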
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
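The four self-training steps above can be sketched with any base learner. The example below uses a nearest-centroid classifier as a stand-in (an assumption; the slides do not name a base model) and hypothetical 2-D data with one labeled instance per class.

```python
import math

def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        s = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}

def predict(centroids, p):
    return min(centroids, key=lambda y: math.dist(p, centroids[y]))

def self_train(labeled, unlabeled, max_rounds=20):
    xs = [p for p, _ in labeled]
    ys = [y for _, y in labeled]
    model = fit_centroids(xs, ys)                        # 1. learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]   # 2. apply to unlabeled
        if new_pseudo == pseudo:                         # 4. stop when labels are stable
            break
        pseudo = new_pseudo
        model = fit_centroids(xs + unlabeled, ys + pseudo)    # 3. learn on all instances
    return model, pseudo

# Hypothetical data.
labeled = [((1.0, 1.0), "+"), ((6.0, 6.0), "-")]
unlabeled = [(1.5, 1.2), (2.0, 2.5), (3.0, 3.2), (7.0, 7.5), (8.0, 7.0), (9.0, 8.5)]
model, pseudo_labels = self_train(labeled, unlabeled)
```

Self-training can reinforce its own early mistakes, so in practice one often adds only the most confident predictions in each round; that refinement is left out here to keep the sketch short.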
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin (distance) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
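A single-process sketch of the stochastic gradient descent idea is given below; it is not the Hadoop implementation. Several details are assumptions: unlabeled instances are treated as if they carried the label the current classifier predicts for them, the learning rate decays as 1/sqrt(t+1), and the labeled instances are visited once first as a warm start so the classifier is non-zero before any unlabeled instance is scored.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_semi_supervised_svm(labeled, unlabeled, lr0=0.1, lam=0.01,
                            unlabeled_weight=0.5, seed=0):
    """Linear classifier trained by SGD on hinge loss (labels in {-1, +1})."""
    rng = random.Random(seed)
    w = [0.0] * len(labeled[0][0])
    b = 0.0
    # One run over all data in random order, preceded by the warm start.
    rest = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(rest)
    stream = list(labeled) + rest
    for t, (x, y) in enumerate(stream):
        lr = lr0 / (1.0 + t) ** 0.5          # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            # Unlabeled: use the currently predicted side as a provisional label.
            y, weight = (1 if score >= 0 else -1), unlabeled_weight
        else:
            weight = 1.0
        w = [wi * (1.0 - lr * lam) for wi in w]   # regularization shrink
        if y * score < 1.0:
            # Inside the margin or misclassified: move the classifier a bit.
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
            b += lr * weight * y
    return w, b

# Hypothetical data: two labeled points and six unlabeled ones.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(0.5, 1.0), (1.5, 2.5), (3.0, 2.0),
             (-1.0, -0.5), (-1.5, -2.5), (-3.0, -2.0)]
w, b = sgd_semi_supervised_svm(labeled, unlabeled)
unlabeled_predictions = [1 if dot(w, x) + b >= 0 else -1 for x in unlabeled]
```

Because the objective is non-convex, the result depends on the visiting order; the "many random runs" bullet corresponds to calling this function with different seeds and keeping the run with the best objective or validation score.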
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
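Label propagation on a nearest-neighbor similarity graph can be sketched as follows. The graph construction (symmetric k-nearest-neighbor, unweighted edges), the clamping of labeled nodes, and the toy points are assumptions chosen for brevity.

```python
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbors (edges made symmetric)."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def label_propagation(points, labels, k=2, iters=100, tol=1e-6):
    """labels[i] is +1, -1, or None. Returns a score in [-1, 1] per point."""
    neighbors = knn_graph(points, k)
    scores = [float(y) if y is not None else 0.0 for y in labels]
    for _ in range(iters):
        new_scores = []
        for i, y in enumerate(labels):
            if y is not None:
                new_scores.append(float(y))   # labeled nodes stay clamped
            else:
                # Unlabeled nodes take the average score of their neighbors.
                new_scores.append(sum(scores[j] for j in neighbors[i]) / len(neighbors[i]))
        change = max(abs(a - c) for a, c in zip(scores, new_scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Hypothetical data: two groups, one labeled instance in each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
labels = [1, None, None, None, -1, None, None, None]
scores = label_propagation(points, labels)
predicted = [1 if s >= 0 else -1 for s in scores]
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion; iterating avoids forming that inverse explicitly.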
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break...
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: GET THE DATA
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A | B
• A was 85 better
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
A | B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
Get the data; prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw!
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you're the decision maker
It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime were effective: death rate fell from 18% to 1%
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real Data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
causal
If you don't know where you are going, any road will take you there
-- Lewis Carroll
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Out of Class Reading
Eight (8) page conference paper
40 page journal version...
Course Project
Due Oct 25th
Open Discussion on Course Project...
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Experiment
Tools that you can use in your experiment: for feature selection, large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
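The advice in step 5 about reporting metrics with confidence intervals can be sketched with k-fold cross-validation. Everything here is illustrative: the threshold "model", the toy data, and the normal-approximation interval over fold accuracies, which is only a rough guide because fold scores are not independent.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return mean accuracy and an approximate 95% interval across folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)     # the model never sees the held-out fold
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)

# Toy 1-D "model": a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2.0

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical, cleanly separable data: class 0 in 0..9, class 1 in 20..29.
xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
mean_accuracy, interval = cross_validate(xs, ys, fit_threshold, predict_threshold)
```

On real data, a cross-validated score that looks this perfect is exactly the situation where step 5 suggests checking for a target leak before celebrating.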
Access Data -> Pre-processing -> Feature construction -> Model scoring
Book Recommendation
That's all for our course...
Deriving Knowledge from Data at Scale
Models in Production
Deriving Knowledge from Data at Scale
Putting an ML Model into Production
bull AB Testing
Deriving Knowledge from Data at Scale
Controlled Experiments in One Slide
Concept is Trivial
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Best Practice AA Test
Run AA tests
before
Deriving Knowledge from Data at Scale
Best Practice Ramp-up
Ramp-up
Deriving Knowledge from Data at Scale
Best Practice Run Experiments at 5050
Deriving Knowledge from Data at Scale
Cost based learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning
weighting method
false negatives FNtry to avoid
false negatives
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Preprocess Classify
metaCostSensitiveClassifier
set the FN to 100 FP to 10
tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
curatedcompletely specify a problem measure progress
paired with a metric target SLAs scoreboard
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
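The Weighting/Slicing item above can be illustrated with a metric that carries the error weights itself, so the data never has to be resampled. This is a minimal sketch; the class names and weights are hypothetical.

```python
def weighted_error(y_true, y_pred, error_weights):
    """Average cost per example, where error_weights[(true, predicted)] is the
    cost of that kind of mistake and correct predictions cost 0."""
    total = 0.0
    for truth, guess in zip(y_true, y_pred):
        if truth != guess:
            total += error_weights[(truth, guess)]
    return total / len(y_true)

# Hypothetical costs: flagging good mail as spam is 5x worse than missing spam.
WEIGHTS = {("spam", "ham"): 1.0, ("ham", "spam"): 5.0}
y_true = ["spam", "ham", "ham", "spam"]
y_pred = ["ham", "ham", "spam", "spam"]
score = weighted_error(y_true, y_pred, WEIGHTS)  # (1.0 + 5.0) / 4
```

Because the weights live in the metric, the same function can be run unchanged on production, training, and test data.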
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
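The "top instances" pass described in item 7 can be sketched as a frequency count over a separate, unlabeled optimization set. The names and the toy IP addresses below are hypothetical, and `top_k=2` stands in for the 10M of the text.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to ids 1..top_k; id 0 is reserved for all others."""
    counts = Counter(values)
    return {value: i + 1 for i, (value, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocabulary):
    """Return the value's id, or 0 if it is not frequent enough to get its own parameter."""
    return vocabulary.get(value, 0)

# Unlabeled data, disjoint from the training and test sets.
optimization_set = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
vocab = build_vocabulary(optimization_set, top_k=2)
```

Building the vocabulary on a disjoint set keeps the choice of which values get their own parameters from leaking information out of the test set.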
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
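Option (c) above, recasting the problem as a time series, amounts to building sliding-window patterns. A minimal sketch, with a hypothetical function name and made-up values:

```python
def make_windows(series, T):
    """Turn a series into (inputs, target) patterns: the inputs are the T values
    at times t-T .. t-1, and the target is the value at time t."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

daily_clicks = [10, 12, 11, 15, 14, 18]  # hypothetical values
patterns = make_windows(daily_clicks, T=3)
```

Each pattern is larger than a single observation, which is the trade-off the text describes: bigger inputs in exchange for a problem that is closer to IID.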
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
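The grouping question in item 10 can be made concrete with a split that assigns whole groups, rather than individual examples, to train or test. This is a minimal sketch; the function name, the writer ids, and the 25% test fraction are assumptions for illustration.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group (e.g. a writer) lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical (writer_id, word) pairs.
examples = [("w1", "cat"), ("w1", "dog"), ("w2", "cat"),
            ("w3", "tree"), ("w3", "dog"), ("w4", "cat")]
train, test = grouped_split(examples, group_of=lambda e: e[0])
```

The same word may appear on both sides of the split, but no writer does, so the test score reflects generalization to unseen handwriting.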
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Greatest Challenge in Machine Learning
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do?
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not’, review a product, observe their click behavior (ad retargeting, search results, etc.)
What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction
Age  Income  LikesBook
22   67K     +
39   41K     -
Model
Attributes: Age, Income
Target (Label): LikesBook
© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
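The "cluster, then perform majority vote" recipe can be sketched in a few lines. The example below uses a tiny one-dimensional k-means with k=2 as the clustering step; the data and function names are hypothetical, and a real application would use a library clusterer on all attributes.

```python
from collections import Counter

def two_means_1d(values, iterations=20):
    """Tiny k-means with k=2 on 1-D data; the centers start at the min and max."""
    centers = [min(values), max(values)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        assignment = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        for k in (0, 1):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:
                centers[k] = sum(members) / len(members)
    return assignment

def cluster_then_label(values, labels):
    """labels[i] is a class or None; unlabeled points get their cluster's majority label."""
    assignment = two_means_1d(values)
    majority = {}
    for k in set(assignment):
        votes = Counter(l for l, a in zip(labels, assignment) if a == k and l is not None)
        majority[k] = votes.most_common(1)[0][0] if votes else None
    return [l if l is not None else majority[a] for l, a in zip(labels, assignment)]

ages = [20, 24, 26, 33, 35, 52, 60, 65, 47]                # hypothetical customers
likes = ["+", None, "+", None, None, "-", None, "-", None]  # only four are labeled
filled = cluster_then_label(ages, likes)
```

The result is only as good as the clustering assumption: if the classes do not form distinct clusters, the majority vote spreads wrong labels.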
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
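The two EM steps can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration under stated assumptions: the data are made up, the starting means are simply the minimum and maximum, and the optional `labels` argument (an addition for this sketch) clamps the cluster assignment of labeled instances so that unlabeled instances are fitted around them.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, labels=None, iterations=50):
    """Fit a 2-component 1-D Gaussian mixture; labels[i] in {0, 1, None}."""
    labels = labels if labels is not None else [None] * len(xs)
    weights, means, variances = [0.5, 0.5], [min(xs), max(xs)], [1.0, 1.0]
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x, label in zip(xs, labels):
            if label is not None:
                resp.append([1.0 if k == label else 0.0 for k in (0, 1)])
            else:
                dens = [w * normal_pdf(x, m, v)
                        for w, m, v in zip(weights, means, variances)]
                total = sum(dens)
                resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape from the soft assignments.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(1e-6, sum(r[k] * (x - means[k]) ** 2
                                         for r, x in zip(resp, xs)) / nk)
    return weights, means, variances, resp

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
labels = [0, None, None, None, None, 1, None, None, None, None]
weights, means, variances, resp = em_two_gaussians(data, labels)
```

On well-separated data like this the starting point hardly matters; on harder data, different starts can end in different local optima, as the slide warns.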
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
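The four self-training steps above can be sketched with any base learner. The example below uses a nearest-centroid classifier on one-dimensional data purely to keep the code short; the learner, the data, and the function names are assumptions for illustration.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)                 # labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]   # apply to unlabeled
        if new_pseudo == pseudo:                                # converged
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,          # learn on all instances
                              labeled_y + pseudo)
    return model, pseudo

model, pseudo = self_train([1.0, 9.0], ["-", "+"], [2.0, 3.0, 4.5, 6.5, 7.0, 8.0])
```

Self-training can reinforce its own early mistakes, so in practice it is often restricted to the predictions the model is most confident about; this sketch omits that refinement.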
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
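A minimal sketch of the propagation loop on a small similarity graph: labeled nodes keep their value, and each unlabeled node repeatedly takes the average score of its neighbors. The graph, the +1/-1 score encoding, and the function name are assumptions for illustration.

```python
def propagate_labels(neighbors, seeds, iterations=100):
    """neighbors: node -> list of adjacent nodes; seeds: node -> +1.0 or -1.0.
    Seed nodes are clamped; every other node becomes the mean of its neighbors."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]
            else:
                updated[node] = sum(scores[a] for a in adjacent) / len(adjacent)
        scores = updated
    return scores

# A chain a - b - c - d - e in which only the two endpoints are labeled.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = propagate_labels(graph, seeds={"a": 1.0, "e": -1.0})
labels = {n: "+" if s > 0 else "-" if s < 0 else "?" for n, s in scores.items()}
```

The fixed point of this loop is the solution of a linear system, which is the sense in which the slide calls the method equivalent to matrix inversion. The middle node ends exactly between the two seeds, so it stays undecided.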
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break…
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: Get the data
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Must run statistical tests to confirm that differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B
• A was 85 better
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”
B has big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Get the data, prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you’re the decision maker
It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly when he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime was effective: the death rate fell from 18% to 1%
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
causal
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
before
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Course Project: Due Oct 25th
Open Discussion on Course Project…
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
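Reporting a metric with an interval, as step 5 asks, can be sketched with k-fold cross-validation. The helpers below are hypothetical names and only cover the bookkeeping: splitting indices into folds and summarizing per-fold scores. The interval is a rough normal approximation, and since fold scores are not fully independent it should be read as indicative only.

```python
import math
import statistics

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def mean_with_interval(scores, z=1.96):
    """Mean of the fold scores and an approximate 95% interval around it."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

folds = list(k_fold_indices(10, 3))
mean, low, high = mean_with_interval([0.8, 0.9, 1.0])  # hypothetical fold accuracies
```

If the data are grouped or ordered in time, the folds should respect that structure rather than being cut contiguously as here.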
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
Access Data
Pre-processing
Feature construction
Model scoring
Book Recommendation
That’s all for our course…
Deriving Knowledge from Data at Scale
Putting an ML Model into Production
bull AB Testing
Deriving Knowledge from Data at Scale
Controlled Experiments in One Slide
Concept is Trivial
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Best Practice AA Test
Run AA tests
before
Deriving Knowledge from Data at Scale
Best Practice Ramp-up
Ramp-up
Deriving Knowledge from Data at Scale
Best Practice Run Experiments at 5050
Deriving Knowledge from Data at Scale
Cost based learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning
weighting method
false negatives FNtry to avoid
false negatives
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Preprocess Classify
metaCostSensitiveClassifier
set the FN to 100 FP to 10
tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
curatedcompletely specify a problem measure progress
paired with a metric target SLAs scoreboard
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Deriving Knowledge from Data at Scale
Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip
1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)
2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)
3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen
Deriving Knowledge from Data at Scale
4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation
5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data and investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing ID may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
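The "top N most frequent instances" idea in item 7 can be sketched as a single counting pass that needs no labels. The helper names below are hypothetical, and the tiny N is only for readability; in practice the vocabulary would be built on the separate data set the item calls for.

```python
from collections import Counter


def build_vocabulary(values, top_n):
    """Map the top_n most frequent values to indices 0..top_n-1 (one pass, no labels)."""
    counts = Counter(values)
    return {value: index for index, (value, _) in enumerate(counts.most_common(top_n))}


def encode(value, vocabulary):
    """Return the value's own index, or a shared 'other' index for rare values."""
    return vocabulary.get(value, len(vocabulary))


# Hypothetical query log drawn from the feature-optimization set.
queries = ["shoes", "hats", "shoes", "socks", "shoes", "hats"]
vocab = build_vocabulary(queries, top_n=2)
print(vocab, encode("socks", vocab))
```

Rare values share one index, so only frequent values get their own trainable parameter.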
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast-tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
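Option (c) above, recasting the data as (inputs from t-T to t-1, target at time t) patterns, can be sketched with a sliding window. The function name and the toy series are assumptions for illustration only.

```python
def make_windows(series, window):
    """Turn a sequence into (inputs, target) patterns.

    inputs are the `window` values at times t-window .. t-1; target is the value at t.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [(series[t - window:t], series[t]) for t in range(window, len(series))]


# Hypothetical daily KPI values.
for inputs, target in make_windows([1, 2, 3, 4, 5], window=2):
    print(inputs, "->", target)
```

Each pattern is larger than a single observation, which matches the slide's remark that the patterns grow while the problem moves closer to IID.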
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
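The grouping concern in item 10 can be made concrete with a split that assigns whole groups (writers, advertisers, users) to either training or test, never both. This is a minimal sketch; `group_split` and its parameters are names invented for this example.

```python
import random


def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split examples so that no group appears in both train and test.

    group_of maps an example to its group key (e.g., the writer id).
    """
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test


# Hypothetical (writer, word) handwriting examples.
data = [("w1", "cat"), ("w1", "dog"), ("w2", "cat"),
        ("w3", "sun"), ("w4", "sky"), ("w5", "map")]
train, test = group_split(data, group_of=lambda e: e[0], test_fraction=0.4)
print(sorted({w for w, _ in test}), len(train), len(test))
```

Shuffling individual words instead would let the same writer leak into both sets, which is the failure mode the checklist warns about.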
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Greatest Challenge in Machine Learning
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target / Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Model: Prediction
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
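A minimal sketch of "cluster, then perform majority vote": run k-means on all instances (labeled and unlabeled), then give every instance the majority label of the labeled instances in its cluster. The farthest-point initialisation and all function names are assumptions of this sketch, chosen to keep the example deterministic.

```python
from collections import Counter


def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns the cluster index of each point."""
    # Farthest-point initialisation (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(_dist2(p, c) for c in centroids)))
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: _dist2(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment


def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label, or None for an unlabeled instance."""
    assignment = kmeans(points, k)
    votes = {j: Counter() for j in range(k)}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes[cluster][label] += 1
    majority = {j: (c.most_common(1)[0][0] if c else None) for j, c in votes.items()}
    return [majority[cluster] for cluster in assignment]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["+", None, None, "-", None, None]
print(cluster_then_vote(points, labels))
```

The approach only works to the extent that the clustering assumption holds, i.e., the classes really do form distinct clusters.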
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
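The two EM steps can be sketched for the simplest case, a two-component mixture of one-dimensional Gaussians. The initial guesses (means at the minimum and maximum, unit variances, equal weights) and the variance floor are assumptions of this sketch, not part of the slides.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, iterations=50):
    """Fit a 2-component 1-D Gaussian mixture; returns (means, variances, weights, resp)."""
    means = [min(xs), max(xs)]          # assumed initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(2)]
            total = sum(p) or 1e-300    # guard against numerical underflow
            resp.append([pj / total for pj in p])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor keeps the density finite
            weights[j] = nj / len(xs)
    return means, variances, weights, resp


means, variances, weights, resp = em_two_gaussians([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

The `resp` table holds the per-instance assignment probabilities the slide mentions; a different initialisation can end in a different local optimum.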
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
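The four self-training steps can be sketched with any base learner. A nearest-centroid classifier is used here purely as a stand-in (an assumption of this example), and all function names are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class, ignoring None labels."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        if y is None:
            continue
        s = sums.setdefault(y, [0.0] * len(p))
        for i, value in enumerate(p):
            s[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}


def predict(centroids, p):
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, centroids[y])))


def self_train(points, labels, max_rounds=20):
    """labels[i] is a class label, or None for an unlabeled instance."""
    model = fit_centroids(points, labels)            # 1. learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        # 2. apply the model to unlabeled instances (true labels are kept as-is)
        new_pseudo = [y if y is not None else predict(model, p)
                      for p, y in zip(points, labels)]
        if new_pseudo == pseudo:                     # 4. stop at convergence
            break
        pseudo = new_pseudo
        model = fit_centroids(points, pseudo)        # 3. learn a new model on all instances
    return model, pseudo


points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = ["a", None, None, None, None, "b"]
model, pseudo = self_train(points, labels)
print(pseudo, model)
```

A practical variant would only accept pseudo-labels above a confidence threshold; that refinement is omitted here.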
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
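The single-machine core of this procedure can be sketched as follows. This is a simplified illustration under several assumptions not stated on the slides: no bias term (data assumed centred), unlabeled instances are pushed away from the boundary using the classifier's current prediction, the class-balance constraint is omitted, and the best run is chosen by error on the labeled instances. All names and default values are hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_s3vm(labeled, unlabeled, seed=0, lr0=0.5, reg=0.01, unlabeled_weight=0.5):
    """One pass of SGD. labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)            # one run over the data in random order
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                       # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0 or abs(score) >= 1:    # no opinion yet, or already outside margin
                continue
            y, scale = (1 if score > 0 else -1), unlabeled_weight
        elif y * score >= 1:                     # labeled and outside the margin: no update
            continue
        else:
            scale = 1.0
        w = [(1 - lr * reg) * wi + lr * scale * y * xi for wi, xi in zip(w, x)]
    return w


def labeled_errors(w, labeled):
    return sum(1 for x, y in labeled if y * dot(w, x) <= 0)


def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the one with the fewest errors on labeled instances."""
    candidates = [sgd_s3vm(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: labeled_errors(w, labeled))


labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(3.0, 0.0), (-3.0, 0.0), (2.5, 0.0), (-2.5, 0.0)]
w = best_of_runs(labeled, unlabeled)
print(labeled_errors(w, labeled))
```

Because the objective is non-convex, different random orders can end in different classifiers, which is why the slide recommends many runs.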
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
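A minimal sketch of the propagation loop on a small similarity graph: labeled nodes are clamped to their score, and every unlabeled node repeatedly takes the average score of its neighbors. The unweighted-average update and the function name are assumptions of this sketch.

```python
def label_propagation(neighbors, seeds, iterations=100, tol=1e-6):
    """neighbors: dict node -> list of neighboring nodes (undirected graph).
    seeds: dict node -> score in [-1, +1] for labeled nodes (kept fixed)."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        new = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                new[node] = seeds[node]                      # clamp labeled nodes
            elif nbrs:
                new[node] = sum(scores[m] for m in nbrs) / len(nbrs)
            else:
                new[node] = 0.0                              # isolated node: no information
        change = max(abs(new[n] - scores[n]) for n in neighbors)
        scores = new
        if change < tol:                                     # repeat until convergence
            break
    return scores


# Chain graph a - b - c - d - e with the two ends labeled.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(label_propagation(graph, {"a": 1.0, "e": -1.0}))
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.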
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break…
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: Get the data
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
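One common way to check that a difference between A and B is "not due to chance" is a two-proportion z-test on conversion counts. The slides do not name a specific test, so this choice, the function name, and the example counts are all assumptions for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    conv_*: number of conversions; n_*: number of users in each variant.
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p_value


# Hypothetical counts: 10.0% vs 15.0% conversion on 1,000 users each.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(round(z, 2), p < 0.05)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it says nothing about whether the difference is large enough to matter.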
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B
• A was 85 better
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
Get the data, prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw!
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you're the decision maker
It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly when he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
• Chlorine and lime was effective: death rate fell from 18% to 1%
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
causal
If you don't know where you are going, any road will take you there
-- Lewis Carroll
before
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Course Project: Due Oct 25th
Open Discussion on Course Project…
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Feature selection, Model training, Model scoring, Evaluation, Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
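Two pieces of step 5 can be sketched in a few lines: generating cross-validation folds, and attaching a confidence interval to a measured accuracy. The contiguous (unshuffled) folds and the normal-approximation interval are simplifying assumptions of this sketch; the function names are hypothetical.

```python
import math


def kfold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for accuracy (z=1.96 gives about 95%)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


for train, test in kfold_indices(10, 3):
    print(len(train), test)

# Hypothetical result: 80 correct predictions out of 100 test examples.
print(accuracy_interval(80, 100))
```

If the data has group or time structure, the folds should respect it rather than being cut by position, otherwise the reported interval will be overly optimistic.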
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
Access Data, Pre-processing, Feature construction, Model scoring
Book Recommendation
That's all for our course…
Deriving Knowledge from Data at Scale
Controlled Experiments in One Slide
Concept is Trivial
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Best Practice AA Test
Run AA tests
before
Deriving Knowledge from Data at Scale
Best Practice Ramp-up
Ramp-up
Deriving Knowledge from Data at Scale
Best Practice Run Experiments at 5050
Deriving Knowledge from Data at Scale
Cost based learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning
weighting method
false negatives FNtry to avoid
false negatives
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Preprocess Classify
metaCostSensitiveClassifier
set the FN to 100 FP to 10
tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
curatedcompletely specify a problem measure progress
paired with a metric target SLAs scoreboard
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Deriving Knowledge from Data at Scale
Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip
1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)
2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)
3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen
Deriving Knowledge from Data at Scale
4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation
5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set
Deriving Knowledge from Data at Scale
6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets
a One set should have the deployed features (computed from the raw data) This provides the production yardstick
b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns
c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set
7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided
Deriving Knowledge from Data at Scale
8 Stale rate optimization monitoring How long does the set stay current In manycases we hide the fact that the problem is a time series even though the goal is topredict the future and we know that the distribution is changing We must quantifyhow much a distribution changes over a fixed period of time There are several waysto mitigate the changing distribution problem
a Assume the distribution is IID Regularly re-compute training sets and Gold sets Determine thefrequency of re-computation or set in place a system to monitor distribution drifts (monitor KPIchanges while the algorithm is kept constant)
b Decompose the model along ldquodistribution (fast) tracking parametersrdquo and slow tracking parametersThe fast tracking model may be a simple calibration with very few parameters
c Recast the problem as a time series problem patterns are (input data from t-T to t-1 prediction attime t) In this space the patterns are much larger but the problem is closer to being IID
9 The gold sets should have information that reveal the stale rate and allows algorithmsto differentiate themselves based on how they degrade with time
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have this type of constraint? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
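The grouping question in item 10 can be sketched as a split over entities rather than over individual examples. This is a minimal illustration; the helper name `grouped_split`, the writer ids, and the data are hypothetical.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting data: (writer_id, word) pairs, four words per writer.
words = [(w, f"word{i}") for w in ["ann", "bob", "cy", "dee", "eve"] for i in range(4)]
train, test = grouped_split(words, group_of=lambda e: e[0])
```

Because whole writers are held out, a test score measures generalization to unseen writers. Shuffling the individual words instead would leak each writer's style into both sets.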
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Greatest Challenge in Machine Learning
Labeled training data:
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
Train ML Model
New instances and model output:
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
The greatest challenge in Machine Learning: Lack of Labelled Training Data...
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?
Training Data (Attributes: Age, Income; Target: LikesBook; Label: + or -)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
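The "cluster, then perform majority vote" idea can be sketched as follows. This is a minimal illustration that assumes a plain k-means with caller-supplied starting centers; the function names and the toy points are hypothetical.

```python
from collections import Counter

def kmeans(points, centers, n_iter=20):
    """Plain k-means; returns the cluster index assigned to every point."""
    centers = [tuple(c) for c in centers]
    assign = []
    for _ in range(n_iter):
        assign = [min(range(len(centers)),
                      key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centers[k])))
                  for pt in points]
        for k in range(len(centers)):
            members = [pt for pt, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_then_label(points, labels, init_centers):
    """labels[i] is a class or None; unlabeled points get their cluster's majority label."""
    assign = kmeans(points, init_centers)
    out = list(labels)
    for k in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == k and l is not None)
        if not votes:
            continue  # a cluster without any labeled point stays unlabeled
        majority = votes.most_common(1)[0][0]
        for i, a in enumerate(assign):
            if a == k and out[i] is None:
                out[i] = majority
    return out

points = [(1, 1), (1.2, 0.9), (0.8, 1.1), (5, 5), (5.1, 4.8), (4.9, 5.2)]
labels = ["+", None, None, "-", None, None]
filled = cluster_then_label(points, labels, init_centers=[(0, 0), (6, 6)])
```

With one labeled point per cluster, every unlabeled point inherits the label of the cluster it falls into. The result is only as good as the assumption that classes and clusters coincide.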
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
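The two EM steps can be sketched for a two-component, one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions: labeled instances keep a fixed cluster assignment (a common semi-supervised variant), the means start at the data extremes, and a variance floor prevents collapse. The function name and data are hypothetical.

```python
import math

def em_gmm_1d(xs, labels=None, n_iter=50, var_floor=1e-3):
    """EM for a two-component 1-D Gaussian mixture.

    labels[i] may be 0, 1, or None; labeled points keep fixed responsibilities.
    """
    labels = labels or [None] * len(xs)
    mean = sum(xs) / len(xs)
    mu = [min(xs), max(xs)]
    var = [sum((x - mean) ** 2 for x in xs) / len(xs)] * 2
    pi = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    resp = []
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x, y in zip(xs, labels):
            if y is not None:
                resp.append([1.0 - y, float(y)])
                continue
            w = [pi[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                         var_floor)
            pi[k] = nk / len(xs)
    return mu, var, pi, resp

xs = [0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1]
labels = [0, None, None, None, None, None, None, 1]  # only two labeled instances
mu, var, pi, resp = em_gmm_1d(xs, labels)
```

On well-separated data like this the estimated means settle on the two cluster centers. On harder data the result depends on the starting point, which is the "local optimum" caveat from the slide.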
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
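The four self-training steps can be sketched with any base learner. The example below assumes a deliberately simple one-dimensional nearest-centroid classifier; the function names and the data are hypothetical.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)               # learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]  # apply to unlabeled
        if new_pseudo == pseudo:                               # converged
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,         # learn on all instances
                              labeled_y + pseudo)
    return model, pseudo

labeled_x, labeled_y = [1.0, 9.0], ["-", "+"]
unlabeled_x = [2.0, 3.0, 4.0, 6.5, 7.0, 8.0]
model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
```

Convergence here means the pseudo-labels stop changing between rounds. Self-training can also reinforce its own early mistakes, so the quality of the initial labeled-only model matters.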
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
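The SGD procedure can be sketched on a single machine as below. This is an illustration, not the Hadoop implementation: it assumes a one-dimensional linear classifier, a short warm-up on the labeled instances before the random-order pass, pseudo-labels taken from the current sign of the classifier, and no regularization term. All names and data are hypothetical.

```python
import random

def s3vm_sgd(labeled, unlabeled, c_unlabeled=0.5, lr0=0.1, seed=0):
    """One SGD run for a 1-D linear classifier f(x) = w*x + b.

    labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    # Warm-up on labeled data (an assumption of this sketch) so that the
    # pseudo-labels sign(f(x)) start from a sensible classifier.
    warmup = [(x, y, 1.0) for x, y in labeled]
    mixed = warmup + [(x, None, c_unlabeled) for x in unlabeled]
    rng.shuffle(mixed)  # one run over the data in random order
    for t, (x, y, weight) in enumerate(warmup + mixed):
        lr = lr0 / (1.0 + 0.1 * t)  # steps get smaller over time
        f = w * x + b
        target = y if y is not None else (1 if f >= 0 else -1)
        if target * f < 1:  # inside the margin: move the classifier a bit
            w += lr * weight * target * x
            b += lr * weight * target
    return w, b

labeled = [(-2.0, -1), (2.0, 1)]
unlabeled = [-2.2, -2.1, -1.9, -1.8, 1.8, 1.9, 2.1, 2.2]
w, b = s3vm_sgd(labeled, unlabeled)
```

`c_unlabeled` plays the role of the parameter that tunes the influence of unlabeled instances. Because the objective is non-convex, one would typically repeat the run with several seeds and keep the classifier with the best objective value.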
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
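The propagation loop can be sketched on a small similarity graph. This is a minimal illustration that assumes binary labels encoded as 1.0 and 0.0, labeled nodes clamped to their labels, and a fixed number of iterations; the function name and the graph are hypothetical.

```python
def label_propagation(weights, labels, n_iter=100):
    """weights[i][j]: symmetric similarity; labels[i]: 1.0, 0.0, or None (unlabeled)."""
    n = len(weights)
    score = [0.5 if y is None else y for y in labels]
    for _ in range(n_iter):
        new = []
        for i in range(n):
            if labels[i] is not None:
                new.append(labels[i])  # labeled nodes keep their label
                continue
            total = sum(weights[i])
            # Unlabeled nodes take the similarity-weighted mean of their neighbors.
            new.append(sum(w * s for w, s in zip(weights[i], score)) / total
                       if total else score[i])
        score = new
    return score

# Chain graph 0 - 1 - 2 - 3 - 4 with the two end nodes labeled.
chain = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]
scores = label_propagation(chain, [1.0, None, None, None, 0.0])
```

On the chain the scores interpolate between the two labeled ends. The same fixed point could be obtained by solving one linear system over the unlabeled nodes, which is the "matrix inversion" view on the slide.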
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break...
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: Get the data
Lesson 3: Prepare to be humbled
[Figure: Left Elevator, Right Elevator]
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
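One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion counts. The sketch below is an illustration: the choice of test, the function name, and the counts are assumptions, and real experiments may need other tests depending on the metric.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return z, p_value

# Hypothetical counts: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same. The causal interpretation still rests on users having been assigned to the variants at random.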
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
• A was 85 better
A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
get the data prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you're the decision maker
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime were effective: death rate fell from 18% to 1%
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
[Figure: cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
causal
"If you don't know where you are going, any road will take you there."
-- Lewis Carroll
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Out of Class Reading
Eight (8) page conference paper
40 page journal version...
Course Project: Due Oct 25th
Open Discussion on Course Project...
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Experiment
Tools that you can use in your experiment: feature selection, large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data for the Experiment
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
[Figure: workflow stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction]
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
[Figure: workflow stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
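Reporting a metric with a confidence interval from cross-validation can be sketched as below. This is a rough illustration under stated assumptions: contiguous folds, accuracy as the metric, a normal approximation for the interval, and a toy threshold classifier. All names and data are hypothetical.

```python
import math

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(xs, ys, fit, predict, k=5):
    """Return mean accuracy and the half-width of an approximate 95% interval."""
    accs = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test)
        accs.append(hits / len(test))
    mean = sum(accs) / k
    std = math.sqrt(sum((a - mean) ** 2 for a in accs) / (k - 1))
    return mean, 1.96 * std / math.sqrt(k)  # normal approximation

def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x >= threshold else 0

xs = list(range(20))
ys = [0] * 10 + [1] * 10
mean_acc, half_width = cross_validate(xs, ys, fit_threshold, predict_threshold)
```

The toy data is perfectly separable, so every fold scores the same. On real data a perfect score with a zero-width interval is exactly the kind of result that should prompt a check for target leaks.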
[Figure: full workflow: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split]
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
[Figure: scoring workflow: Access Data, Pre-processing, Feature construction, Model scoring]
Book Recommendation
That's all for our course...
Deriving Knowledge from Data at Scale
Best Practice AA Test
Run AA tests
before
Deriving Knowledge from Data at Scale
Best Practice Ramp-up
Ramp-up
Deriving Knowledge from Data at Scale
Best Practice Run Experiments at 5050
Deriving Knowledge from Data at Scale
Cost based learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution amp Error CostsWEKA cost sensitive learning
weighting method
false negatives FNtry to avoid
false negatives
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Preprocess Classify
metaCostSensitiveClassifier
set the FN to 100 FP to 10
tries to optimize accuracy or error can be cost-sensitivedecision trees rule learner
Deriving Knowledge from Data at Scale
Imbalanced Class DistributionWEKA cost sensitive learning
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
curatedcompletely specify a problem measure progress
paired with a metric target SLAs scoreboard
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Deriving Knowledge from Data at Scale
Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip
1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)
2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)
3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen
Deriving Knowledge from Data at Scale
4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation
5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set
Deriving Knowledge from Data at Scale
6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets
a One set should have the deployed features (computed from the raw data) This provides the production yardstick
b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns
c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set
7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided
Deriving Knowledge from Data at Scale
8 Stale rate optimization monitoring How long does the set stay current In manycases we hide the fact that the problem is a time series even though the goal is topredict the future and we know that the distribution is changing We must quantifyhow much a distribution changes over a fixed period of time There are several waysto mitigate the changing distribution problem
a Assume the distribution is IID Regularly re-compute training sets and Gold sets Determine thefrequency of re-computation or set in place a system to monitor distribution drifts (monitor KPIchanges while the algorithm is kept constant)
b Decompose the model along ldquodistribution (fast) tracking parametersrdquo and slow tracking parametersThe fast tracking model may be a simple calibration with very few parameters
c Recast the problem as a time series problem patterns are (input data from t-T to t-1 prediction attime t) In this space the patterns are much larger but the problem is closer to being IID
9 The gold sets should have information that reveal the stale rate and allows algorithmsto differentiate themselves based on how they degrade with time
Deriving Knowledge from Data at Scale
10 Grouping Should the patterns be grouped For example in handwriting examples aregrouped per writer A set built by shuffling the words is misleading because trainingand testing would have word examples for the same writer which makesgeneralization much easier If the words are grouped per writers then a writer isunlikely to appear in both training and test set which requires the system to generalizeto never seen before handwriting (as opposed to never seen before words) Do wehave these type of constraints Should we group per advertisers campaign users togeneralize across new instances of these entities (as opposed to generalizing to newqueries) ML requires training and testing to be drawn from the same distributionDrawing duplicates is not a problem Problems arise when one partially drawexamples from the same entity on both training and testing on a small set of entitiesThis breaks the IID assumption and makes the generalization on the test set mucheasier than it actually is
11 Sampling production data What strategy is used for sampling Uniform Are any ofthe following filtered out fraud bad configurations duplicates non-billable adultoverwrites etc Guidance use the production sameness principle
Deriving Knowledge from Data at Scale
11 Unlabeled set If the number of labeled examples is small a large data set ofunlabeled data with the same distribution should be collected and be made a goldset This enables the discovery of new features using intermediate classifiers andactive labeling
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender age smoker eye color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
yes
no
no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine LearningLack of Labelled Training Datahellip
What to Do
bull Controlled Experiments ndash get feedback from user to serve as labels
bull Mechanical Turk ndash pay people to label data to build training set
bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)
Deriving Knowledge from Data at Scale
What if you cant get labeled Training Data
Traditional Supervised Learning
bull Promotion on bookseller rsquos web page
bull Customers can rate books
bull Will a new customer like this book
bull Training set observations on previous customers
bull Test set new customers
Whathappensif only few customers rate a book
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Model
Test Data
Prediction
Training Data
Attributes
Target
Label
copy 2013 Datameer Inc All rights reserved
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
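The "cluster, then majority vote" recipe can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's implementation: the function names, the toy points and the naive "first k points" initialization are all assumptions made for the example.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster index per point."""
    centers = list(points[:k])  # naive initialization (assumption): first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class label or None (unlabeled). Every point receives
    the majority label among the labeled points of its cluster."""
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [cluster_label[a] for a in assign]

# Toy data (hypothetical): two well-separated groups, one labeled point in each.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (1.2, 1.8),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5), (8.2, 8.8)]
labels = ['+', None, None, None, '-', None, None, None]
predicted = cluster_then_label(points, labels)
```

The vote only helps when the clustering assumption actually holds; a cluster with no labeled member stays unlabeled (None) in this sketch.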
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
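The two EM steps can be made concrete with a one-dimensional, two-component mixture. The sketch below is an illustration under stated assumptions: the function names, the starting values (means at the data extremes, unit variances) and the toy data are invented for the example.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """EM for a 1-D mixture of two Gaussians.
    Returns (weights, means, variances, responsibilities)."""
    mu = [min(xs), max(xs)]      # crude starting guess (assumption)
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: probability that each component generated each point.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return weight, mu, var, resp

# Toy data (hypothetical): one group near 1, one group near 5.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weight, mu, var, resp = em_two_gaussians(xs)
```

The responsibilities in `resp` are the per-instance cluster assignment probabilities mentioned above. A different starting guess can end in a different local optimum.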
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on labeled instances only
• Apply the model to unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
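The self-training loop above can be sketched as follows. The base learner here is a nearest-centroid classifier, chosen only to keep the example short; the slide does not name a learner, and all function names and data are hypothetical.

```python
def fit_centroids(xs, ys):
    """'Model' = one mean vector per class (nearest-centroid classifier)."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids

def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)                 # learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]   # apply to unlabeled
        if new_pseudo == pseudo:                                # converged: labels stable
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,          # learn on all instances
                              labeled_y + pseudo)
    return model, pseudo

# Toy data (hypothetical): two labeled points, five unlabeled ones.
labeled_x = [(1.0, 1.0), (9.0, 9.0)]
labeled_y = ['+', '-']
unlabeled_x = [(2.0, 1.5), (1.5, 2.5), (8.0, 8.5), (8.5, 7.5), (4.0, 4.0)]
model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
```

Convergence here means the pseudo-labels stop changing between rounds. Early mistakes can reinforce themselves, which is why some variants only add the most confident predictions.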
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance) to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
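A single-machine sketch of the SGD idea follows (no Hadoop). It is a simplified illustration, not the implementation the slide refers to: it uses a linear classifier without a bias term, omits the class-balance constraint, and the loss weights, step-size schedule and function names are assumptions.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, lam_u=0.5, reg=0.01, lr0=0.5, seed=0):
    """One pass over the data in random order.
    labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x."""
    data = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                      # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0:
                continue                        # no side to push towards yet
            # Unlabeled: push the point away from the boundary, on its current side.
            y, scale = (1 if score > 0 else -1), lam_u
        else:
            scale = 1.0
        w = [wi * (1 - lr * reg) for wi in w]   # regularization shrinkage
        if y * score < 1:                       # inside the margin or misclassified
            w = [wi + lr * scale * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam_u=0.5, reg=0.01):
    """Regularizer + hinge loss on labeled + 'hat' loss on unlabeled instances."""
    loss = 0.5 * reg * sum(wi * wi for wi in w)
    loss += sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    loss += lam_u * sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return loss

def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the one with the lowest objective."""
    candidates = [s3vm_sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

# Toy 1-D data (hypothetical): positives to the right of 0, negatives to the left.
labeled = [((2.0,), 1), ((-2.0,), -1)]
unlabeled = [(1.5,), (2.5,), (-1.5,), (-2.5,), (3.0,), (-3.0,)]
w = best_of_runs(labeled, unlabeled)
```

Because the objective is non-convex, different random orders can end in different solutions, which is the reason for keeping the best of several runs.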
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
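A similarity graph of this kind can be built with a short Python function. The Gaussian similarity, the symmetric linking rule and the names below are assumptions for illustration; other similarity scores work the same way.

```python
import math

def similarity(p, q, sigma=1.0):
    """Gaussian (RBF) similarity: 1.0 for identical points, near 0 for distant ones."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2 * sigma ** 2))

def knn_graph(points, k=2):
    """Return {i: {j: similarity}}, linking each point to its k most similar
    points. Edges are stored in both directions, so the graph is symmetric."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: similarity(p, points[j]), reverse=True)
        for j in others[:k]:
            s = similarity(p, points[j])
            graph[i][j] = s
            graph[j][i] = s
    return graph

# Toy data (hypothetical): two small groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
```

This brute-force version compares every pair of points, which is fine for a demonstration but quadratic in the number of instances.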
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
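The iteration can be sketched on a small weighted graph. This is one common variant (labeled nodes are clamped, every other node takes the weighted average of its neighbors); the slide does not say which variant it means, and the names and the chain graph are hypothetical.

```python
def propagate_labels(graph, seeds, iters=200, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}.
    Returns a score per node; the sign of the score is the predicted class."""
    score = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(iters):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:
                new[n] = seeds[n]                      # labeled nodes stay fixed
            else:
                total = sum(nbrs.values())
                new[n] = (sum(w * score[m] for m, w in nbrs.items()) / total
                          if total else 0.0)
        delta = max(abs(new[n] - score[n]) for n in graph)
        score = new
        if delta < tol:                                # repeat until convergence
            break
    return score

# Chain 0 - 1 - 2 - 3 - 4 with a positive seed at one end and a negative at the other.
graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(graph, {0: 1.0, 4: -1.0})
```

On this chain the scores settle on a straight line between the two seeds. The fixed point could also be obtained by solving a linear system directly, which is the "matrix inversion" view.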
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break...
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO (Highest Paid Person's Opinion): stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
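One standard test for "is the difference due to chance?" when the metric is a conversion rate is the two-proportion z-test. The sketch below is an illustration only; the function name and the visitor counts are made up, and the slides do not prescribe a particular test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.
    Returns (z, p_value); a small p_value means chance is an unlikely explanation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value

# Hypothetical counts: 10,000 users in each of control (A) and treatment (B).
z, p_value = two_proportion_z_test(200, 10000, 260, 10000)
```

With these made-up numbers the p-value falls below 0.01, so the difference would usually be called statistically significant. The normal approximation assumes reasonably large samples in both groups.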
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A | B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A | B
Differences:
A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A | B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake.
If something is "amazing", find the flaw!
Examples:
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01.
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut.
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why).
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
"It is difficult to get a man to understand something when his salary depends upon his not understanding it."
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence".
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average.
But... don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
"If you don't know where you are going, any road will take you there."
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version...
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project...
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment, e.g. for feature selection, plus a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Experiment screenshot callouts: Using classification algorithms; Evaluating the model; Splitting into training and testing datasets; Getting data for the experiment
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
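Reporting a metric with an interval, as step 5 asks, can be sketched with plain k-fold cross-validation. Everything below is an illustrative assumption: the helper names, the threshold "model" and the toy data are made up, and the normal-approximation interval is rough when there are only a few folds.

```python
import math
import random

def k_fold_scores(xs, ys, fit, predict, k=5, seed=0):
    """Accuracy on each of k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

def mean_and_interval(scores, z=1.96):
    """Mean score with an approximate 95% interval across folds."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    half = z * math.sqrt(var / len(scores))
    return mean, (mean - half, mean + half)

# Hypothetical model: threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

xs = list(range(0, 10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
scores = k_fold_scores(xs, ys, fit_threshold, predict_threshold)
mean, interval = mean_and_interval(scores)
```

On this trivially separable toy data every fold scores 1.0. A perfect score on real data is exactly the kind of "unbelievably good" result that should trigger a check for target leaks.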
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course...
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
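Item 2 (put the weighting in the metric, not in the sampling) can be illustrated with a small cost-weighted error function. The cost values and names below are hypothetical placeholders, not values from the checklist.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example, with 1 = positive and 0 = negative.
    A false negative costs fn_cost, a false positive costs fp_cost
    (both values are illustrative assumptions)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost      # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost      # false alarm
    return total / len(y_true)

y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0]
cost = weighted_error(y_true, y_pred)   # one false negative + one false positive
```

Because the weights live in the metric function, the same code scores production, training and test sets, and the weighting is documented by the function itself.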
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building blocks" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
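The "one pass to find the most frequent values" idea in item 7 can be sketched with a counter. The function names, the "<OTHER>" bucket and the tiny query log are assumptions for illustration; at the 10M scale mentioned above, the counting would need a distributed or approximate method.

```python
from collections import Counter

def top_n_values(values, n):
    """One pass over the data: keep the n most frequent feature values."""
    return {value for value, _ in Counter(values).most_common(n)}

def encode(value, vocabulary, other="<OTHER>"):
    """Frequent values keep their own identity (their own trainable parameter);
    everything else shares a single bucket."""
    return value if value in vocabulary else other

# Hypothetical query log, taken from a set disjoint from training and test data.
queries = ["shoes", "hats", "shoes", "socks", "shoes", "hats", "scarves"]
vocabulary = top_n_values(queries, 2)
encoded = [encode(q, vocabulary) for q in queries]
```

No labels are involved, which matches the note above. Building the vocabulary on a separate data set keeps this choice from leaking information out of the test set.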
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
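Option 8c, recasting the data as (inputs from t-T to t-1, target at t) patterns, amounts to a sliding window. A minimal sketch follows; the function name and the toy series are hypothetical.

```python
def make_windows(series, T):
    """Turn a time-ordered series into (inputs, target) patterns:
    inputs are the T values from t-T to t-1, target is the value at t."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

daily_clicks = [1, 2, 3, 4, 5]          # hypothetical KPI, oldest first
patterns = make_windows(daily_clicks, T=2)
```

Each pattern is larger than a single observation, as noted above, and consecutive patterns overlap, so train/test splits should still be made along the time axis.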
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
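The grouping concern in item 10 can be handled by splitting on the entity rather than on individual examples. The sketch below uses made-up names and data; `group_of` is whatever function returns the writer, advertiser or user for an example.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting examples: (writer id, word).
examples = [("w1", "cat"), ("w1", "dog"), ("w2", "cat"), ("w3", "hat"),
            ("w3", "dog"), ("w4", "sun"), ("w5", "cat")]
train, test = group_split(examples, group_of=lambda e: e[0])
```

The same word may appear on both sides of the split, which is fine; what must not be shared is the writer.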
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender age smoker eye color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
yes
no
no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine LearningLack of Labelled Training Datahellip
What to Do
bull Controlled Experiments ndash get feedback from user to serve as labels
bull Mechanical Turk ndash pay people to label data to build training set
bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)
Deriving Knowledge from Data at Scale
What if you cant get labeled Training Data
Traditional Supervised Learning
bull Promotion on bookseller rsquos web page
bull Customers can rate books
bull Will a new customer like this book
bull Training set observations on previous customers
bull Test set new customers
Whathappensif only few customers rate a book
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Model
Test Data
Prediction
Training Data
Attributes
Target
Label
copy 2013 Datameer Inc All rights reserved
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
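The four self-training steps can be sketched as a loop. The nearest-centroid "model" below is a stand-in chosen for brevity (any classifier could take its place), and the data and function names are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid model: the mean point of each class."""
    sums = {}
    for p, y in zip(points, labels):
        s, n = sums.get(y, ((0.0,) * len(p), 0))
        sums[y] = (tuple(a + b for a, b in zip(s, p)), n + 1)
    return {y: tuple(a / n for a in s) for y, (s, n) in sums.items()}

def predict(model, p):
    """Label of the closest class centroid."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def self_train(labeled, labels, unlabeled, max_rounds=10):
    model = fit_centroids(labeled, labels)                     # 1. labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]    # 2. apply to unlabeled
        if new_pseudo == pseudo:                               # 4. pseudo-labels are stable
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled + unlabeled, labels + pseudo)  # 3. relearn on all
    return model, pseudo

labeled = [(1.0, 1.0), (4.0, 4.0)]
labels = ["-", "+"]
unlabeled = [(0.5, 1.5), (1.5, 0.5), (2.4, 2.4), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
model, pseudo = self_train(labeled, labels, unlabeled)
```

Here the "+" centroid moves from (4, 4) to (5, 5) once the unlabeled instances are included. A confident but wrong early prediction would be reinforced in the same way, which is the main risk of self-training.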
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances (penalize any instance that falls inside the margin)
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
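One common way to write the semi-supervised SVM objective is shown below. The notation is ours, not the slides': labeled pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ for $i = 1..L$, unlabeled instances $x_j$ for $j = 1..U$, linear score $f(x) = w \cdot x + b$, regularization weight $\lambda$, and $\lambda'$ for the influence of the unlabeled instances.

```latex
\min_{w,\,b}\;
  \frac{\lambda}{2}\lVert w\rVert^{2}
  + \frac{1}{L}\sum_{i=1}^{L}\max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  + \frac{\lambda'}{U}\sum_{j=1}^{U}\max\bigl(0,\; 1 - \lvert f(x_j)\rvert\bigr)
\qquad \text{s.t.}\qquad
  \frac{1}{U}\sum_{j=1}^{U} f(x_j) \;=\; \frac{1}{L}\sum_{i=1}^{L} y_i
```

The first two terms are the ordinary linear SVM. The third term penalizes unlabeled instances that lie inside the margin on either side, and its absolute value is what makes the problem non-convex. The constraint is one commonly used relaxed form of "keep the class balance correct".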
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
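A single-machine sketch of the update rule described above, assuming a hinge loss for labeled instances and a hinge on |f(x)| for unlabeled ones. The step-size schedule, the parameter names (`lam_u`, `l2`, `eta0`), and the toy data are assumptions. The class-balance constraint and the Hadoop mapper/reducer split are left out to keep it short.

```python
import random

def s3vm_sgd(labeled, unlabeled, lam_u=0.5, l2=0.01, eta0=0.1, seed=0):
    """One SGD pass. labeled: [(x, y)] with y in {-1, +1}; unlabeled: [x]."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)                # one run in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data):
        eta = eta0 / (1 + t)                         # steps get smaller over time
        f = sum(wi * xi for wi, xi in zip(w, x)) + b
        scale = 1.0
        if y is None:
            if f == 0.0:
                continue                             # classifier has no opinion yet
            y, scale = (1 if f > 0 else -1), lam_u   # push away from the current side
        w = [(1 - eta * l2) * wi for wi in w]        # L2 shrinkage
        if y * f < 1:                                # inside the margin or misclassified
            w = [wi + eta * scale * y * xi for wi, xi in zip(w, x)]
            b += eta * scale * y
    return w, b

def objective(w, b, labeled, unlabeled, lam_u=0.5, l2=0.01):
    """Regularized hinge loss over labeled and unlabeled instances."""
    f = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
    loss = sum(max(0.0, 1 - y * f(x)) for x, y in labeled)
    loss += lam_u * sum(max(0.0, 1 - abs(f(x))) for x in unlabeled)
    return loss + 0.5 * l2 * sum(wi * wi for wi in w)

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(3.0, 3.0), (2.5, 3.0), (3.0, 2.0), (-3.0, -3.0), (-2.5, -3.0), (-3.0, -2.0)]
# The problem is non-convex, so try several random orders and keep the best run.
runs = [s3vm_sgd(labeled, unlabeled, seed=s) for s in range(10)]
w, b = min(runs, key=lambda wb: objective(*wb, labeled, unlabeled))
```

Keeping the run with the lowest objective mirrors the "many random runs to find the best one" bullet.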
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
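A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch below assumes squared Euclidean distance as the (dis)similarity score and makes the graph symmetric. Both are choices made for the example, not prescribed by the slides.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as {index: sorted neighbour indices}."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist2(p, points[j]))
        for j in others[:k]:
            graph[i].add(j)   # connect i to each of its k nearest neighbours
            graph[j].add(i)   # and add the reverse edge, so the graph is undirected
    return {i: sorted(nbrs) for i, nbrs in graph.items()}

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
graph = knn_graph(points, k=1)
```

The brute-force search is quadratic in the number of instances. At scale one would typically use an approximate nearest-neighbor index instead.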
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
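A minimal sketch of the iteration, assuming an unweighted graph given as adjacency lists and labels encoded as scores in [0, 1]. The function name and the five-node chain are hypothetical.

```python
def label_propagation(neighbors, labels, tol=1e-9, max_iter=10_000):
    """neighbors: {node: [adjacent nodes]}; labels: {node: score} for labeled nodes."""
    f = {n: labels.get(n, 0.5) for n in neighbors}   # unlabeled nodes start undecided
    for _ in range(max_iter):
        new_f = {}
        for n, nbrs in neighbors.items():
            # Labeled nodes stay clamped; the others take the mean of their neighbours.
            new_f[n] = labels[n] if n in labels else sum(f[m] for m in nbrs) / len(nbrs)
        change = max(abs(new_f[n] - f[n]) for n in f)
        f = new_f
        if change < tol:
            break
    return f

# Chain 0 - 1 - 2 - 3 - 4 with one positive and one negative endpoint.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = label_propagation(chain, {0: 1.0, 4: 0.0})
```

On the chain the scores settle at a linear interpolation between the two labeled ends. The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is "equivalent to matrix inversion".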
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
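As an illustration of "confirm differences are not due to chance", here is a sketch of a two-proportion z-test for a conversion-rate metric. The choice of test, the function name, and the counts are assumptions. A real experimentation platform would also deal with multiple metrics, variance estimation, and sample-ratio checks.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: control (A) 200 / 10,000 users, treatment (B) 260 / 10,000 users.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)

# An A/A comparison with identical counts shows no difference at all.
z_aa, p_aa = two_proportion_z_test(200, 10_000, 200, 10_000)
```

A small p-value says the observed difference would be unlikely if the treatment had no effect. Randomized assignment is what lets that difference be attributed to the treatment.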
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes a click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Getting Data For the Experiment
Splitting to Training and Testing Datasets
Using classification algorithms
Evaluating the model
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
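For step 5, a sketch of how cross-validation fold scores can be turned into a metric with an interval. The contiguous fold layout, the normal-approximation interval, and the example scores are assumptions. With only a handful of folds the interval is rough, and fold scores are not fully independent.

```python
import math
import statistics

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def mean_with_ci(scores, z=1.96):
    """Mean of the fold scores with an approximate 95% interval."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

folds = list(k_fold_indices(10, 3))
# Hypothetical per-fold accuracies from some model.
mean, low, high = mean_with_ci([0.80, 0.82, 0.78, 0.81, 0.79])
```

Each instance lands in exactly one test fold, so no instance is ever scored by a model that was trained on it.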
Deriving Knowledge from Data at Scale
Stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Stages: Access Data; Pre-processing; Feature construction; Model scoring
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
Best Practice: Run Experiments at 50/50
Deriving Knowledge from Data at Scale
Cost based learning
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution & Error Costs: WEKA cost-sensitive learning
• Weighting method
• False negatives (FN): try to avoid false negatives
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution: WEKA cost-sensitive learning
• Preprocess, Classify
• meta.CostSensitiveClassifier
• Set the FN to 100, FP to 10
• Tries to optimize accuracy or error; can be cost-sensitive: decision trees, rule learner
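The effect of unequal error costs can be illustrated with the minimum-expected-cost decision rule. This is a generic sketch, not WEKA's implementation. The costs of 100 and 10 are the values as they appear on the slide, and the function name is hypothetical.

```python
def predict_positive(p_positive, cost_fn=100.0, cost_fp=10.0):
    """Predict positive when a miss would be costlier in expectation than a false alarm."""
    expected_cost_if_negative = p_positive * cost_fn         # we would miss a true positive
    expected_cost_if_positive = (1 - p_positive) * cost_fp   # we would raise a false alarm
    return expected_cost_if_positive < expected_cost_if_negative

# The equivalent probability threshold: predict positive when p > FP / (FP + FN).
threshold = 10.0 / (10.0 + 100.0)
```

With a 10:1 cost ratio the threshold drops from 0.5 to about 0.09, so the classifier flags many more instances as positive in order to avoid false negatives.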
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution: WEKA cost-sensitive learning
Deriving Knowledge from Data at Scale
curated; completely specify a problem; measure progress
paired with a metric; target SLAs; scoreboard
Deriving Knowledge from Data at Scale
This isn't easy…
• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long-lasting choices, and the rewards are delayed…
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1GB) to a desktop for high velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
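The "pass over the data to identify the most frequent instances" in item 7 can be sketched as follows. The function names, the `<OTHER>` bucket, and the tiny query list are assumptions for illustration. In line with the checklist, the counting pass should run on a data set that is disjoint from the training and test sets.

```python
from collections import Counter

def frequent_values(values, top_k):
    """One pass over an unlabeled feature column; keep the top_k most frequent values."""
    return {v for v, _ in Counter(values).most_common(top_k)}

def encode(value, vocabulary, other="<OTHER>"):
    """Frequent values keep their own trainable parameter; rare ones share a bucket."""
    return value if value in vocabulary else other

# Hypothetical feature-optimization set (disjoint from training and test data).
queries = ["shoes", "hats", "shoes", "socks", "shoes", "hats", "ties"]
vocabulary = frequent_values(queries, top_k=2)
encoded = [encode(q, vocabulary) for q in ["shoes", "ties", "hats"]]
```

No labels are needed for this step, which is why it can run on a much larger unlabeled sample.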
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
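The grouping question in item 10 comes down to splitting by entity rather than by example. A minimal sketch, assuming each example carries a group key such as a writer id. The function name, the 40% test fraction, and the toy data are hypothetical.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# (writer, word) pairs: several words per writer.
examples = [("w1", "cat"), ("w1", "dog"), ("w2", "cat"), ("w3", "sun"),
            ("w3", "cat"), ("w4", "dog"), ("w5", "sun")]
train, test = group_split(examples, group_of=lambda e: e[0], test_fraction=0.4)
```

Shuffling individual words instead would let the same writer appear on both sides of the split and would overstate how well the system generalizes to new writers.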
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
Training data
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
Train ML Model
Scored instances
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do?
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, ‘hot or not’, review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (attributes: Age, Income; target label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (output of the Model)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
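
A single-machine sketch of the stochastic gradient idea is given below; it is not the Hadoop implementation. Assumptions made for the example: a short warm-up pass over the labeled points so that guesses for unlabeled points are meaningful, a step size of `eta0 / sqrt(t)`, no class-balance constraint, and the hypothetical names `sgd_s3vm`, `lam` and `c_unl`.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, lam=0.01, c_unl=0.5, eta0=0.5, seed=0):
    """labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x."""
    rng = random.Random(seed)
    warmup = list(labeled)
    rng.shuffle(warmup)
    mixed = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(mixed)                       # one run over the data in random order
    w, b = [0.0] * len(labeled[0][0]), 0.0
    for t, (x, y) in enumerate(warmup + mixed, start=1):
        eta = eta0 / math.sqrt(t)            # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            # Unlabeled: use the currently predicted side, with reduced influence.
            y, cost = (1 if score >= 0 else -1), c_unl
        else:
            cost = 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]   # regularisation shrink
        if y * score < 1.0:                  # inside the margin or misclassified
            w = [wi + eta * cost * y * xi for wi, xi in zip(w, x)]
            b += eta * cost * y
    return w, b

rng = random.Random(1)
pos = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(105)]
neg = [(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(105)]
labeled = [(x, 1) for x in pos[:5]] + [(x, -1) for x in neg[:5]]
unlabeled = pos[5:] + neg[5:]
truth = [1] * 100 + [-1] * 100
w, b = sgd_s3vm(labeled, unlabeled)
accuracy = sum((1 if dot(w, x) + b >= 0 else -1) == y
               for x, y in zip(unlabeled, truth)) / len(truth)
print("accuracy on unlabeled points:", accuracy)
```

Because the objective is non-convex, the result depends on the visiting order. The "many random runs" bullet corresponds to calling the function with several `seed` values and keeping the best run.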

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
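
A similarity graph of this kind can be sketched as a k-nearest-neighbor graph with Gaussian similarity weights. The function name `knn_graph`, the Gaussian kernel and the `sigma` parameter are assumptions for illustration; the slides do not specify a similarity function.

```python
import math

def knn_graph(points, k=2, sigma=1.0):
    """Return {node: {neighbor: similarity}} linking each point to its k nearest neighbors."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            sim = math.exp(-d * d / (2 * sigma ** 2))  # similarity score
            graph[i][j] = sim
            graph[j][i] = sim                          # keep the graph symmetric
    return graph

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
print(graph[0])
```

With these six points the graph splits into two connected components, one per group, which is the structure that graph-based methods exploit.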

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
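
The iteration can be sketched on a small weighted graph. This version assumes two classes encoded as +1 and -1 and clamps the known labels on every pass; `propagate_labels` and its parameters are hypothetical names for the example.

```python
def propagate_labels(graph, seeds, n_iter=100, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0} for labeled nodes."""
    score = {v: seeds.get(v, 0.0) for v in graph}
    for _ in range(n_iter):
        new = {}
        for v, nbrs in graph.items():
            if v in seeds:
                new[v] = seeds[v]              # clamp known labels
            else:
                total = sum(nbrs.values())
                # Weighted average of the neighbors' current scores.
                new[v] = sum(w * score[u] for u, w in nbrs.items()) / total if total else 0.0
        delta = max(abs(new[v] - score[v]) for v in graph)
        score = new
        if delta < tol:                        # repeat until convergence
            break
    return score

# A chain 0 - 1 - 2 - 3 - 4 with node 0 labeled +1 and node 4 labeled -1.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
score = propagate_labels(chain, {0: 1.0, 4: -1.0})
print(score)
```

On the chain, the scores settle to a linear interpolation between the two labeled ends. The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.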

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
[Figure: Left Elevator vs. Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
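
One statistical test often used for conversion-style metrics is the two-proportion z-test, sketched below. The function name and the visitor and conversion counts are made up for illustration; the slides do not name a specific test.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control A converts 1000 of 20000, treatment B 1100 of 20000.
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p_value:.3f}")

# An A/A comparison with identical counts shows no difference.
z_aa, p_aa = two_proportion_z_test(1000, 20000, 1000, 20000)
```

A small p-value only says the difference is unlikely to be due to chance. Random assignment of users to A and B is what supports the causal reading.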

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B

• A was 85 better

A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

“It is difficult to get a man to understand something when his salary depends upon his not understanding it.”
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Figure: cultural stages. Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

“If you don’t know where you are going, any road will take you there”
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

[Figure: experiment workflow. Getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model]

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Figure: pipeline stages. Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Figure: pipeline stages. Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
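
Reporting a metric with an interval via cross-validation can be sketched as follows. The toy threshold classifier, the synthetic data and the helper names are assumptions for the example. The normal-approximation interval over fold scores is a rough guide only, since fold scores are not independent.

```python
import math
import random
from statistics import mean, stdev

def k_fold_accuracy(xs, ys, k, fit, predict, seed=0):
    """Return one held-out accuracy per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores

def fit_threshold(xs, ys):
    # Toy model: threshold halfway between the two class means.
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(2, 1) for _ in range(100)]
ys = [0] * 100 + [1] * 100
scores = k_fold_accuracy(xs, ys, k=10, fit=fit_threshold, predict=predict_threshold)
m = mean(scores)
half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
print(f"accuracy = {m:.3f} +/- {half_width:.3f}")
```

A target leak would typically show up here as fold accuracies suspiciously close to 1.0.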

[Figure: pipeline stages. Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2)-(5) until the test metrics are satisfactory.

[Figure: scoring pipeline. Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That’s all for our course…

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
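
Putting the weighting into the metric (item 2) can be sketched as a cost-weighted error. The function name and the default costs of 100 per false negative and 10 per false positive are illustrative assumptions.

```python
def weighted_error(y_true, y_pred, cost_fn=100.0, cost_fp=10.0):
    """Average misclassification cost; labels are 1 (positive) or 0 (negative)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += cost_fn      # missed positive
        elif t == 0 and p == 1:
            total += cost_fp      # false alarm
    return total / len(y_true)

# One false negative and one false positive in four examples.
cost = weighted_error([1, 0, 1, 0], [0, 1, 1, 0])
print(cost)
```

Because the costs live in the metric and not in the sampling, the same function can be applied unchanged to training, test and production data.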

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2 and possibly 3 types of gold sets:
   a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
   b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
   c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
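
The single pass in item 7 can be sketched with a frequency count. The helper names and the `<OTHER>` bucket are assumptions for the example, and the cut-off of 2 stands in for the 10M mentioned above.

```python
from collections import Counter

def top_k_vocabulary(values, k):
    """One pass over a label-free data set: keep the k most frequent values."""
    return {v for v, _ in Counter(values).most_common(k)}

def encode(value, vocab, other="<OTHER>"):
    """Frequent values get their own parameter; everything else shares one bucket."""
    return value if value in vocab else other

# The vocabulary is built on a set disjoint from the training and test sets.
optimization_set = ["a", "b", "a", "c", "a", "b"]
vocab = top_k_vocabulary(optimization_set, k=2)
print(vocab, encode("c", vocab))
```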

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases, we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
   b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
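
One simple way to put a number on how much a distribution changes between two periods is the total variation distance between the empirical distributions of a categorical feature. The choice of this distance and the function name are assumptions; the checklist does not prescribe a drift measure.

```python
from collections import Counter

def total_variation(sample_old, sample_new):
    """0.0 means identical empirical distributions, 1.0 means disjoint support."""
    p, q = Counter(sample_old), Counter(sample_new)
    n_p, n_q = len(sample_old), len(sample_new)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p[key] / n_p - q[key] / n_q) for key in keys)

last_month = ["a", "a", "b", "b"]
this_month = ["a", "b", "b", "b"]
drift = total_variation(last_month, this_month)
print(drift)
```

Tracking such a value at a fixed interval gives an empirical stale rate and a trigger for re-computing the training and gold sets.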

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
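
The grouping constraint in item 10 can be sketched as a split over entities instead of over individual examples. The function name, the record layout and the writer and word values are hypothetical.

```python
import random

def group_train_test_split(records, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups (e.g. writers) to either train or test, never both."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

# Hypothetical handwriting records: (writer, word).
records = [(f"writer{w}", f"word{i}") for w in range(5) for i in range(4)]
train, test = group_train_test_split(records, group_key=lambda r: r[0])
train_writers = {r[0] for r in train}
test_writers = {r[0] for r in test}
print(sorted(test_writers))
```

Shuffling individual records would instead place words from every writer on both sides of the split, which is the leak described above.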

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, ‘hot or not’, review a product
• Observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (output of the model)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved

Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
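
The "cluster, then majority vote" recipe can be sketched with a small k-means. Assumptions for the example: a deterministic farthest-first initialisation, synthetic 2-D data, and the hypothetical names `kmeans` and `cluster_then_label`.

```python
import random
from collections import Counter

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20):
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = []
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign

def cluster_then_label(points, labels, k=2):
    """labels holds a class for the few labeled points and None for the rest."""
    assign = kmeans(points, k)
    vote = {}
    for c in range(k):
        known = [l for l, a in zip(labels, assign) if a == c and l is not None]
        vote[c] = Counter(known).most_common(1)[0][0] if known else None
    return [vote[a] for a in assign]

rng = random.Random(0)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)] + \
         [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
labels = [None] * 60
labels[0] = labels[1] = "+"       # only four instances carry a label
labels[30] = labels[31] = "-"
result = cluster_then_label(points, labels)
print(Counter(result))
```

The approach works here only because the two classes really do form distinct clusters; when the clustering assumption fails, the vote spreads wrong labels across a whole cluster.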

Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters, given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
Deriving Knowledge from Data at Scale
BeyondMixtures of Gaussians
Expectation-Maximization
bull Can be adjusted to all kinds of mixture models
bull Eg use Naive Bayes as mixture model for text classification
Self-Training
bull Learn model on labeled instances only
bull Apply model to unlabeled instances
bull Learn new model on all instances
bull Repeat until convergence
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
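A minimal sketch of the propagation loop, assuming an unweighted neighbor graph and two classes encoded as scores 0.0 and 1.0. The function name, the graph representation, and the clamping of labeled nodes are illustrative choices, not details taken from the slides.

```python
def label_propagation(neighbors, labels, n_iter=200):
    """neighbors: {node: [adjacent nodes]}; labels: {node: 0.0 or 1.0} for labeled nodes.

    Returns a score in [0, 1] per node (closer to 1.0 means class 1)."""
    scores = {n: labels.get(n, 0.5) for n in neighbors}
    for _ in range(n_iter):
        new = {}
        for n, adj in neighbors.items():
            if n in labels:
                new[n] = labels[n]                       # clamp known labels
            else:
                new[n] = sum(scores[m] for m in adj) / len(adj)  # average of neighbors
        scores = new
    return scores
```

The fixed point of this loop satisfies a linear system over the unlabeled nodes, which is why the result can also be obtained by solving that system directly instead of iterating.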
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
[Figure: Left Elevator, Right Elevator]
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
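One common way to check that a difference in conversion rate between A and B is not due to chance is a two-proportion z-test. The sketch below is an illustration only: the slides do not say which test is used, and the function name and arguments are assumptions.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it does not by itself say the difference is large enough to matter.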
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data for the Experiment
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
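Reporting a metric with an interval from cross-validation can be sketched as below. This is a rough illustration under stated assumptions: `fit` and `predict` are hypothetical callables supplied by the caller, accuracy is used as the metric, and the interval is a simple normal approximation over fold scores (fold scores are not independent, so treat it as indicative only).

```python
import random
import statistics

def k_fold_scores(xs, ys, fit, predict, k=5, seed=0):
    """Accuracy on each of k held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k disjoint folds
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean fold score with an approximate 95% interval (normal approximation)."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half, mean + half
```

If every fold scores suspiciously close to perfect on real data, that is the cue to look for a target leak before celebrating.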
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature construction
Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
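To make item 2 concrete, the weighting can live in the metric itself rather than in the sampling. The sketch below assumes binary labels (1 = positive) and illustrative cost values; the function name and the default costs are hypothetical, not taken from the slides.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example, charging false negatives more than false positives."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost          # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost          # false alarm
    return total / len(y_true)
```

Because the costs are arguments of the metric, the same unweighted sample can be scored in production, training, and testing, and the results stay directly comparable.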
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2 and possibly 3 types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases, we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
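The grouping concern in item 10 can be illustrated with a split that assigns whole groups (for example, writers) to either training or test. This is a minimal sketch; `group_split`, `group_of`, and the default test fraction are names and values chosen for the example.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test
```

With `group_of` returning the writer id, the test set only contains handwriting from writers the model never saw during training.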
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do?
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target / Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Model → Prediction
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
… but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
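"Cluster, then perform majority vote" can be sketched in a few lines. The example below is a deliberately small illustration: it assumes one numeric feature, exactly two clusters, and a k-means initialised at the minimum and maximum value; all function names are hypothetical.

```python
from collections import Counter

def two_means_1d(values, n_iter=20):
    """Tiny k-means with k=2 on one feature, initialised at min and max."""
    centers = [min(values), max(values)]
    for _ in range(n_iter):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def cluster_then_label(values, labels):
    """labels[i] is a class label, or None for unlabeled points."""
    centers = two_means_1d(values)
    assign = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values]
    votes = [Counter(), Counter()]
    for cluster, label in zip(assign, labels):
        if label is not None:
            votes[cluster][label] += 1            # labeled points vote in their cluster
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[cluster] for cluster in assign]
```

For instance, with customer ages as the feature and only two rated customers, every unrated customer inherits the majority label of the cluster they fall into.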
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
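The two-step procedure can be sketched for the simplest case: two Gaussians over a single numeric feature. The initialisation (means at the minimum and maximum, shared starting variance) and the variance floor are assumptions made to keep the example short and stable.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, n_iter=50):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    overall = sum(xs) / len(xs)
    var0 = sum((x - overall) ** 2 for x in xs) / len(xs)
    weights, means, variances = [0.5, 0.5], [min(xs), max(xs)], [var0, var0]
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate location and shape of each cluster
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)       # floor to avoid a collapsed cluster
    return weights, means, variances
```

Different starting points can end in different solutions, which is the "local optimum" caveat on the slide.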
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
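The self-training loop above can be sketched with any base learner. Here a nearest-centroid classifier on one numeric feature stands in as the model; that choice, the function names, and the stopping rule (pseudo-labels no longer change) are assumptions for illustration.

```python
def fit_centroids(xs, ys):
    """Base learner: one centroid (mean) per class."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)       # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]   # apply to unlabeled
        if new_pseudo == pseudo:
            break                                     # converged: labels stopped changing
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,
                              labeled_y + pseudo)     # learn new model on all instances
    return model, pseudo
```

Early mistakes can reinforce themselves, so in practice one might only add pseudo-labels the model is confident about; that refinement is omitted here.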
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
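One widely used way to write the semi-supervised SVM objective is shown below. It is given as an assumed formulation for orientation, not necessarily the one behind these slides: $l$ labeled points $(x_i, y_i)$, $u$ unlabeled points $x_j$, and trade-off parameters $C$ and $C^{*}$ (the latter being the "influence of unlabeled instances" knob).

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\,(w^{\top}x_i + b)\bigr)
  \;+\; C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert w^{\top}x_j + b\rvert\bigr)
\qquad \text{s.t.}\quad
  \frac{1}{u}\sum_{j=l+1}^{l+u} \bigl(w^{\top}x_j + b\bigr) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i
```

The third term penalises unlabeled points that fall inside the margin on either side; the absolute value is what makes the problem non-convex. The constraint is one way to keep the predicted class balance close to the balance seen in the labeled data.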
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: Get the data
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
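The slides do not say which statistical test to run. For a conversion-style metric, one common choice is a two-proportion z-test, sketched below under that assumption. The function name and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 users per variant.
z, p = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1200, n_b=10000)
```

A small p-value says the observed difference would be unlikely if the treatment had no effect; it does not by itself say the difference is large enough to matter for the OEC.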
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
• A was 85 better
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Get the data, prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you're the decision maker
It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936; 44(2)
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
causal
If you don't know where you are going, any road will take you there
-- Lewis Carroll
before
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Out of Class Reading
Eight (8) page conference paper
40 page journal version...
Course Project: Due Oct 25th
Open Discussion on Course Project...
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Feature selection, Model training, Model scoring, Evaluation, Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split
6. Iterate steps (2) - (5) until the test metrics are satisfactory
Access Data, Pre-processing, Feature construction, Model scoring
Book Recommendation
That's all for our course...
curated, completely specify a problem, measure progress
paired with a metric, target SLAs, scoreboard
This isn't easy...
• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long-lasting choices, and the rewards are delayed...
enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold
• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity
Building Gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
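As an illustration of item 2, here is a minimal sketch of a metric that carries the error weighting itself, so the data never has to be re-sampled. The function name and the cost values (a false negative counted ten times as heavily as a false positive) are assumptions for the example, not values prescribed by the checklist.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example, with unequal costs for the two error types.

    y_true, y_pred: sequences of 0/1 labels (1 = positive class).
    fn_cost, fp_cost: hypothetical costs of a false negative / false positive.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost  # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost  # false alarm
    return total / len(y_true)


# One false negative and one false positive out of four examples.
cost = weighted_error([1, 1, 0, 0], [0, 1, 1, 0])
```

Because the weights live in the metric code, the same function can score production, training, and test samples and the numbers stay comparable.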
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2 and possibly 3 types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
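The "pass over the data" in item 7 can be sketched as a frequency count that keeps only the most common values and maps everything else to a shared bucket. The function names, the tiny `k`, and the sample values are illustrative assumptions; in practice the count would run over the separate, disjoint data set the item calls for.

```python
from collections import Counter


def top_k_vocabulary(values, k):
    """Map the k most frequent values to indices 0..k-1 (no labels needed)."""
    counts = Counter(values)
    return {value: idx for idx, (value, _) in enumerate(counts.most_common(k))}


def encode(value, vocab):
    """Index of a value, or a shared 'other' bucket for infrequent values."""
    return vocab.get(value, len(vocab))


# Hypothetical feature values drawn from the disjoint feature-optimization set.
observed = ["a", "b", "a", "c", "a", "b"]
vocab = top_k_vocabulary(observed, k=2)
```

Only values in `vocab` would get their own trainable parameter; all others share the single bucket returned by `encode`.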
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and Gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
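The grouping concern in item 10 can be made concrete with a split that assigns whole groups, rather than individual examples, to the training or test side. This is a minimal sketch; the function name, the `group_of` callback, and the writer/word data are hypothetical.

```python
import random


def grouped_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group (e.g. a writer) lands entirely on one side."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test


# Hypothetical handwriting data: (writer, word index) pairs, 3 words per writer.
examples = [(writer, i) for writer in "ABCDE" for i in range(3)]
train, test = grouped_split(examples, group_of=lambda e: e[0])
```

With this split no writer contributes words to both sides, so the test score measures generalization to unseen writers rather than to unseen words from known writers.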
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Greatest Challenge in Machine Learning
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
The greatest challenge in Machine Learning: Lack of Labelled Training Data...
What to Do
• Controlled Experiments - get feedback from user to serve as labels
• Mechanical Turk - pay people to label data to build training set
• Ask Users to Label Data - report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (attributes: Age, Income; target: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (label from the Model)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
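"Cluster, then perform majority vote" can be sketched as follows. The slides name k-Means as one option; this example uses a plain k-means written out by hand, with caller-supplied starting centers so the run is deterministic. All function names and the six sample points are assumptions made for the illustration.

```python
from collections import Counter


def kmeans(points, centers, n_iter=20):
    """Plain k-means on tuples; returns the cluster index of every point."""
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign


def cluster_then_vote(points, labels, initial_centers):
    """Give every point the majority label of the labeled points in its cluster."""
    assign = kmeans(points, list(initial_centers))
    votes = {}
    for cluster, label in zip(assign, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return [votes[c].most_common(1)[0][0] if c in votes else None for c in assign]


# Hypothetical data: two well-separated groups, one labeled point in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_vote(points, labels, initial_centers=[points[0], points[3]])
```

A cluster that contains no labeled point gets `None`, which makes visible the cases where the clustering assumption gives no information.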
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
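The two-step procedure can be illustrated with EM for a two-component, one-dimensional Gaussian mixture. This is a sketch under simplifying assumptions (one dimension, two clusters, caller-supplied starting means, a small variance floor); the function names and data are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(xs, means, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps a cluster from collapsing
    return means, variances, weights, resp


# Hypothetical data: one group near 0 and one near 5.
xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
means, variances, weights, resp = em_two_gaussians(xs, means=[1.0, 4.0])
```

The result depends on the starting means, which is the practical face of "might converge to local optimum"; a common remedy is to run from several starting points and keep the fit with the highest likelihood.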
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
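The four self-training steps can be sketched with any base learner. The example below assumes a nearest-centroid classifier purely because it fits in a few lines; the slides do not prescribe a model, and the function names and data are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        if y is None:
            continue
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict(centroids, p):
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(p, centroids[y])))


def self_train(points, labels, max_rounds=20):
    # Step 1: learn a model on the labeled instances only.
    centroids = fit_centroids(points, labels)
    current = list(labels)
    for _ in range(max_rounds):
        # Step 2: apply the model to the unlabeled instances.
        guessed = [y if y is not None else predict(centroids, p)
                   for p, y in zip(points, labels)]
        if guessed == current:
            break  # Step 4: pseudo-labels stopped changing, so we have converged.
        current = guessed
        # Step 3: learn a new model on all instances.
        centroids = fit_centroids(points, current)
    return centroids, current


# Hypothetical 1-D data with one labeled instance per class.
points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = ["a", None, None, None, None, "b"]
centroids, final_labels = self_train(points, labels)
```

Original labels are never overwritten; only the guesses for unlabeled instances are revised from round to round.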
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
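A single-machine sketch of the stochastic gradient descent pass is shown below. It is an illustration under stated assumptions, not the Hadoop implementation from the slides: the classifier is linear with no bias term (data assumed centered), labeled instances use the ordinary hinge loss, unlabeled instances are pushed away from the boundary on whichever side they currently fall, and the class-balance constraint is left out. The function name, step-size schedule, and parameter values are hypothetical.

```python
import math
import random


def s3vm_sgd(data, dim, unlabeled_weight=0.5, lam=0.01, seed=0):
    """One SGD pass for a linear semi-supervised SVM (illustrative sketch).

    data: list of (x, y), x a tuple of floats, y in {+1, -1} or None if unlabeled.
    """
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)                 # one run over the data in random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(order, start=1):
        eta = 0.5 / math.sqrt(t)       # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi * (1.0 - eta * lam) for wi in w]   # regularization shrink
        if y is None:
            if score == 0.0:
                continue               # no side to push towards yet
            target = 1.0 if score > 0 else -1.0   # side the point currently falls on
            scale = unlabeled_weight   # tunes the influence of unlabeled instances
        else:
            target, scale = float(y), 1.0
        if target * score < 1.0:       # misclassified or inside the margin
            w = [wi + eta * scale * target * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical 1-D data: two labeled points and four unlabeled ones.
data = [((2.0,), 1), ((-2.0,), -1),
        ((2.5,), None), ((1.5,), None), ((-2.5,), None), ((-1.5,), None)]
w = s3vm_sgd(data, dim=1)
```

Because the objective is non-convex, different shuffles can end in different solutions; passing several values of `seed` and keeping the best run mirrors the "many random runs" bullet.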
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
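A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The example assumes squared Euclidean distance as the (dis)similarity score and a brute-force search over all pairs; the function name and sample points are hypothetical, and a large data set would need an approximate or indexed neighbor search instead.

```python
def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph as a dict: node -> set of neighbors."""
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # Distances from point i to every other point, closest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            graph[i].add(j)
            graph[j].add(i)  # add the reverse edge to keep the graph undirected
    return graph


# Hypothetical 1-D points; each node links to its single nearest neighbor.
graph = knn_graph([(0.0,), (1.0,), (2.0,), (10.0,)], k=1)
```

Labels can then be spread along the edges of this graph, which is how the manifold assumption connects to label propagation.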
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
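A small simulation can show how two series end up strongly correlated without either one causing the other. The numbers below are synthetic and are not the Oldenburg data; the shared driver (`years`), the coefficients, and the noise levels are assumptions chosen only for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(0)
years = list(range(30))  # hidden common driver (synthetic)

# Both series grow with the hidden driver; neither depends on the other.
storks = [10 + 2 * t + rng.gauss(0, 1) for t in years]
humans = [500 + 30 * t + rng.gauss(0, 10) for t in years]

r = pearson(storks, humans)
print(f"correlation between storks and humans: {r:.3f}")
```

The correlation comes out close to 1 even though the code never lets one series influence the other. Only a controlled experiment could separate such a shared driver from a causal effect.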
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version...
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project...
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments: to help you get started
Deriving Knowledge from Data at Scale
Experiment Tools: tools that you can use in your experiment, for feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
• Using classification algorithms
• Evaluating the model
• Splitting to Training and Testing Datasets
• Getting Data for the Experiment
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
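As a rough illustration of reporting a metric with an interval rather than a single number, the sketch below runs k-fold cross-validation on synthetic one-dimensional data with a toy threshold classifier. The helper names, the data, and the normal-approximation interval (z = 1.96 over only five folds, which is crude) are all assumptions for this example.

```python
import math
import random
import statistics

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def fit_threshold(xs, ys):
    """Toy classifier: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == bool(y) for x, y in zip(xs, ys)) / len(xs)

def mean_with_interval(scores, z=1.96):
    """Mean and an approximate 95% interval across fold scores."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

# Synthetic data: class 0 centered at 0, class 1 centered at 4.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(4, 1) for _ in range(50)]
ys = [0] * 50 + [1] * 50

scores = []
for train, test in k_fold_splits(len(xs), k=5):
    threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
    scores.append(accuracy(threshold, [xs[i] for i in test], [ys[i] for i in test]))

mean, low, high = mean_with_interval(scores)
print(f"accuracy = {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If the reported accuracy were suspiciously close to 1.0 on a hard problem, that would be a cue to look for a target leak before trusting the number.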
Deriving Knowledge from Data at Scale
Stages: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Stages: Access Data, Pre-processing, Feature construction, Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course...
Deriving Knowledge from Data at Scale
Imbalanced Class Distribution: WEKA cost-sensitive learning
Deriving Knowledge from Data at Scale
curated, completely specify a problem, measure progress
paired with a metric, target SLAs, scoreboard
Deriving Knowledge from Data at Scale
This isn't easy...
• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long lasting choices, and the rewards are delayed...
Deriving Knowledge from Data at Scale
enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold
Deriving Knowledge from Data at Scale
• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
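To make item 2 concrete, here is a minimal sketch of a metric that carries the error weighting itself, so the data never has to be resampled. The function name `weighted_error` and the cost values are assumptions chosen for illustration, not values from the course.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example; false negatives and false positives cost differently."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost  # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost  # false alarm
    return total / len(y_true)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0]

# Two false negatives (cost 10 each) and one false positive (cost 1) over 5 examples.
score = weighted_error(y_true, y_pred)
print(score)
```

Because the costs live in the metric, the same function can be applied unchanged to production, training, and test sets, which keeps their results comparable.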
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
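The 5X rule of thumb in item 4 could be implemented along these lines. Making the sets nested (each smaller set is a prefix of the next) is an assumption of this sketch, as are the function name and the sizes; the checklist only asks that the sets come from the same distribution.

```python
import random

def nested_gold_sets(ids, smallest, ratio=5, seed=0):
    """Return gold sets of increasing size, each `ratio` times larger than the previous."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the sets reproducible
    sets = []
    size = smallest
    while size <= len(shuffled):
        sets.append(shuffled[:size])
        size *= ratio
    return sets

gold_sets = nested_gold_sets(range(1000), smallest=8)
print([len(s) for s in gold_sets])
```

The smallest set is the one meant for quick desktop experiments; larger sets trade velocity for representativeness.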
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for the results to be reproducible. Ideally we would have 2 and possibly 3 types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
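The frequency pass described in item 7 might look like the sketch below, scaled down from 10M to a top-2 vocabulary. The function names and the shared "other" index for rare values are assumptions of this example; the values would come from a data set disjoint from training and test.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to indices 0..top_k-1 (no labels needed)."""
    return {value: i for i, (value, _) in enumerate(Counter(values).most_common(top_k))}

def encode(value, vocab):
    """Frequent values get their own index; everything else shares one 'other' index."""
    return vocab.get(value, len(vocab))

# Stand-in for a feature-optimization set (e.g., query strings or listing ids).
optimization_values = ["a", "b", "a", "c", "a", "b", "d"]
vocab = build_vocabulary(optimization_values, top_k=2)

print(vocab)
print(encode("a", vocab), encode("c", vocab))
```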
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
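For the monitoring option in item 8a, one simple approach is to track a KPI for a frozen model across time periods and flag the periods where it moves too far from the baseline. The function name, the 5% tolerance, and the KPI values below are illustrative assumptions.

```python
def drift_alerts(kpi_by_period, tolerance=0.05):
    """Indices of periods whose KPI deviates from the first period by more than tolerance."""
    baseline = kpi_by_period[0]
    return [
        i for i, kpi in enumerate(kpi_by_period)
        if abs(kpi - baseline) / baseline > tolerance
    ]

# KPI of the same (unchanged) model measured on successive periods (made-up values).
kpis = [0.80, 0.79, 0.78, 0.74, 0.70]
alerts = drift_alerts(kpis)
print(alerts)
```

Since the algorithm is held constant, a sustained alert points at the data distribution, which is the cue to re-compute the training and gold sets.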
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
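For the grouping question in item 10, a split can be made on the grouping entity rather than on individual examples. The sketch below assumes records are `(writer, word)` tuples and uses a hypothetical helper `group_split`; the names and the 20% test fraction are illustrative.

```python
import random

def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split records so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_key(r) for r in records})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test

# Five writers, three handwritten words each (made-up records).
records = [(f"writer{w}", f"word{i}") for w in range(5) for i in range(3)]
train, test = group_split(records, group_key=lambda r: r[0])

train_writers = {r[0] for r in train}
test_writers = {r[0] for r in test}
print(sorted(test_writers), len(train), len(test))
```

No writer appears on both sides, so test performance reflects generalization to unseen handwriting rather than to unseen words from familiar writers.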
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data...
What to Do?
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not?', review a product, observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?

Training Data (Attributes: Age, Income; Target Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (from the Model)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
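The "cluster, then perform majority vote" recipe can be sketched with a tiny k-means in plain Python. Everything here is an illustrative assumption: the toy points, the hand-picked starting centers, and the function names. A real project would more likely use a library implementation of k-means.

```python
import math
from collections import Counter

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def kmeans(points, centers, iterations=20):
    """Plain k-means: assign points to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers

def cluster_then_vote(points, labels, initial_centers):
    """labels maps a few point indices to known labels; all points get a cluster label."""
    centers = kmeans(points, initial_centers)
    votes = [Counter() for _ in centers]
    for index, label in labels.items():
        votes[nearest(points[index], centers)][label] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[nearest(p, centers)] for p in points]

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
labels = {0: "+", 4: "-"}  # only two of the eight points are labeled
predicted = cluster_then_vote(points, labels, initial_centers=[(0, 0), (10, 10)])
print(predicted)
```

Two labels are enough here only because the clusters line up with the classes, which is exactly what the clustering assumption requires.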
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
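The two-step procedure can be sketched for a one-dimensional mixture of two Gaussians. The data, the starting means, and the variance floor are assumptions for this illustration; the result depends on the starting point, which is the local-optimum caveat noted above.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, initial_means, iterations=50):
    """EM for a 1-D mixture of two Gaussians; returns (weights, means, variances)."""
    means = list(initial_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: cluster assignment probabilities (responsibilities) per instance.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, location, and shape of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed cluster
    return weights, means, variances

xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
weights, means, variances = em_two_gaussians(xs, initial_means=[0.0, 6.0])
print(means)
```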
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
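The four self-training steps can be sketched with a deliberately simple base learner. The nearest-centroid classifier on one-dimensional data is a stand-in chosen for this example, and the function names and data are assumptions; any supervised model could take its place.

```python
def fit_centroids(xs, ys):
    """Toy base learner: one centroid (mean) per label."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=10):
    # Step 1: learn a model on the labeled instances only.
    model = fit_centroids(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # Step 2: apply the model to the unlabeled instances.
        new_pseudo = [predict(model, x) for x in unlabeled_x]
        # Step 4: stop once the pseudo-labels no longer change.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # Step 3: learn a new model on all instances.
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model, pseudo

model, pseudo = self_train([1.0, 9.0], ["-", "+"], [2.0, 3.0, 4.0, 6.0, 7.0, 8.0])
print(model, pseudo)
```

Self-training can reinforce its own early mistakes, so in practice the pseudo-labels are often restricted to high-confidence predictions.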
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
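A single-machine sketch of the stochastic gradient idea is shown below. It is a simplification rather than the Hadoop implementation described above: it has no bias term and no class-balance constraint, it warm-starts on the labeled points, and it makes several passes instead of one. The function names, hyperparameters, and data are assumptions for illustration.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd(labeled, unlabeled, epochs=20, lam=0.01, unlabeled_weight=0.5, seed=0):
    """Linear classifier trained with hinge-style updates on labeled and unlabeled data."""
    rng = random.Random(seed)
    w = [0.0] * len(labeled[0][0])
    step = 0

    def update(x, y, weight):
        nonlocal w, step
        step += 1
        lr = 1.0 / (1.0 + lam * step)             # steps get smaller over time
        margin = y * dot(w, x)                    # measured before the update
        w = [wi * (1.0 - lr * lam) for wi in w]   # L2 shrinkage
        if margin < 1.0:                          # misclassified or inside the margin
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]

    # Warm start on labeled data only (an assumption of this sketch).
    for _ in range(epochs):
        for x, y in labeled:
            update(x, y, 1.0)

    # Mixed phase: unlabeled points use the sign of the current prediction as their label.
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            if y is None:
                update(x, 1 if dot(w, x) >= 0 else -1, unlabeled_weight)
            else:
                update(x, y, 1.0)
    return w

labeled = [((-2.0, -2.0), -1), ((2.0, 2.0), 1)]
unlabeled = [(-1.5, -2.5), (-2.5, -1.5), (1.5, 2.5), (2.5, 1.5)]
w = s3vm_sgd(labeled, unlabeled)
predictions = [1 if dot(w, x) >= 0 else -1 for x in unlabeled]
print(w, predictions)
```

Because the objective is non-convex, different random orders can end in different solutions, which is why the slide suggests many random runs and keeping the best one.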
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
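A similarity graph of the kind described above could be built as follows. The Gaussian similarity, the choice of k, and the function name are assumptions of this sketch; other similarity measures would work as well.

```python
import math

def knn_graph(points, k=2, sigma=1.0):
    """Symmetric k-nearest-neighbor graph: {node: {neighbor: similarity}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            similarity = math.exp(-math.dist(p, points[j]) ** 2 / (2 * sigma ** 2))
            graph[i][j] = similarity
            graph[j][i] = similarity  # keep the graph undirected
    return graph

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
print({node: sorted(neighbors) for node, neighbors in graph.items()})
```

With these points, the graph splits into two groups with no edges between them, so information placed on one node can only spread within its own group.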
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
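The iterative version of label propagation can be sketched on a small hand-built graph. Clamping the seed labels on every pass, the fixed iteration count, and the +1/-1 encoding are assumptions of this example.

```python
def propagate_labels(graph, seeds, iterations=100):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1 or -1} for labeled nodes."""
    scores = {node: float(seeds.get(node, 0.0)) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = float(seeds[node])  # labeled nodes stay fixed
            else:
                total = sum(neighbors.values())
                updated[node] = (
                    sum(w * scores[m] for m, w in neighbors.items()) / total
                    if total else 0.0
                )
        scores = updated
    return {node: (1 if s >= 0 else -1) for node, s in scores.items()}

# Two chains, 0-1-2 and 3-4-5, with one labeled node each.
graph = {
    0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0},
    3: {4: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {4: 1.0},
}
result = propagate_labels(graph, seeds={0: 1, 3: -1})
print(result)
```

Each unlabeled node repeatedly takes the weighted average of its neighbors' scores, so a label spreads along the edges until the scores settle.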
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break...
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: GET THE DATA
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
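One common statistical test for comparing conversion rates between control and treatment is a two-proportion z-test. The sketch below is a generic textbook version, not a method taken from the slides, and the conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Made-up counts: 200 of 10,000 users convert in A, 260 of 10,000 in B.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it is the randomized assignment of users, not the test itself, that licenses the causal reading.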
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data for the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks, which typically lead to unbelievably good test metrics. This is the ML-heavy step.
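
Step 5 can be illustrated with a small sketch of k-fold cross-validation that reports a metric together with an interval. Everything here is an assumption made for illustration: the toy one-dimensional threshold classifier, the synthetic data, and the normal-approximation interval (with only five folds, a t-quantile would give a wider and more honest interval).

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def cross_validate(xs, ys, k=5, seed=0):
    """Per-fold accuracy; each fold is scored by a model that never saw it."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        t = train_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > t) == (ys[i] == 1) for i in held_out)
        scores.append(correct / len(held_out))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean score with a rough normal-approximation interval across folds."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return m, m - half, m + half

# Synthetic data (assumed): class 1 is centred at 2.0, class 0 at 0.0.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_validate(xs, ys)
mean_acc, lo, hi = mean_with_interval(scores)
print(f"accuracy {mean_acc:.3f} (approx. 95% interval {lo:.3f} to {hi:.3f})")
```

If a feature computed from the target slipped into `xs`, the same loop would report near-perfect accuracy with a very tight interval, which is the "unbelievably good" pattern that should trigger a search for a target leak.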

Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2)–(5) until the test metrics are satisfactory.

Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring

Book Recommendation

That's all for our course…

curated; completely specify a problem; measure progress
paired with a metric, target SLAs, scoreboard

This isn't easy…
• Building high quality gold sets is a challenge
• It is time consuming
• It requires making difficult and long lasting choices, and the rewards are delayed…

Enforce a few principles
1. Distribution parity
2. Testing blindness
3. Production parity
4. Single metric
5. Reproducibility
6. Experimentation velocity
7. Data is gold

• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
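
The weighting point in item 2 can be made concrete with a metric that carries the error costs itself, so the same function is applied unchanged to train, test and production samples. This is a minimal sketch; the cost table (a false negative costing ten times a false positive) and the toy labels are assumptions for illustration only.

```python
def weighted_error(y_true, y_pred, cost):
    """Cost-weighted error: cost[(true, predicted)] per mistake, averaged over all examples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t != p:
            total += cost[(t, p)]
    return total / len(y_true)

# Assumed costs: missing a positive (1 predicted as 0) is 10x worse than a false alarm.
COST = {(1, 0): 10.0, (0, 1): 1.0}

y_true = [1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1]
error = weighted_error(y_true, y_pred, COST)
print(error)
```

Because the weights live in `COST` rather than in a resampled data set, the weighting is documented in one place and the data keeps its natural distribution.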

4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
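
One way to get the 5X size ladder from item 4 is to shuffle the examples once and take prefixes of increasing length, so every size is a uniform sample of the same distribution. Nesting the sets inside each other is an assumption of this sketch, not something the checklist requires, and `nested_gold_sets` is a hypothetical helper name.

```python
import random

def nested_gold_sets(example_ids, smallest, ratio=5, seed=0):
    """Shuffle once, then return prefixes of size smallest, smallest*ratio, ... up to the full set."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)
    sets, size = [], smallest
    while size <= len(ids):
        sets.append(ids[:size])
        size *= ratio
    return sets

gold_sets = nested_gold_sets(range(1000), smallest=8)
print([len(s) for s in gold_sets])
```

The smallest set is the one meant for desktop-speed iteration; the larger ones trade velocity for representativeness.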

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
   a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
   b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
   c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
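
The frequency pass described in item 7 can be sketched as follows. The vocabulary is built from a separate, unlabeled feature-optimization set, and every value outside the top k shares one "other" bucket. The tiny query list and `top_k=2` stand in for the 10M cutoff and are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to ids 0..top_k-1 (no labels needed)."""
    counts = Counter(values)
    return {v: i for i, (v, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocab):
    """Return the value's id, or a shared 'other' id for anything outside the vocabulary."""
    return vocab.get(value, len(vocab))

# Feature-optimization set: disjoint from the training and test sets.
feature_opt_set = ["q1", "q2", "q1", "q3", "q1", "q2", "q4"]
vocab = build_vocabulary(feature_opt_set, top_k=2)

print(vocab)
print(encode("q1", vocab), encode("q3", vocab))
```

Building `vocab` from the training or test set instead would let information about those sets leak into the features, which is why the checklist asks for a disjoint set.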

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
   b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
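
Option 8c, recasting the problem as a time series, amounts to building one pattern per time step from a sliding window. A minimal sketch, assuming a single numeric series and a hypothetical window length `T`:

```python
def sliding_windows(series, T):
    """One pattern per time t: inputs are series[t-T .. t-1], the target is series[t]."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

daily_kpi = [10, 12, 11, 13, 15, 14]   # assumed toy series
patterns = sliding_windows(daily_kpi, T=3)

for inputs, target in patterns:
    print(inputs, "->", target)
```

Each pattern is larger than a single observation, as the slide notes, and a train/test split on these patterns should respect time order (earlier windows for training, later ones for testing) so that the stale rate stays visible.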

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
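
The grouping question in item 10 can be sketched as a split over entities rather than over individual examples. The writer ids and words below are invented for illustration, and `grouped_split` is a hypothetical helper.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so no group appears on both sides."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Assumed toy data: (writer id, handwritten word).
words = [("writer_a", "hello"), ("writer_a", "world"), ("writer_b", "hello"),
         ("writer_c", "data"), ("writer_d", "gold"), ("writer_d", "set")]

train, test = grouped_split(words, group_of=lambda e: e[0])
print({w for w, _ in train}, {w for w, _ in test})
```

The same word may still appear on both sides; what the split prevents is the same writer appearing on both sides, which is the case that makes test results look better than they are.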

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data (Label unknown)
Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
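
The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is a minimal illustration under assumptions: one-dimensional data, plain k-means with two hand-picked starting centers, and invented ages with only two labeled points.

```python
from collections import Counter

def assign(p, centers):
    """Index of the nearest center."""
    return min(range(len(centers)), key=lambda c: abs(p - centers[c]))

def kmeans_1d(points, centers, iterations=20):
    """Plain k-means on 1-D points, starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[assign(p, centers)].append(p)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def cluster_then_vote(labeled, unlabeled, initial_centers):
    """Cluster all points, label each cluster by majority vote, then label the unlabeled points."""
    points = [x for x, _ in labeled] + list(unlabeled)
    centers = kmeans_1d(points, list(initial_centers))
    votes = [Counter() for _ in centers]
    for x, y in labeled:
        votes[assign(x, centers)][y] += 1
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[assign(x, centers)] for x in unlabeled]

labeled = [(24, "+"), (65, "-")]                  # assumed: only two rated customers
unlabeled = [20, 26, 33, 35, 52, 60, 47, 22]      # assumed: ages without ratings
predictions = cluster_then_vote(labeled, unlabeled, initial_centers=(20, 65))
print(predictions)
```

The unlabeled points do real work here: they determine where the clusters sit, and the few labels only name the clusters. A cluster that contains no labeled point gets `None`, which is a signal that more labels are needed there.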

Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
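
The two steps of Expectation-Maximization can be shown on a one-dimensional mixture of two Gaussians. The data, the starting parameters and the small variance floor are assumptions of this sketch; it is not meant as a production implementation.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, iterations=50):
    """EM for a 1-D Gaussian mixture; returns parameters and per-point assignment probabilities."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, resp = len(means), []
    for _ in range(iterations):
        # E-step: probability that each component generated each point.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate location, shape and weight of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor keeps a component from collapsing
    return means, variances, weights, resp

xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]   # assumed toy data
means, variances, weights, resp = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, weights)
```

A different starting point can lead to a different answer, which is the local-optimum caveat on the slide; running from several initializations and keeping the most likely fit is a common mitigation.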

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
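
The four self-training steps map directly onto a short loop. The base learner below is a nearest-centroid classifier on one-dimensional data, chosen only to keep the sketch self-contained; any supervised learner could take its place, and the data is invented.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid model: one mean per class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict(model, x):
    return min(model, key=lambda c: abs(x - model[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)                  # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]    # apply model to unlabeled instances
        if new_pseudo == pseudo:                                 # labels stopped changing: converged
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,           # learn new model on all instances
                              labeled_y + pseudo)
    return model, pseudo

labeled_x, labeled_y = [1.0, 9.0], ["a", "b"]      # assumed toy data
unlabeled_x = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
model, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
print(model, pseudo)
```

Self-training can reinforce its own early mistakes, so in practice it is common to add only the most confident pseudo-labels in each round; that refinement is left out here.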

The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
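
The stochastic gradient descent procedure above can be sketched in a single process (no Hadoop). This is a simplified illustration under stated assumptions: two-dimensional inputs, labels in {-1, +1}, no regularization term and no class-balance constraint, and unlabeled points treated as belonging to whichever side of the boundary they currently fall on. The function names and the data are hypothetical.

```python
import random

def score(w, b, x):
    return w[0] * x[0] + w[1] * x[1] + b

def objective(w, b, labeled, unlabeled, unlabeled_weight):
    """Hinge loss on labeled points plus a weighted penalty for unlabeled points inside the margin."""
    loss = sum(max(0.0, 1 - y * score(w, b, x)) for x, y in labeled)
    loss += unlabeled_weight * sum(max(0.0, 1 - abs(score(w, b, x))) for x in unlabeled)
    return loss

def sgd_run(labeled, unlabeled, unlabeled_weight, seed, step0=0.5):
    """One run over the data in random order, with steps that shrink over time."""
    data = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)
    w, b = [0.0, 0.0], 0.0
    for t, (x, y) in enumerate(data, start=1):
        step = step0 / t ** 0.5
        s = score(w, b, x)
        scale = 1.0
        if y is None:
            # Unlabeled: use the side the point currently falls on, with reduced influence.
            y, scale = (1 if s >= 0 else -1), unlabeled_weight
        if y * s < 1:   # misclassified or inside the margin: move the classifier a bit
            w = [wi + step * scale * y * xi for wi, xi in zip(w, x)]
            b += step * scale * y
    return w, b

def best_of_runs(labeled, unlabeled, unlabeled_weight=0.5, runs=20):
    """Many random runs; keep the one with the lowest objective (the problem is non-convex)."""
    results = [sgd_run(labeled, unlabeled, unlabeled_weight, seed) for seed in range(runs)]
    return min(results, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled, unlabeled_weight))

labeled = [((0.0, 0.0), -1), ((4.0, 4.0), 1)]                     # assumed toy data
unlabeled = [(0.5, 0.2), (0.1, 0.8), (3.6, 4.1), (4.2, 3.5)]
w, b = best_of_runs(labeled, unlabeled)
print(w, b)
```

`unlabeled_weight` plays the role of the slide's parameter for tuning the influence of unlabeled instances. In the Hadoop version described above, the shuffle would correspond to the mapper and the update loop to the reducer.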

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
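
Label propagation on a similarity graph can be sketched as repeated neighbor averaging with the labeled nodes held fixed. The graph, the use of scores in [-1, +1] for the two classes, and the unweighted averaging are assumptions of this illustration; every node is assumed to have at least one neighbor.

```python
def propagate_labels(neighbors, seeds, iterations=100):
    """neighbors: node -> list of neighbor nodes; seeds: labeled node -> fixed score (+1 or -1)."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        updated = {}
        for n, nbrs in neighbors.items():
            if n in seeds:
                updated[n] = seeds[n]                                  # labeled nodes stay clamped
            else:
                updated[n] = sum(scores[m] for m in nbrs) / len(nbrs)  # average of the neighbors
        scores = updated
    return scores

# Assumed toy graph: a chain a - b - c - d - e with the two ends labeled.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = propagate_labels(chain, seeds={"a": 1.0, "e": -1.0})
print(scores)
```

The sign of each score gives the predicted class, and a score near zero (node `c` here) marks an instance the graph cannot decide. The values the loop converges to are the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.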

Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator / Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
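
As one example of the statistical test the first bullet calls for, a two-proportion z-test compares conversion rates between control (A) and treatment (B). The counts below are invented for illustration, and the test assumes independent users and samples large enough for the normal approximation.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability of a standard normal
    return z, p_value

# Assumed counts: 5.0% vs 5.6% conversion with 10,000 users in each variant.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these numbers a 12% relative lift still gives a p-value slightly above the conventional 0.05 threshold, so the difference could plausibly be due to chance. Getting the data means running long enough, or on enough users, to tell.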

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B

• A was 85 better

A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime were effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
curatedcompletely specify a problem measure progress
paired with a metric target SLAs scoreboard
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Building gold sets is hard work. Many common and avoidable mistakes are made, which suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work...
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
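A minimal sketch of the weighting idea in item 2: the error costs live in the metric itself, so the same function can score production, train, and test results without resampling. The function name `weighted_error` and the cost values (a false negative costing 10 times a false positive) are assumptions for illustration only.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average misclassification cost per example (costs are illustrative)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:      # false negative
            total += fn_cost
        elif truth == 0 and pred == 1:    # false positive
            total += fp_cost
    return total / len(y_true)


y_true = [1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0]   # one false negative, one false positive
score = weighted_error(y_true, y_pred)
print(score)
```

Because the weights are arguments of the metric, they are documented in one place and any change to them is visible in code review.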
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
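The frequency pass described in item 7 can be sketched as follows. This is an illustration under assumptions: `k=2` stands in for the 10M of the slide, and the names `top_k_values`, `encode`, and the `<OTHER>` bucket are hypothetical. Note that no labels are used, and the list passed in plays the role of the separate feature optimization set.

```python
from collections import Counter


def top_k_values(values, k):
    """One pass over unlabeled data: keep the k most frequent feature values."""
    counts = Counter(values)
    return {value for value, _ in counts.most_common(k)}


def encode(value, vocabulary, other="<OTHER>"):
    """Frequent values keep their own parameter; the rest share one bucket."""
    return value if value in vocabulary else other


# Stand-in for a feature optimization set, disjoint from train and test.
optimization_set = ["10.0.0.1", "10.0.0.2", "10.0.0.1",
                    "10.0.0.3", "10.0.0.1", "10.0.0.2"]
vocab = top_k_values(optimization_set, k=2)
print(encode("10.0.0.1", vocab), encode("10.0.0.3", vocab))
```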
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
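Option (c) of item 8 can be illustrated with a small windowing helper. This is a sketch, not a prescribed implementation: the function name `make_windows` and the toy `daily_clicks` series are assumptions. Each pattern pairs the inputs from t-T to t-1 with the value to predict at time t.

```python
def make_windows(series, T):
    """Build (inputs, target) patterns: inputs = series[t-T:t], target = series[t]."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]


daily_clicks = [10, 12, 11, 15, 14, 18]   # toy time series
patterns = make_windows(daily_clicks, T=3)
for inputs, target in patterns:
    print(inputs, "->", target)
```

Each pattern is T times larger than a single observation, which matches the remark that patterns get much larger in this formulation.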
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user, to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
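The grouping question in item 10 can be made concrete with a split that assigns whole groups to either side. A minimal sketch, assuming a hypothetical helper `group_split` and toy (writer, word) pairs; the 25% test fraction and the seed are arbitrary choices.

```python
import random


def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group (e.g. a writer) lands entirely in train or test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test


words = [("writer_a", "hello"), ("writer_a", "world"),
         ("writer_b", "hello"), ("writer_b", "data"),
         ("writer_c", "scale"), ("writer_d", "gold")]
train, test = group_split(words, group_of=lambda e: e[0])
print(len(train), len(test))
```

The same word ("hello") may appear on both sides, which is fine; what must not happen is the same writer appearing on both sides.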
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Greatest Challenge in Machine Learning

Training data:
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

Train ML Model

New data:
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
The greatest challenge in Machine Learning: Lack of Labelled Training Data...
What to Do?
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)
What if you can't get labeled Training Data?

Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?

Training Data (Attributes: Age, Income; Target Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to the Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
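The "cluster, then perform majority vote" recipe can be sketched with a tiny k-means. This is an illustration under assumptions: the toy 2-D points, the choice of the two labeled points as initial centers, and the helper names are not from the slides. Unlabeled instances carry the label `None`.

```python
from collections import Counter


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    return min(range(len(centers)), key=lambda c: dist2(p, centers[c]))


def kmeans(points, centers, iterations=10):
    """Plain k-means from given initial centers; returns a cluster index per point."""
    centers = list(centers)
    for _ in range(iterations):
        assign = [nearest(p, centers) for p in points]
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return [nearest(p, centers) for p in points]


def cluster_then_label(points, labels, initial_centers):
    """Cluster all points, then give each cluster the majority label of its labeled members."""
    assign = kmeans(points, initial_centers)
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]   # only two instances are labeled
predicted = cluster_then_label(points, labels, [points[0], points[3]])
print(predicted)
```

A cluster with no labeled member gets `None`, which makes visible where the clustering assumption gives no information.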
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
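The two-step procedure can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration, assuming a crude initialisation (smallest and largest value as means, unit variances) and toy data; a different initialisation may end in a different local optimum.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, iterations=50):
    """EM for a 1-D mixture of two Gaussians (illustrative initialisation)."""
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, location, and shape of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # guard against a collapsing cluster
    return weights, means, variances, resp


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = em_two_gaussians(data)
print(means, weights)
```

The returned `resp` holds the assignment probabilities from the final E-step; in a semi-supervised setting, labeled instances would have these probabilities fixed rather than estimated.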
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
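The four self-training steps can be sketched with a deliberately simple base model. The nearest-centroid classifier, the toy points, and all helper names are assumptions made for illustration; any classifier could take its place.

```python
def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))


def fit(points, labels):
    """Nearest-centroid 'model': one mean point per class."""
    return {c: centroid([p for p, y in zip(points, labels) if y == c])
            for c in sorted(set(labels))}


def predict(model, p):
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, model[c])))


def self_train(labeled, labels, unlabeled, max_rounds=10):
    model = fit(labeled, labels)                        # learn on labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, p) for p in unlabeled]   # apply to unlabeled
        if new_guesses == guesses:                      # converged: guesses stopped changing
            break
        guesses = new_guesses
        model = fit(labeled + unlabeled, labels + guesses)     # learn on all instances
    return model, guesses


labeled = [(1.0, 1.0), (5.0, 5.0)]
labels = ["+", "-"]
unlabeled = [(1.5, 1.2), (0.8, 1.4), (4.5, 5.2), (5.5, 4.8), (2.6, 2.4)]
model, guesses = self_train(labeled, labels, unlabeled)
print(guesses)
```

Self-training can reinforce its own early mistakes, so in practice one often adds only the most confident guesses in each round; that refinement is left out here.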
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin (distance to the decision boundary) for labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
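A minimal sketch of the iteration on a tiny similarity graph. The graph (a chain a-b-c-d), the +1/-1 score encoding, and the function name are assumptions for illustration. Labeled nodes are clamped to their score; every other node repeatedly takes the average of its neighbors.

```python
def label_propagation(neighbors, seeds, iterations=100, tol=1e-6):
    """neighbors: node -> list of adjacent nodes; seeds: labeled node -> +1.0 or -1.0."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iterations):
        new = {}
        for n, nbrs in neighbors.items():
            if n in seeds:
                new[n] = seeds[n]                                   # keep known labels fixed
            else:
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs)   # average of neighbors
        change = max(abs(new[n] - scores[n]) for n in neighbors)
        scores = new
        if change < tol:                                            # converged
            break
    return scores


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = label_propagation(graph, seeds={"a": 1.0, "d": -1.0})
labels = {n: "+" if s > 0 else "-" for n, s in scores.items()}
print(scores, labels)
```

The fixed point satisfies a linear system (each unlabeled score equals the mean of its neighbors), which is why the slide notes that the procedure is equivalent to a matrix inversion: here b = 1/3 and c = -1/3.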
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break...

Controlled Experiments
• A
• B

OEC: Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
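One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion rates. The sketch below is an illustration only: the slides do not say which test to use, and the counts are made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: control A converts 200 of 10,000; treatment B converts 250 of 10,000.
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; the causal reading still rests on users having been assigned to A and B at random.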
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A | B

• A was 85 better

A | B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A | B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data, prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw!
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Out of Class Reading
Eight (8) page conference paper
40 page journal version...

Course Project: Due Oct 25th

Open Discussion on Course Project...
Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

[Experiment diagram labels: Getting Data for the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model]
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Customer Churn Model
Deployed web service endpoints that can be consumed by applications and for batch processing
[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric, optionally with constraints if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem: classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
[Workflow diagram: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
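A minimal sketch of reporting a cross-validated metric with an interval rather than a single number. The toy threshold model, the synthetic data, and the helper names are assumptions for illustration. The interval is a rough normal approximation across folds; fold scores are not independent, so it should be read as indicative only.

```python
import math
import random
import statistics


def k_fold_scores(examples, train_fn, score_fn, k=5, seed=0):
    """Shuffle once, split into k folds, and score a model trained on the other k-1 folds."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        scores.append(score_fn(train_fn(train), folds[i]))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean score with an approximate 95% interval based on the fold-to-fold spread."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def train_threshold(train):
    """Toy model: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, test):
    return sum(1 for x, y in test if (x > threshold) == (y == 1)) / len(test)


examples = [(x, int(x > 50)) for x in range(0, 100, 2)]   # synthetic, cleanly separable
scores = k_fold_scores(examples, train_threshold, accuracy)
mean, low, high = mean_with_interval(scores)
print(f"accuracy = {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

A target leak would show up here as near-perfect scores on every fold; on real data that pattern is a reason to inspect the features, not to celebrate.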
[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2)-(5) until the test metrics are satisfactory.

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That's all for our course...
Deriving Knowledge from Data at Scale
This isnrsquot easyhellip
bull Building high quality gold sets is a challenge
bull It is time consuming
bull It requires making difficult and long lasting
choices and the rewards are delayedhellip
Deriving Knowledge from Data at Scale
enforce a few principles
1 Distribution parity
2 Testing blindness
3 Production parity
4 Single metric
5 Reproducibility
6 Experimentation velocity
7 Data is gold
Deriving Knowledge from Data at Scale
bull Test set blindness
bull Reproducibility and Data is gold
bull Experimentation velocity
Deriving Knowledge from Data at Scale
Building Gold sets is hard work Many common and avoidable mistakes aremade This suggests having a checklist Some questions will be trivial toanswer or not applicable some will require workhellip
1 Metrics For each gold set chose one (1) metric Having two metrics on the samegold set is a problem (you canrsquot optimize both at once)
2 WeightingSlicing Not all errors are equal This should be reflected in the metric notthrough sampling manipulation Having the weighting in the metric has twoadvantages 1) it is explicitly documented and reproducible in the form of a metricalgorithm and 2) production train and test sets results remain directly comparable(automatic testing)
3 Yardstick(s) Define algorithms and configuration parameters for public yardstick(s)There could be more than one yardstick A simple yardstick is useful for ramping upOnce one can reproduceunderstand the simple yardstickrsquos result it becomes easierto improve on the latest ldquoproductionrdquo yardstick Ideally yardsticks come withdownloadable code The yardsticks provide a set of errors that suggests whereinnovation should happen
Deriving Knowledge from Data at Scale
4 Sizes and access What are the set sizes Each size corresponds to an innovationvelocity and a level of representativeness A good rule of thumb is 5X size ratiosbetween gold sets drawn from the same distribution Where should the data live Ifon a server some services are needed for access and simple manipulations Thereshould always be a size that is downloadable (lt 1GB) to a desktop for high velocityinnovation
5 Documentation and format Create a formatAPI for the data Is the datacompressed Provide sample code to load the data Document the format Assignsomeone to be the curator of the gold set
Deriving Knowledge from Data at Scale
6 Features What (gold) features go in the gold sets Features must be pickled for result to be reproducible Ideally we would have 2 and possibly 3 types of gold sets
a One set should have the deployed features (computed from the raw data) This provides the production yardstick
b One set should be Raw (eg contains all information possibly through tables) This allows contributors to create features from the raw data to investigate its potential compared to existing features This set has more information per pattern and a smaller number of patterns
c One set should have an extended number of features The additional features may be ldquobuilding blocksrdquo features that are scheduled to be deployed next or high potential features Moving some features to a gold set is convenient if multiple people are working on the next generation Not all features are worth being in a gold set
7 Feature optimization sets Does the data require feature optimization For instance an IP address a query or a listing id may be features But only the most frequent 10M instances are worth having specific trainable parameters A pass over the data can identify the top 10M instance This is a form of feature optimization Identifying these features does not require labels If a form of feature optimization is done a separate data set (disjoint from the training and test set) must be provided
Deriving Knowledge from Data at Scale
8 Stale rate optimization monitoring How long does the set stay current In manycases we hide the fact that the problem is a time series even though the goal is topredict the future and we know that the distribution is changing We must quantifyhow much a distribution changes over a fixed period of time There are several waysto mitigate the changing distribution problem
a Assume the distribution is IID Regularly re-compute training sets and Gold sets Determine thefrequency of re-computation or set in place a system to monitor distribution drifts (monitor KPIchanges while the algorithm is kept constant)
b Decompose the model along ldquodistribution (fast) tracking parametersrdquo and slow tracking parametersThe fast tracking model may be a simple calibration with very few parameters
c Recast the problem as a time series problem patterns are (input data from t-T to t-1 prediction attime t) In this space the patterns are much larger but the problem is closer to being IID
9 The gold sets should have information that reveal the stale rate and allows algorithmsto differentiate themselves based on how they degrade with time
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
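
The grouping question in item 10 can be sketched as a split over entities rather than over individual examples. This is a minimal sketch under assumed names (`grouped_split`, the `(writer_id, word)` tuples, the 25% test fraction); it only shows the idea that a writer lands entirely in training or entirely in test.

```python
import random

def grouped_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group (e.g. a writer) is entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical (writer_id, word) handwriting examples.
examples = [("w1", "cat"), ("w1", "dog"), ("w2", "cat"),
            ("w3", "sun"), ("w3", "dog"), ("w4", "cat")]
train, test = grouped_split(examples, group_of=lambda e: e[0])
```

Shuffling individual words instead would put the same writer on both sides and make the test metric look better than it is.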

Greatest Challenge in Machine Learning

gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)

What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?

Training Data (Attributes: Age, Income; Target / Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (from Model)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
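
The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is an illustrative sketch only: the k-means initialization (first k points), the unscaled toy (age, income in K) data, and all function names are assumptions for the example, not something the slides specify.

```python
from collections import Counter

def kmeans(points, k, iterations=20):
    """Plain k-means; initial centers are the first k points (a simplification)."""
    centers = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment

def cluster_then_label(points, labels, k=2):
    """labels[i] is a class or None; each cluster takes the majority label of its labeled members."""
    assignment = kmeans(points, k)
    predicted = list(labels)
    for c in range(k):
        votes = Counter(l for l, a in zip(labels, assignment) if a == c and l is not None)
        if votes:
            majority = votes.most_common(1)[0][0]
            for i, a in enumerate(assignment):
                if a == c and predicted[i] is None:
                    predicted[i] = majority
    return predicted

# Hypothetical data: two labeled customers, four unlabeled.
points = [(22, 50), (60, 90), (24, 55), (26, 48), (62, 85), (58, 95)]
labels = ["+", "-", None, None, None, None]
predicted = cluster_then_label(points, labels)
```

A cluster with no labeled member stays unlabeled in this sketch, which is one place where the clustering assumption can fail in practice.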

Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
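
To make the two EM steps concrete, here is a small sketch for a one-dimensional mixture of two Gaussians in which labeled instances keep a fixed cluster assignment and unlabeled instances get soft assignments. The initialization, the variance floor, the toy data, and the requirement that each class has at least one labeled instance are assumptions of this example.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, labels, iterations=50):
    """labels[i] is 0, 1, or None. Returns (means, variances, weights, responsibilities)."""
    means = [min(xs), max(xs)]          # crude initialization
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: assignment probabilities; labeled instances are clamped to their class.
        resp = []
        for x, y in zip(xs, labels):
            if y is not None:
                r = [1.0 if c == y else 0.0 for c in range(2)]
            else:
                dens = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(2)]
                total = sum(dens) or 1e-300
                r = [d / total for d in dens]
            resp.append(r)
        # M-step: re-estimate location, shape, and weight of each cluster.
        for c in range(2):
            mass = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / mass
            variances[c] = max(sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / mass, 1e-6)
            weights[c] = mass / len(xs)
    return means, variances, weights, resp

# Hypothetical data: one labeled instance per class, the rest unlabeled.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
labels = [0, None, None, None, 1, None, None, None]
means, variances, weights, resp = em_two_gaussians(xs, labels)
```

With well-separated data the result does not depend much on the starting point; with overlapping clusters, different initializations can end in different local optima.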

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
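
The four self-training steps above can be sketched with any base learner. A nearest-centroid classifier is swapped in here purely to keep the example short; the function names and the toy points are assumptions for illustration.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    centroids = {}
    for cls in set(labels):
        members = [p for p, l in zip(points, labels) if l == cls]
        centroids[cls] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

def predict(centroids, point):
    return min(centroids, key=lambda cls: sum((a - b) ** 2 for a, b in zip(point, centroids[cls])))

def self_train(labeled, labels, unlabeled, max_rounds=10):
    # Step 1: learn a model on the labeled instances only.
    centroids = fit_centroids(labeled, labels)
    pseudo = None
    for _ in range(max_rounds):
        # Step 2: apply the model to the unlabeled instances.
        new_pseudo = [predict(centroids, p) for p in unlabeled]
        # Step 4: stop once the pseudo-labels no longer change.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # Step 3: learn a new model on all instances.
        centroids = fit_centroids(labeled + unlabeled, labels + pseudo)
    return centroids, pseudo

# Hypothetical two-feature data.
labeled = [(1.0, 1.0), (5.0, 5.0)]
labels = ["+", "-"]
unlabeled = [(1.5, 0.5), (2.0, 2.0), (4.0, 4.5), (6.0, 5.5)]
centroids, pseudo = self_train(labeled, labels, unlabeled)
```

Early mistakes on unlabeled instances are fed back into training, so self-training can reinforce its own errors when the initial model is poor.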

The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption
Semi-Supervised SVM
• Maximize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
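
A single-machine sketch of the stochastic gradient descent idea is shown below; it is not the Hadoop implementation from the slides. The objective (hinge loss on labeled instances plus a weighted hinge on the absolute score of unlabeled instances), the step-size schedule, the parameter names, and the toy data are assumptions, and the class balance constraint is left out for brevity.

```python
import random

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def objective(w, b, labeled, unlabeled, lam, c_unlabeled):
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    labeled_loss = sum(max(0.0, 1 - y * score(w, b, x)) for x, y in labeled)
    unlabeled_loss = sum(max(0.0, 1 - abs(score(w, b, x))) for x in unlabeled)
    return reg + labeled_loss + c_unlabeled * unlabeled_loss

def sgd_run(labeled, unlabeled, rng, lam=0.01, c_unlabeled=0.5, lr0=0.5):
    w, b = [0.0] * len(labeled[0][0]), 0.0
    data = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(data)                          # one run over the data in random order
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                     # steps get smaller over time
        f = score(w, b, x)
        weight = 1.0
        if y is None:
            y = (f > 0) - (f < 0)              # pseudo-label = current prediction (0 on the boundary)
            weight = c_unlabeled               # influence of unlabeled instances
        w = [wi * (1 - lr * lam) for wi in w]  # regularization shrink
        if y * f < 1:                          # inside the margin or misclassified
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
            b += lr * weight * y
    return w, b

def best_of_runs(labeled, unlabeled, runs=10, seed=0, lam=0.01, c_unlabeled=0.5):
    """Many random runs; keep the one with the lowest objective."""
    rng = random.Random(seed)
    candidates = [sgd_run(labeled, unlabeled, rng, lam, c_unlabeled) for _ in range(runs)]
    return min(candidates, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled, lam, c_unlabeled))

# Hypothetical data: labels are +1 / -1.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w, b = best_of_runs(labeled, unlabeled)
```

Because the problem is non-convex, individual runs can differ; comparing runs by the objective is one simple way to pick among them.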

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
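
The sketch below combines a nearest-neighbor similarity graph with iterative label propagation. It assumes binary labels encoded as +1 and -1, an unweighted symmetric k-nearest-neighbor graph, and clamping of the labeled nodes; these choices and the toy points are illustrative assumptions.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as a list of adjacency sets."""
    neighbors = [set() for _ in points]
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def propagate(neighbors, labels, iterations=100, tol=1e-6):
    """labels[i] is +1, -1, or None. Returns one score in [-1, 1] per node."""
    scores = [float(l) if l is not None else 0.0 for l in labels]
    for _ in range(iterations):
        new_scores = []
        for i, l in enumerate(labels):
            if l is not None:
                new_scores.append(float(l))    # labeled nodes keep their label
            else:
                # Unlabeled nodes take the average score of their neighbors.
                new_scores.append(sum(scores[j] for j in neighbors[i]) / len(neighbors[i]))
        converged = max(abs(a - b) for a, b in zip(scores, new_scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Hypothetical one-feature data: two groups, one labeled instance in each.
points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = [1, None, None, None, None, -1]
scores = propagate(knn_graph(points, k=1), labels)
predicted = [1 if s > 0 else -1 for s in scores]
```

The fixed point of this iteration solves a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.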

Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments
• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
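
One common way to check whether a difference between A and B could be due to chance is a two-proportion z-test on conversion rates. The slides do not name a specific test, so this choice, the function name, and the counts below are assumptions made for illustration.

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p_value

# Hypothetical experiment: 5.0% vs 5.6% conversion with 10,000 users per variant.
z, p_value = two_proportion_z_test(500, 10000, 560, 10000)
```

In this made-up example the p-value lands slightly above 0.05, so a lift that looks large on a dashboard would not pass the usual significance threshold.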

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B

• A was 85 better

A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw!
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 111111 or 010101
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly when he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

Feature selection, Model training, Model scoring, Evaluation, Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
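
As a rough sketch of reporting a metric with an interval rather than a single number, the code below runs k-fold cross-validation and summarizes fold accuracies with a mean and a normal-approximation 95% half-width. The threshold "model", the toy data, and the use of 1.96 standard errors (fold scores are not independent, so this is only approximate) are assumptions for illustration.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices once and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, train_fn, predict_fn, k=5, seed=0):
    """Return (mean accuracy, half-width of an approximate 95% interval)."""
    accuracies = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = train_fn(train_x, train_y)          # the model never sees the held-out fold
        correct = sum(predict_fn(model, xs[i]) == ys[i] for i in fold)
        accuracies.append(correct / len(fold))
    mean = sum(accuracies) / k
    std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (k - 1))
    return mean, 1.96 * std / math.sqrt(k)

# Hypothetical one-feature model: threshold halfway between the class means.
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

xs = [float(v) for v in range(20)]
ys = [0] * 10 + [1] * 10
mean_accuracy, half_width = cross_validate(xs, ys, train_threshold, predict_threshold)
```

A test metric that sits implausibly far above such an interval on a new run is a hint to look for a target leak before celebrating.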

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
Feature selection, Model training, Model scoring, Evaluation, Train/Test split
6. Iterate steps (2)–(5) until the test metrics are satisfactory

Access Data → Pre-processing → Feature construction → Model scoring

Book Recommendation

That's all for our course…

Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data and investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing ID may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly recompute training sets and gold sets. Determine the frequency of recomputation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender age smoker eye color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
yes
no
no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine LearningLack of Labelled Training Datahellip
What to Do
bull Controlled Experiments ndash get feedback from user to serve as labels
bull Mechanical Turk ndash pay people to label data to build training set
bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)
Deriving Knowledge from Data at Scale
What if you cant get labeled Training Data
Traditional Supervised Learning
bull Promotion on bookseller rsquos web page
bull Customers can rate books
bull Will a new customer like this book
bull Training set observations on previous customers
bull Test set new customers
Whathappensif only few customers rate a book
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Model
Test Data
Prediction
Training Data
Attributes
Target
Label
copy 2013 Datameer Inc All rights reserved
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
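The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is only an illustration under assumptions that are not on the slide: a plain k-means with caller-supplied starting centroids, toy 2-D points, and the hypothetical helper names `kmeans` and `cluster_then_label`.

```python
from collections import Counter

def kmeans(points, centroids, iters=20):
    """Plain k-means (Lloyd's algorithm) with caller-supplied starting centroids."""
    assign = []
    for _ in range(iters):
        assign = []
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            k = d.index(min(d))            # index of the closest centroid
            assign.append(k)
            clusters[k].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return assign

def cluster_then_label(points, labels, init):
    """labels[i] is a class label or None; every point gets its cluster's majority label."""
    assign = kmeans(points, init)
    votes = {}
    for k, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(k, Counter())[y] += 1
    return [votes[k].most_common(1)[0][0] if k in votes else None for k in assign]

# Toy data (hypothetical): two well-separated groups, one labeled point in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_label(points, labels, init=[points[0], points[3]])
print(predicted)
```

The approach only works as well as the clustering assumption holds: if a cluster contains both classes, the majority vote mislabels the minority.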
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
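A minimal sketch of the two EM steps for a one-dimensional mixture of Gaussians follows. The data, the starting means, the fixed iteration count, and the function names are assumptions made for illustration; a practical version would also monitor the log-likelihood and try several starting points because of local optima.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, init_means, iters=50):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances, responsibilities."""
    means = list(init_means)
    k = len(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: probability that each instance belongs to each cluster.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster from those probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor to avoid a collapsed cluster
    return weights, means, variances, resp

# Toy data (hypothetical): one group near 1, one group near 5.
xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1]
weights, means, variances, resp = em_gmm_1d(xs, init_means=[0.0, 6.0])
print(means)
```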
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
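The four self-training steps can be sketched as a loop. The base learner here is a nearest-centroid classifier, chosen only to keep the example short; the slide does not prescribe a model, and the data and function names are hypothetical.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid 'model': one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        s = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=10):
    model = fit_centroids(labeled_x, labeled_y)                  # learn on labeled only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, x) for x in unlabeled_x]   # apply to unlabeled
        if new_guesses == guesses:                               # converged: labels stable
            break
        guesses = new_guesses
        model = fit_centroids(labeled_x + unlabeled_x,           # learn on all instances
                              labeled_y + guesses)
    return model, guesses

labeled_x = [(0.0, 0.0), (4.0, 4.0)]
labeled_y = ["-", "+"]
unlabeled_x = [(1.0, 1.0), (1.5, 0.5), (3.0, 3.5), (3.5, 3.0), (2.2, 2.2)]
model, guesses = self_train(labeled_x, labeled_y, unlabeled_x)
print(guesses)
```

Self-training can reinforce its own early mistakes, so in practice one often adds only the most confident predictions in each round; that refinement is not shown here.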
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
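A single-machine sketch of the stochastic gradient descent idea is given below; it is not the Hadoop implementation described on the slide. Several choices are assumptions of this sketch: a linear classifier without bias term, hinge loss for labeled instances, a symmetric hinge on |f(x)| for unlabeled instances (the current prediction acts as a pseudo-label), a short warm start on the labeled data, and the particular step-size schedule. The class-balance constraint is omitted.

```python
import random

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def s3vm_sgd(labeled, unlabeled, lr0=0.05, reg=0.01, unl_weight=0.5, seed=0):
    """One shuffled pass of SGD for a linear semi-supervised SVM (illustrative sketch)."""
    w = [0.0] * len(labeled[0][0])
    # Warm start on labeled instances only (an assumption of this sketch), so that
    # the pseudo-labels of unlabeled instances are not arbitrary at w = 0.
    for x, y in labeled:
        if y * dot(w, x) < 1:
            w = [wi + lr0 * y * xi for wi, xi in zip(w, x)]
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(stream)           # one run over the data in random order
    for t, (x, y) in enumerate(stream, start=1):
        lr = lr0 / (1 + 0.1 * t)                  # steps get smaller over time
        f = dot(w, x)
        w = [wi * (1 - lr * reg) for wi in w]     # shrink: gradient of the L2 penalty
        if y is not None and y * f < 1:           # labeled, misclassified or inside margin
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif y is None and abs(f) < 1:            # unlabeled, inside margin
            guess = 1 if f >= 0 else -1
            w = [wi + lr * unl_weight * guess * xi for wi, xi in zip(w, x)]
    return w

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w = s3vm_sgd(labeled, unlabeled)
print(w)
```

Because the problem is non-convex, different random orders can end in different solutions; running the function with several `seed` values and keeping the run with the best objective mirrors the "many random runs" bullet.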
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
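As a rough sketch of a similarity graph, the function below connects each instance to its k nearest neighbors. Squared Euclidean distance as the (dis)similarity score, the symmetrization rule, and the name `knn_graph` are assumptions; the brute-force search is quadratic in the number of instances and is meant for illustration only.

```python
def knn_graph(points, k):
    """Return a symmetric adjacency dict: node index -> set of neighbor indices."""
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)   # connect if either node is among the other's nearest neighbors
    return adj

# Toy 1-D data (hypothetical): two groups far apart.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
graph = knn_graph(points, k=1)
print(graph)
```

With this data the graph falls into two connected components, one per group, which is the structure that graph-based methods exploit.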
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
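The iteration can be sketched on a small graph as follows. This is one common variant, assumed here for illustration: labeled nodes are clamped to +1 or -1, and every unlabeled node repeatedly takes the average score of its neighbors. The graph, the fixed iteration count, and the function name are hypothetical.

```python
def label_propagation(adj, seeds, iters=100):
    """adj: node -> set of neighbors; seeds: node -> +1.0 or -1.0 for labeled nodes."""
    score = {i: seeds.get(i, 0.0) for i in adj}
    for _ in range(iters):
        new = {}
        for i, nbrs in adj.items():
            if i in seeds:
                new[i] = seeds[i]                                  # labeled nodes stay fixed
            elif nbrs:
                new[i] = sum(score[j] for j in nbrs) / len(nbrs)   # average of neighbors
            else:
                new[i] = 0.0                                       # isolated node: no signal
        score = new
    return score

# A chain 0 - 1 - 2 - 3 - 4 with the two end nodes labeled.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
seeds = {0: 1.0, 4: -1.0}
scores = label_propagation(adj, seeds)
print(scores)
```

The sign of each final score gives the predicted class. The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.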
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
[Figure: Left Elevator, Right Elevator]
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
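One way to check that a difference between A and B is "not due to chance" is a two-proportion z-test on conversion counts. The sketch below assumes a binary metric (converted or not), independent users, and samples large enough for the normal approximation; the counts in the example are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))         # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: control converts 1000 of 10000 users, treatment 1100 of 10000.
z, p_value = two_proportion_z_test(1000, 10000, 1100, 10000)
print(round(z, 2), round(p_value, 3))

# Identical results in both groups give no evidence of a difference.
z0, p0 = two_proportion_z_test(100, 1000, 100, 1000)
```

A small p-value says the observed difference would be unlikely if A and B performed the same; it is the randomized assignment of users, not the test itself, that supports the causal reading.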
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw!
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train / Test split
5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
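As a sketch of reporting a metric with an interval, the code below runs k-fold cross-validation and returns the mean accuracy with a rough 95% half-width. The normal-approximation interval (1.96 standard errors across folds) is an assumption and is crude for a handful of folds; the threshold classifier and the data are hypothetical stand-ins for a real model and dataset.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])   # never sees the fold
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    half_width = 1.96 * math.sqrt(var / k)        # rough 95% interval across folds
    return mean, half_width, scores

def fit_threshold(xs, ys):
    """Toy 1-D model: threshold halfway between the two class means."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

def predict_threshold(threshold, x):
    return int(x >= threshold)

xs = list(range(10)) + list(range(20, 30))
ys = [0] * 10 + [1] * 10
mean, half_width, scores = cross_validate(xs, ys, fit_threshold, predict_threshold, k=5)
print(mean, half_width)
```

On this deliberately easy data every fold is classified perfectly. On real data, a perfect or near-perfect score is a reason to check for a target leak before celebrating.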
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train / Test split
6. Iterate steps (2) through (5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature construction
Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
• Test set blindness
• Reproducibility and Data is gold
• Experimentation velocity
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can't optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce and understand the simple yardstick's result, it becomes easier to improve on the latest "production" yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
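To illustrate putting the weighting in the metric rather than in the sampling, here is a minimal cost-weighted error for a binary problem. The cost values (a false negative counted ten times as heavily as a false positive) and the function name are hypothetical; the point is only that the weights live in one documented function that can be applied unchanged to train, test, and production data.

```python
def mean_weighted_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average misclassification cost; correct predictions cost nothing."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            total += fn_cost      # false negative
        elif t == 0 and p == 1:
            total += fp_cost      # false positive
    return total / len(y_true)

# One false negative and one false positive over four examples.
score = mean_weighted_cost([1, 1, 0, 0], [0, 1, 1, 0])
print(score)
```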
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
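The "most frequent instances" idea in item 7 can be sketched as a single counting pass that keeps the top-k values and maps everything else to one shared bucket. The tiny k, the example values, and the function names are hypothetical; the counting pass is meant to run on a data set that is disjoint from training and test data, as the item requires.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """One pass over unlabeled data: keep an index for the top_k most frequent values."""
    counts = Counter(values)
    return {v: i for i, (v, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocab):
    """Frequent values get their own parameter index; the rest share one 'other' index."""
    return vocab.get(value, len(vocab))

# Hypothetical feature-optimization set (no labels needed).
values = ["a", "b", "a", "c", "a", "b", "d"]
vocab = build_vocabulary(values, top_k=2)
print(vocab, encode("c", vocab))
```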
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
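Option (c), recasting the problem as a time series, amounts to building patterns of the form (inputs from t-T to t-1, target at time t). A minimal sketch, with a hypothetical function name and a toy series:

```python
def sliding_windows(series, T):
    """Turn a sequence into (window of the previous T values, value at time t) patterns."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

patterns = sliding_windows([1, 2, 3, 4, 5], T=2)
print(patterns)
```

Each pattern is larger than a single observation, which is the cost the item mentions; when splitting such patterns into train and test sets, splitting by time keeps future values out of training.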
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both training and test set, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
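The grouping concern in item 10 can be handled by splitting on the entity rather than on individual examples. The sketch below assigns whole groups (here, writers) to either train or test; the function name, the test fraction, and the example data are hypothetical.

```python
import random

def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group (e.g. writer) lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting data: (writer, word) pairs, three words from each of four writers.
examples = [(w, word) for w in ["ann", "bob", "cat", "dan"] for word in ["the", "cat", "sat"]]
train, test = group_split(examples, group_of=lambda e: e[0])
print(len(train), len(test))
```

With this split the test metric measures generalization to unseen writers, which is usually the harder and more honest number.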
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine LearningLack of Labelled Training Datahellip
What to Do
bull Controlled Experiments ndash get feedback from user to serve as labels
bull Mechanical Turk ndash pay people to label data to build training set
bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)
Deriving Knowledge from Data at Scale
What if you cant get labeled Training Data
Traditional Supervised Learning
bull Promotion on bookseller rsquos web page
bull Customers can rate books
bull Will a new customer like this book
bull Training set observations on previous customers
bull Test set new customers
Whathappensif only few customers rate a book
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Model
Test Data
Prediction
Training Data
Attributes
Target
Label
copy 2013 Datameer Inc All rights reserved
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
BeyondMixtures of Gaussians
Expectation-Maximization
bull Can be adjusted to all kinds of mixture models
bull Eg use Naive Bayes as mixture model for text classification
Self-Training
bull Learn model on labeled instances only
bull Apply model to unlabeled instances
bull Learn new model on all instances
bull Repeat until convergence
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
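A single-machine sketch of the stochastic gradient descent pass described above, without any Hadoop machinery. Several details are assumptions made for illustration: the hinge loss for labeled instances, the symmetric "hat" loss for unlabeled ones, the `1/(1+t)` step schedule, the values of `lam` and `unlabeled_weight`, and picking the best run by labeled training errors. A bias term can be emulated by appending a constant feature.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd_pass(labeled, unlabeled, lr0=0.5, lam=0.01,
                  unlabeled_weight=0.1, seed=0):
    """One pass over the shuffled data; returns the weights of sign(w . x).

    labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x.
    """
    rng = random.Random(seed)
    data = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(data)                          # one run in random order
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1.0 + t)                   # steps get smaller over time
        score = dot(w, x)
        if y is not None and y * score < 1.0:
            target, step = y, lr               # labeled instance violates the margin
        elif y is None and 0.0 < abs(score) < 1.0:
            # unlabeled instance inside the margin: push it to its current side
            target, step = (1 if score > 0 else -1), lr * unlabeled_weight
        else:
            continue
        w = [(1.0 - lr * lam) * wi + step * target * xi for wi, xi in zip(w, x)]
    return w


def labeled_errors(w, labeled):
    return sum(1 for x, y in labeled if y * dot(w, x) <= 0)


def best_of_runs(labeled, unlabeled, runs=5):
    """Many random runs; keep the one with the fewest labeled errors (assumed criterion)."""
    candidates = [s3vm_sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: labeled_errors(w, labeled))
```

Because the objective is non-convex, different shuffles can end in different solutions, which is why several random runs are compared.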
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
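The propagation idea above can be sketched for two classes encoded as +1 and -1. For brevity this example assumes a fully connected graph with Gaussian similarity weights rather than a nearest-neighbor graph; the function name `label_propagation` and the `sigma` parameter are illustrative choices.

```python
import math


def label_propagation(points, labels, sigma=1.0, max_iter=1000, tol=1e-9):
    """points: list of feature vectors; labels: dict index -> +1.0 or -1.0.

    Returns one score per point; the sign is the predicted class.
    """
    n = len(points)

    def sim(a, b):
        d2 = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-d2 / (2.0 * sigma ** 2))

    # Similarity graph (dense here; a kNN graph would zero out most entries).
    W = [[sim(points[i], points[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    f = [labels.get(i, 0.0) for i in range(n)]
    for _ in range(max_iter):
        new_f = []
        for i in range(n):
            if i in labels:
                new_f.append(labels[i])        # labeled nodes stay clamped
            else:
                total = sum(W[i])
                # unlabeled node takes the weighted average of its neighbors
                new_f.append(sum(W[i][j] * f[j] for j in range(n)) / total)
        change = max(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:                       # repeat until convergence
            break
    return f
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the method is equivalent to a matrix inversion.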
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
• Lesson 3: Prepare to be humbled
[Figure labels: Left Elevator, Right Elevator]
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm that differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by the changes introduced in the treatment(s)
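One common way to check that a difference between control (A) and treatment (B) is unlikely to be due to chance is a two-proportion z-test on conversion counts. The sketch below is an illustration under assumptions: the slides do not name a specific test, and the function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_*: number of converting users; n_*: number of users in the variant.
    Returns (z, p_value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under the null hypothesis
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided normal tail probability
    return z, p_value
```

For example, `two_proportion_z_test(500, 10000, 600, 10000)` compares a 5% and a 6% conversion rate on 10,000 users each; a p-value below the chosen threshold (0.05 is customary) suggests the difference is not due to chance.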
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes that a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936; 44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out-of-Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Callouts on the experiment graph:
• Getting data for the experiment
• Splitting into training and testing datasets
• Using classification algorithms
• Evaluating the model
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective -> Access and Understand the Data -> Pre-processing -> Feature and/or Target Construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Train/Test Split -> Feature Selection -> Model Training -> Model Scoring -> Evaluation
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
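Step 5 asks for metrics with confidence intervals and notes that cross-validation helps. A minimal sketch of k-fold cross-validation is shown here; the toy threshold classifier, the function names, and the normal-approximation interval (mean plus or minus 1.96 standard errors across folds) are assumptions for illustration, and a t-based interval is more appropriate when the number of folds is small.

```python
import math
import random
import statistics


def fit_threshold(X, y):
    """Toy 1-D model (assumption): threshold halfway between the class means."""
    m0 = statistics.mean([x for x, t in zip(X, y) if t == 0])
    m1 = statistics.mean([x for x, t in zip(X, y) if t == 1])
    return (m0 + m1) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


def k_fold_accuracy(X, y, fit, predict, k=5, seed=0):
    """Return (mean accuracy, approximate 95% confidence interval) over k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint held-out sets
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)
```

A test metric that is perfect on every fold of real data is worth a second look: as the step above warns, it is often a sign of a target leak rather than a great model.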
Deriving Knowledge from Data at Scale
Define Objective -> Access and Understand the Data -> Pre-processing -> Feature and/or Target Construction -> Train/Test Split -> Feature Selection -> Model Training -> Model Scoring -> Evaluation
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data -> Pre-processing -> Feature Construction -> Model Scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course…
Deriving Knowledge from Data at Scale
Building gold sets is hard work. Many common and avoidable mistakes are made. This suggests having a checklist. Some questions will be trivial to answer or not applicable; some will require work…
1. Metrics: For each gold set, choose one (1) metric. Having two metrics on the same gold set is a problem (you can’t optimize both at once).
2. Weighting/Slicing: Not all errors are equal. This should be reflected in the metric, not through sampling manipulation. Having the weighting in the metric has two advantages: 1) it is explicitly documented and reproducible in the form of a metric algorithm, and 2) production, train, and test set results remain directly comparable (automatic testing).
3. Yardstick(s): Define algorithms and configuration parameters for public yardstick(s). There could be more than one yardstick. A simple yardstick is useful for ramping up. Once one can reproduce/understand the simple yardstick’s result, it becomes easier to improve on the latest “production” yardstick. Ideally, yardsticks come with downloadable code. The yardsticks provide a set of errors that suggests where innovation should happen.
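Item 2 above recommends putting the error weighting into the metric itself rather than into the sampling. A minimal sketch for a binary problem follows; the function name and the default costs are hypothetical and would come from the application in practice.

```python
def weighted_error(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Average cost per example; labels are 1 (positive) or 0 (negative).

    fn_cost and fp_cost are hypothetical costs of a false negative
    and a false positive.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost        # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost        # false alarm
    return total / len(y_true)
```

Because the weights live in the metric code, the same function can score production, training, and test sets without resampling any of them.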
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1 GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally, we would have 2 and possibly 3 types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be “building block” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
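The frequency-based selection in item 7 can be sketched with a single counting pass. The names `build_vocabulary` and `encode` and the shared "other" bucket are illustrative assumptions; per the checklist, the values counted here should come from a data set disjoint from training and test.

```python
from collections import Counter


def build_vocabulary(values, top_n):
    """One pass over unlabeled values: keep the top_n most frequent ones."""
    counts = Counter(values)
    return {value: i for i, (value, _) in enumerate(counts.most_common(top_n))}


def encode(value, vocab):
    """Map a value to its own parameter index, or to a shared 'other' bucket."""
    return vocab.get(value, len(vocab))
```

With `top_n` set to 10M, each frequent IP address, query, or listing id gets its own trainable parameter, and everything else shares one index.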
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases, we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space, the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples for the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity on both training and testing on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
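The grouping item above (keep every example of one writer, advertiser, or user on a single side of the split) can be sketched as a split over groups instead of over individual examples. The function name `group_split` and its parameters are illustrative assumptions.

```python
import random


def group_split(records, group_of, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both train and test.

    group_of: function mapping a record to its group key (e.g. the writer id).
    """
    groups = sorted({group_of(r) for r in records})
    random.Random(seed).shuffle(groups)            # shuffle groups, not records
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in test_groups]
    test = [r for r in records if group_of(r) in test_groups]
    return train, test
```

With handwriting data, `group_of` would return the writer, so the test set measures generalization to unseen writers rather than to unseen words.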
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do?
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, ‘hot or not’, review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (attributes: Age, Income; target/label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Model -> Prediction
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
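The "cluster, then perform a majority vote" recipe can be sketched with a small k-means. The naive initialization (the first k points become the starting centers) and the function names are assumptions made to keep the example deterministic and short.

```python
from collections import Counter


def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, k, iters=50):
    """Plain k-means; returns the cluster index of every point."""
    centers = [list(p) for p in points[:k]]      # naive init (assumption)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_then_label(points, labels, k=2):
    """labels: dict index -> class for the few labeled points.

    Every point receives the majority label of its cluster
    (None if the cluster contains no labeled point).
    """
    assign = k_means(points, k)
    votes = {}
    for i, label in labels.items():
        votes.setdefault(assign[i], Counter())[label] += 1
    return [votes[a].most_common(1)[0][0] if a in votes else None
            for a in assign]
```

This only works when the clustering assumption holds: if a class spans several clusters or two classes share one, the vote propagates wrong labels.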
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster-assignment probabilities for each instance
• Might converge to a local optimum
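The two-step procedure can be illustrated with Expectation-Maximization for a one-dimensional mixture of Gaussians. This is a minimal sketch: the function names, the unit starting variances, and the `min_var` floor are assumptions, and the result depends on the starting means because EM may converge to a local optimum.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, means, iters=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture, starting from the given means.

    Returns (weights, means, variances, responsibilities).
    """
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: cluster-assignment probabilities for each instance
        resp = []
        for x in xs:
            dens = [weights[c] * normal_pdf(x, means[c], variances[c])
                    for c in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster
        for c in range(k):
            nk = sum(r[c] for r in resp)
            weights[c] = nk / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nk
            variances[c] = max(min_var,
                               sum(r[c] * (x - means[c]) ** 2
                                   for r, x in zip(resp, xs)) / nk)
    return weights, means, variances, resp
```

In the semi-supervised setting, the responsibilities of labeled instances would be fixed to their known class while those of unlabeled instances are re-estimated in each E-step.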
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adapted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: GET THE DATA
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator / Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
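The statistical-test bullet can be made concrete with a two-proportion z-test, one common choice for comparing conversion rates between A and B. This is a minimal sketch, not the test used in the lecture: the function name, the visitor counts and the 0.05 threshold are all illustrative assumptions.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)      # rate under "no difference"
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal tail
    return z, p_value

# Hypothetical counts: 200 of 1000 users convert on A, 250 of 1000 on B.
z, p = two_proportion_z_test(200, 1000, 250, 1000)
significant = p < 0.05   # assumed significance threshold
```

A small p-value only says the difference is unlikely to be chance; it is the random assignment of users to A and B that supports the causal reading.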
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
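Step 5 mentions reporting metrics with confidence intervals via cross-validation. The sketch below shows one rough way to do that with k folds and a normal approximation over the fold scores. The nearest-centroid classifier, the toy one-dimensional data and every function name are assumptions made for illustration; fold scores are not truly independent, so the interval is only indicative.

```python
import math
import random

def nearest_centroid_fit(rows):
    """rows: list of (x, label); returns {label: mean of x}."""
    groups = {}
    for x, y in rows:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}

def nearest_centroid_predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def cross_validate(rows, k=5, seed=0):
    """Return (mean accuracy, approximate 95% half-width) over k folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    scores = []
    for fold in range(k):
        test = rows[fold::k]                                   # every k-th row
        train = [r for i, r in enumerate(rows) if i % k != fold]
        model = nearest_centroid_fit(train)
        hits = sum(nearest_centroid_predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / (k - 1)
    return mean, 1.96 * math.sqrt(var / k)

# Toy data, separable by construction (so a perfect score is expected here).
rows = [(i * 0.1, "neg") for i in range(10)] + [(5 + i * 0.1, "pos") for i in range(10)]
mean_acc, half_width = cross_validate(rows)
```

On real data, a score this clean would be a cue to check for a target leak rather than a reason to celebrate.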
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2) – (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data; Pre-processing; Feature construction; Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
4. Sizes and access: What are the set sizes? Each size corresponds to an innovation velocity and a level of representativeness. A good rule of thumb is 5X size ratios between gold sets drawn from the same distribution. Where should the data live? If on a server, some services are needed for access and simple manipulations. There should always be a size that is downloadable (< 1GB) to a desktop for high-velocity innovation.
5. Documentation and format: Create a format/API for the data. Is the data compressed? Provide sample code to load the data. Document the format. Assign someone to be the curator of the gold set.
Deriving Knowledge from Data at Scale
6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2, and possibly 3, types of gold sets:
a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate their potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
c. One set should have an extended number of features. The additional features may be "building block" features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
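The "pass over the data" in item 7 can be sketched as a frequency count over a separate, unlabeled optimization set. The names `top_k_values`, `encode` and the `<other>` bucket are hypothetical, and k is tiny here where the text speaks of 10M.

```python
from collections import Counter

OTHER = "<other>"  # hypothetical bucket for values outside the top k

def top_k_values(values, k):
    """One pass over the optimization set: keep the k most frequent values."""
    return {value for value, _ in Counter(values).most_common(k)}

def encode(value, vocabulary):
    """Frequent values keep their own parameter; the rest share one bucket."""
    return value if value in vocabulary else OTHER

# Unlabeled optimization set, disjoint from training and test (e.g. queries).
optimization_set = ["q1", "q2", "q1", "q3", "q1", "q2"]
vocab = top_k_values(optimization_set, k=2)
encoded = [encode(v, vocab) for v in ["q1", "q3", "q9"]]
```

Because the vocabulary is chosen without labels and from a disjoint set, it does not leak information from the training or test sets.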
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing distribution problem:
a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
b. Decompose the model along "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
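Option (c) above, recasting the data as (inputs from t-T to t-1, prediction at t) patterns, amounts to a sliding window. A minimal sketch, with the function name `make_windows` and the sample series assumed for illustration:

```python
def make_windows(series, T):
    """Turn a series into (inputs from t-T..t-1, target at t) patterns."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

# Hypothetical daily KPI values.
patterns = make_windows([10, 12, 11, 15, 14], T=2)
```

Each pattern carries T past values, which is why the patterns become larger as the window grows.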
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have this type of constraint? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
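The grouping concern in item 10 can be illustrated with a split that assigns whole groups (writers, advertisers, users) to either training or test. This is a sketch under assumed names (`group_split`, `group_of`) and made-up handwriting data, not a prescribed procedure.

```python
import random

def group_split(examples, group_of, test_fraction=0.2, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical handwriting examples: (writer id, word).
examples = [(f"writer{w}", word) for w in range(5) for word in ("cat", "dog", "sun")]
train, test = group_split(examples, group_of=lambda e: e[0])
```

Shuffling individual words instead would place the same writer on both sides of the split, which is the leak the text warns about.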
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
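"Cluster, then perform majority vote" can be sketched as follows: run any clustering algorithm on all instances, then give each cluster the most common label among its few labeled members. The tiny k-means below, its fixed initial centers and the toy points are assumptions chosen to keep the example deterministic.

```python
from collections import Counter

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, n_iter=20):
    """Plain Lloyd iterations from the given initial centers; returns assignments."""
    centers = list(centers)
    assign = []
    for _ in range(n_iter):
        assign = [min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_then_label(points, labels, centers):
    """labels[i] is a class label or None; each cluster takes its majority label."""
    assign = kmeans(points, centers)
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]      # only two instances are labeled
predicted = cluster_then_label(points, labels, centers=[(0.0, 0.0), (6.0, 6.0)])
```

The approach only works to the extent that the clustering assumption holds; a cluster with no labeled member stays unlabeled (`None`).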
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
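The two EM steps can be sketched for a one-dimensional mixture of two Gaussians. Labeled instances are handled here by clamping their assignment probabilities, which is one simple way to use the few labels and is an assumption of this sketch, as are the function names, the variance floor and the toy data.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, means, labels=None, n_iter=50, min_var=1e-3):
    """labels: {index: component} for the labeled instances (assumed format)."""
    labels = labels or {}
    means, variances, weights = list(means), [1.0, 1.0], [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: assignment probabilities for each instance.
        resp = []
        for i, x in enumerate(xs):
            if i in labels:
                resp.append([1.0 if k == labels[i] else 0.0 for k in (0, 1)])
            else:
                p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
                total = p[0] + p[1]
                resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate location, shape and weight of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(min_var, spread)   # floor avoids a collapsing cluster
            weights[k] = nk / len(xs)
    return means, variances, weights, resp

xs = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.0]
means, variances, weights, resp = em_two_gaussians(xs, means=(2.0, 8.0),
                                                   labels={0: 0, 5: 1})
```

With poorly chosen initial means the same loop can settle in a local optimum, as the slide notes, so several initializations are usually tried.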
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
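The four self-training steps map directly onto a loop. The base learner below is a one-dimensional nearest-centroid classifier chosen only to keep the sketch short; any classifier could take its place, and the function names and data are assumptions.

```python
def fit_centroids(xs, ys):
    """Base learner: one centroid (mean) per class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    model = fit_centroids(labeled_x, labeled_y)          # labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in unlabeled_x]   # apply the model
        if new_pseudo == pseudo:                          # converged: labels stable
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled_x + unlabeled_x,    # learn on all instances
                              labeled_y + pseudo)
    return model, pseudo

model, pseudo = self_train([0.0, 10.0], ["a", "b"], [1.0, 2.0, 4.0, 6.0, 8.0, 9.0])
```

Self-training can reinforce its own early mistakes, so in practice only confident predictions are often added; this sketch adds all of them for brevity.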
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: sends data to the reducer in random order
• Reducer: updates the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
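The procedure on this slide can be sketched in a single process, with the loop body standing in for the reducer's update. The hinge-style losses, the step schedule `lr0 / (1 + t)`, the absence of a bias term and all parameter values are assumptions of this sketch, not details given in the lecture.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_pass(labeled, unlabeled, seed, lr0=0.5, c_u=0.5, lam=0.01):
    """One run over labeled (x, y) pairs and unlabeled x, in random order."""
    data = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                      # steps get smaller over time
        score = dot(w, x)
        if y is None:
            # Unlabeled: skip if outside the margin, or exactly on the boundary
            # (no direction to push in); otherwise push away from the boundary.
            if score == 0 or abs(score) >= 1:
                continue
            y, step = (1 if score > 0 else -1), lr * c_u
        else:
            if y * score >= 1:                  # labeled and safely classified
                continue
            step = lr
        w = [(1 - lr * lam) * wi + step * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, c_u=0.5, lam=0.01):
    loss = 0.5 * lam * dot(w, w)
    loss += sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    loss += c_u * sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return loss

def best_of_runs(labeled, unlabeled, n_runs=10):
    """Many random runs; keep the classifier with the lowest objective."""
    runs = [sgd_pass(labeled, unlabeled, seed) for seed in range(n_runs)]
    return min(runs, key=lambda w: objective(w, labeled, unlabeled))

labeled = [((2.0, 1.5), 1), ((-1.5, -2.0), -1)]
unlabeled = [(1.0, 2.0), (2.5, 2.0), (-2.0, -1.0), (-1.0, -2.5)]
w = best_of_runs(labeled, unlabeled)
```

Because the objective is non-convex, different random orders can end in different classifiers, which is why several runs are compared.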
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: GET THE DATA
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator / Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
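
The stork example can be reproduced with synthetic data: if a hidden factor drives both quantities, they correlate strongly even though neither causes the other. The sketch below is a simulation under made-up assumptions (a hypothetical `growth` factor, Gaussian noise, arbitrary noise levels); it does not use the Oldenburg data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def simulate_confounding(n=2000, seed=0):
    """Return (raw correlation, correlation after removing the hidden factor)."""
    rng = random.Random(seed)
    # Hypothetical hidden factor (e.g. overall growth of the area).
    growth = [rng.gauss(0, 1) for _ in range(n)]
    # Both observed series depend on the hidden factor, not on each other.
    storks = [g + rng.gauss(0, 0.3) for g in growth]
    babies = [g + rng.gauss(0, 0.3) for g in growth]
    raw = pearson(storks, babies)
    adjusted = pearson([s - g for s, g in zip(storks, growth)],
                       [b - g for b, g in zip(babies, growth)])
    return raw, adjusted
```

The raw correlation comes out high, while the correlation between what is left after subtracting the hidden factor is close to zero. In practice the confounder is usually not observed, which is why a controlled experiment is needed to establish causality.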

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

“If you don’t know where you are going, any road will take you there”
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, large set of machine learning algorithms

[Diagram labels: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Diagram: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
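
Step 5 can be sketched as a k-fold cross-validation loop that reports the metric with an interval rather than a single number. Everything below is an illustrative assumption: the helper names, the shuffling seed, and the normal-approximation interval (fold scores are not truly independent, so the interval is only a rough guide).

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal approximation over fold scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean, mean - half, mean + half
```

In use, a model is trained on each `train` index list, scored on the matching `test` list, and the per-fold scores are passed to `mean_with_ci`. A test metric far better than expected is a cue to look for a target leak before celebrating.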

[Diagram: Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2)–(5) until the test metrics are satisfactory

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That’s all for our course…

6. Features: What (gold) features go in the gold sets? Features must be pickled for results to be reproducible. Ideally we would have 2 and possibly 3 types of gold sets:
   a. One set should have the deployed features (computed from the raw data). This provides the production yardstick.
   b. One set should be raw (e.g., contains all information, possibly through tables). This allows contributors to create features from the raw data to investigate its potential compared to existing features. This set has more information per pattern and a smaller number of patterns.
   c. One set should have an extended number of features. The additional features may be “building blocks” features that are scheduled to be deployed next, or high-potential features. Moving some features to a gold set is convenient if multiple people are working on the next generation. Not all features are worth being in a gold set.
7. Feature optimization sets: Does the data require feature optimization? For instance, an IP address, a query, or a listing id may be features, but only the most frequent 10M instances are worth having specific trainable parameters. A pass over the data can identify the top 10M instances. This is a form of feature optimization. Identifying these features does not require labels. If a form of feature optimization is done, a separate data set (disjoint from the training and test sets) must be provided.
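
The "pass over the data" in item 7 can be sketched as a frequency count that keeps only the top-K values and maps everything else to one shared index. The names `build_vocabulary` and `encode`, and the convention of using the last index for all rare values, are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(values, top_k):
    """Map the top_k most frequent values to indices 0..top_k-1 (no labels needed)."""
    counts = Counter(values)
    return {v: i for i, (v, _) in enumerate(counts.most_common(top_k))}

def encode(value, vocab):
    """Return the value's own index, or a shared 'other' index for rare values."""
    return vocab.get(value, len(vocab))
```

The list passed to `build_vocabulary` should come from the separate feature optimization set, not from the training or test sets, so that the choice of which values get their own parameters does not leak information from the evaluation data.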

8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing-distribution problem:
   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or set in place a system to monitor distribution drifts (monitor KPI changes while the algorithm is kept constant).
   b. Decompose the model along “distribution (fast) tracking parameters” and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should have information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
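
One simple way to quantify how much a distribution changes over a fixed period is to compare the value frequencies of a feature in two time windows. The sketch below uses total variation distance; that measure, the function name, and the idea of comparing one categorical feature at a time are assumptions for illustration, not the method prescribed above.

```python
from collections import Counter

def total_variation_distance(sample_a, sample_b):
    """Distance in [0, 1] between the empirical distributions of two samples.

    0 means identical value frequencies; 1 means no values in common.
    """
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    keys = set(counts_a) | set(counts_b)
    # Counter returns 0 for values missing from one of the samples.
    return 0.5 * sum(abs(counts_a[k] / n_a - counts_b[k] / n_b) for k in keys)
```

Tracking this number between the gold set's collection period and recent production data, on a fixed schedule, gives a rough stale-rate signal and a trigger for re-computing the sets.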

10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading because training and testing would have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and test sets, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have these types of constraints? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity for both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
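
The grouping concern in item 10 can be handled by assigning whole groups, rather than individual examples, to the training or test side. The sketch below hashes a group identifier so that every example from the same writer (or advertiser, campaign, user) lands on the same side; the function names, the dictionary-shaped examples, and the use of SHA-256 are illustrative assumptions.

```python
import hashlib

def assign_split(group_id, test_fraction=0.2):
    """Deterministically map a group id to 'train' or 'test'."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "test" if bucket < test_fraction else "train"

def grouped_split(examples, group_key, test_fraction=0.2):
    """Split a list of dict examples so that no group appears on both sides."""
    train, test = [], []
    for example in examples:
        side = assign_split(example[group_key], test_fraction)
        (test if side == "test" else train).append(example)
    return train, test
```

Because the assignment depends only on the group id, re-running the split on fresh data keeps previously seen groups on the same side, which also helps reproducibility. The realized test share is only approximately `test_fraction` when the number of groups is small.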

12. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.

Greatest Challenge in Machine Learning

Training data:
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

New instances:
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no

Train ML Model

The greatest challenge in Machine Learning? Lack of Labelled Training Data…
What to Do?
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not?’, review a product, observe their click behavior (ad retargeting, search results, etc.)

What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (output of the Model)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.

Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
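
The "cluster, then perform majority vote" recipe can be sketched as follows. The cluster assignments are assumed to come from any clustering algorithm (k-Means, EM, ...) run beforehand; the function name and the use of `None` for unlabeled instances are illustrative conventions.

```python
from collections import Counter, defaultdict

def label_by_cluster_vote(cluster_ids, labels):
    """Fill in missing labels with the majority label of each instance's cluster.

    cluster_ids[i] is the cluster of instance i; labels[i] is its label or None.
    Instances in clusters without any labeled member stay None.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster][label] += 1
    majority = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [label if label is not None else majority.get(cluster)
            for cluster, label in zip(cluster_ids, labels)]
```

This only works as well as the clustering assumption holds: if a cluster mixes both classes, the minority instances in it are mislabeled.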

Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
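
A minimal sketch of Expectation-Maximization for a mixture of two one-dimensional Gaussians, to make the two steps concrete. The restriction to one dimension and two components, the initialization at the data extremes, and the fixed iteration count are simplifying assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (means, sigmas, weights)."""
    mu = [min(xs), max(xs)]      # crude initialization at the extremes
    sigma = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate location, shape and weight of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsing cluster
            weight[k] = nk / len(xs)
    return mu, sigma, weight
```

The result depends on the starting point, which is the "local optimum" caveat in practice; for semi-supervised use, the labeled instances would then decide which class each fitted component represents.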

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
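
The self-training loop above can be sketched with a deliberately simple base model. The nearest-centroid classifier, the function names, and the stopping rule (predictions on the unlabeled instances stop changing) are assumptions chosen to keep the example short; any classifier could take its place.

```python
def fit_centroids(points, labels):
    """Base model: one centroid (mean point) per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, point):
    """Return the class whose centroid is closest to the point."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def self_train(labeled, labels, unlabeled, max_rounds=20):
    """labeled/unlabeled are lists of points (lists of floats); returns (model, guesses)."""
    centroids = fit_centroids(labeled, labels)              # learn on labeled instances only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(centroids, p) for p in unlabeled]   # apply to unlabeled
        if new_guesses == guesses:                          # converged: nothing changed
            break
        guesses = new_guesses
        # Learn a new model on all instances, using the guessed labels.
        centroids = fit_centroids(labeled + unlabeled, labels + guesses)
    return centroids, guesses
```

A known risk of self-training is that early mistakes reinforce themselves, since the model is retrained on its own guesses.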

The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption
Semi-Supervised SVM
• Keep the decision boundary far from both labeled and unlabeled instances (maximize the margin)
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
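
A single-machine sketch of one such stochastic gradient run, as a rough illustration only. The update rule is an assumption: labeled instances use a hinge-loss step, unlabeled instances are pushed away from the boundary on whichever side they currently fall, and the class balance constraint and the Hadoop mapper/reducer split are left out. All parameter names and defaults are hypothetical.

```python
import random

def s3vm_sgd(points, labels, unlabeled_weight=0.1, reg=0.01, seed=0):
    """One pass of SGD for a linear semi-supervised SVM; returns (w, b).

    labels[i] is +1, -1, or None for an unlabeled instance.
    """
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)            # one run over the data in random order
    w, b = [0.0] * len(points[0]), 0.0
    for t, i in enumerate(order, start=1):
        x, y = points[i], labels[i]
        step = 1.0 / t ** 0.5                     # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:
            # Unlabeled: treat the current prediction as the label, with reduced influence.
            y, scale = (1 if score >= 0 else -1), unlabeled_weight
        else:
            scale = 1.0
        if y * score < 1:                         # misclassified or inside the margin
            w = [wi + step * scale * y * xi for wi, xi in zip(w, x)]
            b += step * scale * y
        w = [wi * (1 - step * reg) for wi in w]   # regularization shrinks the weights
    return w, b
```

Because the problem is non-convex, different seeds can give different classifiers; running several seeds and keeping the one with the best objective or validation score mirrors the "many random runs" bullet.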

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
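
A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The Gaussian similarity with bandwidth 1, the symmetrization of edges, and the brute-force distance computation are assumptions that only suit small data sets.

```python
import math

def knn_graph(points, k):
    """Return {i: {j: similarity}} linking each point to its k nearest neighbors."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Brute-force distances to all other points, nearest first.
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            similarity = math.exp(-d * d)     # closer points get a score nearer to 1
            graph[i][j] = similarity
            graph[j][i] = similarity          # keep the graph undirected
    return graph
```

The resulting dictionary-of-dictionaries is a convenient input for graph-based methods that spread label information along the edges.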

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion

Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
[Images: Left Elevator, Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Workflow stages: Access Data, Pre-processing, Feature construction, Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
8. Stale rate, optimization, monitoring: How long does the set stay current? In many cases we hide the fact that the problem is a time series, even though the goal is to predict the future and we know that the distribution is changing. We must quantify how much a distribution changes over a fixed period of time. There are several ways to mitigate the changing-distribution problem:
   a. Assume the distribution is IID. Regularly re-compute training sets and gold sets. Determine the frequency of re-computation, or put in place a system to monitor distribution drift (monitor KPI changes while the algorithm is kept constant).
   b. Decompose the model into "distribution (fast) tracking parameters" and slow tracking parameters. The fast tracking model may be a simple calibration with very few parameters.
   c. Recast the problem as a time series problem: patterns are (input data from t-T to t-1, prediction at time t). In this space the patterns are much larger, but the problem is closer to being IID.
9. The gold sets should contain information that reveals the stale rate and allows algorithms to differentiate themselves based on how they degrade with time.
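Option (c) in item 8 can be sketched as a sliding-window transform. The function name `make_windows` and the sample KPI series are hypothetical, chosen only to show how each pattern pairs the inputs from t-T to t-1 with the value at time t.

```python
def make_windows(series, T):
    """Build (inputs, target) patterns: inputs are the T values before t, target is the value at t."""
    return [(series[t - T:t], series[t]) for t in range(T, len(series))]

# Hypothetical daily KPI values.
daily_kpi = [10, 12, 13, 15, 14, 18, 21]
patterns = make_windows(daily_kpi, T=3)
for inputs, target in patterns:
    print(inputs, "->", target)
```

Each pattern is T times larger than a single observation, which is the cost the slide mentions for getting closer to IID.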
Deriving Knowledge from Data at Scale
10. Grouping: Should the patterns be grouped? For example, in handwriting, examples are grouped per writer. A set built by shuffling the words is misleading, because training and testing would both have word examples from the same writer, which makes generalization much easier. If the words are grouped per writer, then a writer is unlikely to appear in both the training and the test set, which requires the system to generalize to never-seen-before handwriting (as opposed to never-seen-before words). Do we have this type of constraint? Should we group per advertiser, campaign, or user to generalize across new instances of these entities (as opposed to generalizing to new queries)? ML requires training and testing data to be drawn from the same distribution. Drawing duplicates is not a problem. Problems arise when one partially draws examples from the same entity in both training and testing, on a small set of entities. This breaks the IID assumption and makes generalization on the test set look much easier than it actually is.
11. Sampling production data: What strategy is used for sampling? Uniform? Are any of the following filtered out: fraud, bad configurations, duplicates, non-billable, adult, overwrites, etc.? Guidance: use the production sameness principle.
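The grouping concern in item 10 can be sketched as a split that assigns whole groups, rather than individual examples, to train or test. The helper `group_split` and the writer/word tuples are hypothetical; the point is only that no writer ends up on both sides.

```python
import random

def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split so that every group lands entirely in train or entirely in test."""
    groups = sorted({group_of(e) for e in examples})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [e for e in examples if group_of(e) not in test_groups]
    test = [e for e in examples if group_of(e) in test_groups]
    return train, test

# Hypothetical (writer, word) examples.
words = [
    ("writer_a", "cat"), ("writer_a", "dog"),
    ("writer_b", "cat"), ("writer_b", "sun"),
    ("writer_c", "dog"), ("writer_d", "sun"),
]
train, test = group_split(words, group_of=lambda e: e[0])
train_writers = {writer for writer, _ in train}
test_writers = {writer for writer, _ in test}
print("train writers:", sorted(train_writers))
print("test writers:", sorted(test_writers))
```

The same word may appear in both sets, which is fine here: the test measures generalization to unseen writers, not unseen words.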
Deriving Knowledge from Data at Scale
12. Unlabeled set: If the number of labeled examples is small, a large set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
Training data (Train ML Model)
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
New instances
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments: get feedback from users to serve as labels
• Mechanical Turk: pay people to label data to build a training set
• Ask Users to Label Data: report as spam, 'hot or not', review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?
Training Data (Attributes: Age, Income; Target: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data (Label unknown)
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
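A minimal sketch of "cluster, then perform majority vote", assuming a plain k-means with a deterministic farthest-point initialisation (a choice made for this example, not something the slides specify). Unlabeled instances are marked `None` and receive the majority label of the labeled instances in their cluster. The data points are made up.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means; returns the cluster index assigned to each point."""
    # Farthest-point initialisation (assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign

def cluster_then_label(points, labels, k):
    """Give every unlabeled point the majority label of its cluster."""
    assign = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for a, y in zip(assign, labels):
        if y is not None:
            votes[a][y] += 1
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [y if y is not None else cluster_label[a] for a, y in zip(assign, labels)]

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]
labels = ["+", None, None, None, "-", None, None, "-"]
completed = cluster_then_label(points, labels, k=2)
print(completed)
```

The approach only works to the extent that the clustering assumption holds: if a class spans several clusters or a cluster mixes classes, the vote propagates wrong labels.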
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
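The two EM steps can be sketched for a one-dimensional mixture of two Gaussians. The initialisation (smallest and largest points as starting means), the variance floor, and the sample data are assumptions of this sketch; a different start can end in a different local optimum, as the slide warns.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture with EM."""
    means = [min(xs), max(xs)]      # crude initialisation (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # floor to avoid a collapsed cluster
            weights[k] = nk / len(xs)
    return means, variances, weights, resp

xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
means, variances, weights, resp = em_two_gaussians(xs)
print("means:", [round(m, 3) for m in means])
print("weights:", [round(w, 3) for w in weights])
```

In the semi-supervised setting, labeled instances would have their responsibilities fixed to their known class while unlabeled instances keep the soft estimates.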
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
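The self-training loop above can be sketched with any base learner. Here a one-dimensional nearest-centroid classifier stands in for the model; that choice, the function names, and the data are assumptions made to keep the example small.

```python
def fit_centroids(xs, ys):
    """Toy base learner (assumption): one centroid per class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    # Learn model on labeled instances only.
    centroids = fit_centroids(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # Apply model to unlabeled instances.
        new_pseudo = [predict(centroids, x) for x in unlabeled_x]
        if new_pseudo == pseudo:
            break  # pseudo-labels stopped changing: converged
        pseudo = new_pseudo
        # Learn new model on all instances.
        centroids = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return centroids, pseudo

labeled_x, labeled_y = [1.0, 6.0], ["-", "+"]
unlabeled_x = [1.5, 2.0, 2.5, 3.0, 4.2, 5.0, 5.5]
centroids, pseudo = self_train(labeled_x, labeled_y, unlabeled_x)
print(pseudo)
print(centroids)
```

Self-training can reinforce its own early mistakes, so in practice it is common to add only high-confidence predictions in each round; this sketch adds all of them.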
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
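One common way to write the semi-supervised SVM objective, which may differ from the exact form used in the lecture, adds a symmetric "hat" loss on unlabeled instances to the usual hinge loss on labeled ones. The sketch below only evaluates that objective for a given linear classifier; the parameter names `c_labeled` and `c_unlabeled` and the data are assumptions, and the class-balance constraint is left out.

```python
def s3vm_objective(w, b, labeled, unlabeled, c_labeled=1.0, c_unlabeled=0.5):
    """Regulariser + hinge loss on labeled data + hat loss on unlabeled data."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    reg = 0.5 * sum(wi * wi for wi in w)
    # Labeled instances are penalised when on the wrong side or inside the margin.
    labeled_loss = sum(max(0.0, 1.0 - y * score(x)) for x, y in labeled)
    # Unlabeled instances are penalised whenever they fall inside the margin,
    # whichever side they are on; abs() is what makes the problem non-convex.
    unlabeled_loss = sum(max(0.0, 1.0 - abs(score(x))) for x in unlabeled)
    return reg + c_labeled * labeled_loss + c_unlabeled * unlabeled_loss

labeled = [((-2.0,), -1), ((2.0,), +1)]
unlabeled = [(-1.5,), (-1.0,), (1.0,), (1.5,)]

# Boundary at x = 0 runs through the empty gap between the two groups.
through_gap = s3vm_objective((1.0,), 0.0, labeled, unlabeled)
# Boundary at x = -1 cuts through unlabeled points and is penalised for it.
through_dense = s3vm_objective((1.0,), 1.0, labeled, unlabeled)
print(through_gap, through_dense)
```

Both boundaries classify the two labeled points with full margin; only the unlabeled term separates them, and `c_unlabeled` plays the role of the parameter that tunes the influence of unlabeled instances.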
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
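A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The Gaussian similarity score, the choice k = 2, and the sample points are assumptions of this example; any similarity measure suited to the data could take their place.

```python
import math

def knn_graph(points, k=2):
    """Map each instance index to the indices of its k most similar instances."""
    graph = {}
    for i, p in enumerate(points):
        # Gaussian similarity: close to 1 for near neighbors, close to 0 for far ones.
        sims = [(math.exp(-math.dist(p, q) ** 2), j)
                for j, q in enumerate(points) if j != i]
        sims.sort(reverse=True)
        graph[i] = [j for _, j in sims[:k]]
    return graph

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, "->", sorted(neighbors))
```

Computing all pairwise similarities is quadratic in the number of instances, which is why large-scale versions usually rely on approximate neighbor search.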
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
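A minimal sketch of label propagation on a small graph, assuming a simple variant in which labeled nodes stay clamped to +1 or -1 and every unlabeled node repeatedly takes the average score of its neighbors. The chain graph and the function name are made up for illustration.

```python
def propagate_labels(neighbors, seeds, iters=500, tol=1e-9):
    """neighbors: node -> list of adjacent nodes; seeds: node -> fixed score (+1 or -1)."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iters):
        new = {}
        for n, nbrs in neighbors.items():
            if n in seeds:
                new[n] = seeds[n]  # labeled nodes keep their label
            else:
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs)
        delta = max(abs(new[n] - scores[n]) for n in neighbors)
        scores = new
        if delta < tol:
            break  # converged
    return scores

# Chain 0-1-2-3-4-5 with one positive and one negative labeled end point.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
seeds = {0: 1.0, 5: -1.0}
scores = propagate_labels(neighbors, seeds)
predicted = {n: ("+" if s > 0 else "-") for n, s in scores.items()}
print({n: round(s, 3) for n, s in scores.items()})
print(predicted)
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.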
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
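One standard check that a difference in conversion rate between A and B is not due to chance is a two-proportion z-test. The sketch below is one such test, not necessarily the one used in the experiments discussed here, and the visitor and conversion counts are invented.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")

# Identical rates give no evidence of a difference.
z_same, p_same = two_proportion_z_test(200, 10_000, 200, 10_000)
```

A small p-value says the observed gap would be unlikely if A and B truly performed the same; the causal reading still rests on users being randomly assigned to the variants.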
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw!
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Stages: Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adapted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
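The four self-training steps can be sketched as follows. The base learner here is a nearest-centroid classifier, chosen only because it is short; the slides do not prescribe a base learner, and all names and data points below are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        s = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, p):
    """Label of the closest class centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], p)))


def self_train(labeled, labels, unlabeled, max_rounds=20):
    model = fit_centroids(labeled, labels)             # 1. learn on labeled instances only
    pseudo = [predict(model, p) for p in unlabeled]    # 2. apply to unlabeled instances
    for _ in range(max_rounds):
        # 3. learn a new model on all instances (true labels + pseudo-labels)
        model = fit_centroids(labeled + unlabeled, labels + pseudo)
        new_pseudo = [predict(model, p) for p in unlabeled]
        if new_pseudo == pseudo:                       # 4. stop once nothing changes
            break
        pseudo = new_pseudo
    return model, pseudo


# Hypothetical toy data: two labeled points, six unlabeled ones.
labeled = [[1.0, 1.0], [5.0, 5.0]]
labels = ["+", "-"]
unlabeled = [[0.0, 0.5], [0.5, 0.0], [1.5, 1.0], [5.5, 6.0], [6.0, 5.0], [4.5, 5.5]]
model, pseudo = self_train(labeled, labels, unlabeled)
print(pseudo)
```

Self-training can reinforce its own early mistakes, so in practice one often adds only the most confident pseudo-labels in each round; this sketch adds all of them for brevity.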
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance of the decision boundary) to both labeled and unlabeled instances
• Parameter to fine-tune the influence of the unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of the SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send the data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
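A single-machine sketch of the procedure above: one SGD pass in random order, a decaying step size, and several random runs from which the one with the lowest objective is kept. The step-size schedule, the constants, the objective used to compare runs, and the synthetic data are all assumptions for illustration. The class-balance constraint and the Hadoop mapper/reducer split are left out.

```python
import random


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def s3vm_sgd_pass(labeled, unlabeled, seed, eta0=0.1, lam=0.01, c_unlabeled=0.5):
    """One SGD pass over all instances in random order; returns (w, b)."""
    rng = random.Random(seed)
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(stream)
    w, b = [0.0] * len(stream[0][0]), 0.0
    for t, (x, y) in enumerate(stream):
        eta = eta0 / (1.0 + 0.01 * t)                 # steps get smaller over time
        f = score(w, b, x)
        w = [wi * (1.0 - eta * lam) for wi in w]      # L2 regularisation shrink
        if y is not None:
            if y * f < 1.0:                           # labeled: misclassified or inside margin
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
        elif f != 0.0 and abs(f) < 1.0:               # unlabeled: inside the margin
            s = 1.0 if f > 0 else -1.0                # push it away from the boundary
            w = [wi + eta * c_unlabeled * s * xi for wi, xi in zip(w, x)]
            b += eta * c_unlabeled * s
    return w, b


def objective(w, b, labeled, unlabeled, lam=0.01, c_unlabeled=0.5):
    """Hinge loss on labeled data plus a margin penalty on unlabeled data (used to compare runs)."""
    loss = 0.5 * lam * sum(wi * wi for wi in w)
    loss += sum(max(0.0, 1.0 - y * score(w, b, x)) for x, y in labeled)
    loss += c_unlabeled * sum(max(0.0, 1.0 - abs(score(w, b, x))) for x in unlabeled)
    return loss


def best_of_runs(labeled, unlabeled, n_runs=5):
    """Many random runs; keep the classifier with the lowest objective."""
    runs = [s3vm_sgd_pass(labeled, unlabeled, seed) for seed in range(n_runs)]
    return min(runs, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled))


# Synthetic data: two well-separated blobs, only four labeled instances.
rng = random.Random(1)
neg = [[rng.gauss(-2, 0.7), rng.gauss(-2, 0.7)] for _ in range(100)]
pos = [[rng.gauss(2, 0.7), rng.gauss(2, 0.7)] for _ in range(100)]
labeled = [(neg[0], -1), (neg[1], -1), (pos[0], 1), (pos[1], 1)]
unlabeled = neg[2:] + pos[2:]
truth = [-1] * 98 + [1] * 98

w, b = best_of_runs(labeled, unlabeled)
pred = [1 if score(w, b, x) > 0 else -1 for x in unlabeled]
accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
print(accuracy)
```

Taking the best of several runs matters because the objective is non-convex: different random orders can end in different local optima.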
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
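A small sketch of label propagation on a nearest-neighbor similarity graph. It assumes binary labels coded as +1 / -1, an unweighted symmetric k-nearest-neighbor graph, and clamping of the known labels after every sweep; the function names and the toy points are hypothetical.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as a list of adjacency sets."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs


def label_propagation(points, labels, k=2, max_iter=1000, tol=1e-6):
    """labels[i] is +1, -1, or None (unlabeled). Returns a score in [-1, 1] per instance."""
    nbrs = knn_graph(points, k)
    scores = [float(y) if y is not None else 0.0 for y in labels]
    for _ in range(max_iter):
        new = []
        for i, y in enumerate(labels):
            if y is not None:
                new.append(float(y))  # clamp the known labels
            else:
                # Propagate: take the average score of the neighbours.
                new.append(sum(scores[j] for j in nbrs[i]) / len(nbrs[i]))
        change = max(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if change < tol:  # repeat until convergence
            break
    return scores


# Hypothetical toy data: two groups of points, one labeled instance in each.
points = [[0, 0], [1, 0], [2, 0], [3, 0], [10, 0], [11, 0], [12, 0], [13, 0]]
labels = [1, None, None, None, None, None, None, -1]
scores = label_propagation(points, labels)
print([round(s, 3) for s in scores])
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is why the procedure is equivalent to a matrix inversion; the iterative form avoids building that matrix explicitly.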
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
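One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion rates. The sketch below assumes a binary conversion metric and large samples; the counts are made up for illustration and do not come from any experiment in these slides.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical counts: 20,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(round(z, 2), round(p_value, 3))

# An A/A comparison with identical counts shows no difference at all.
z_aa, p_aa = two_proportion_z_test(1000, 20000, 1000, 20000)
```

A p-value below the chosen threshold (0.05 is a common convention) suggests the difference is unlikely to be chance alone; the randomized assignment of users to A and B is what allows the difference to be read as causal.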
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
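A quick sanity check in the spirit of the examples above: flag any single value that accounts for a suspiciously large share of a field. The 5% threshold, the function name, and the generated records are arbitrary assumptions for illustration.

```python
from collections import Counter


def suspicious_values(values, threshold=0.05):
    """Return each value whose share of all records exceeds `threshold`."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > threshold}


# Hypothetical birth-date field: 150 distinct dates plus two placeholder entries.
real_dates = [f"03/{day:02d}/{year}" for year in range(70, 75) for day in range(1, 31)]
birth_dates = real_dates + ["01/01/01"] * 30 + ["11/11/11"] * 20
flagged = suspicious_values(birth_dates)
print(sorted(flagged))
```

A flagged value is only a prompt to investigate; some fields legitimately have a dominant value.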
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data for the Experiment
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
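Step 5 can be sketched with k-fold cross-validation and a simple interval around the mean fold accuracy. The threshold classifier, the synthetic one-dimensional data, and the normal-approximation interval are assumptions chosen to keep the example short; any learner with `fit` and `predict` functions of the same shape could be substituted.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(fit, predict, xs, ys, k=5):
    """Return the accuracy measured on each of the k held-out folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)           # the test fold is never seen during training
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return scores


def mean_with_ci(scores, z=1.96):
    """Mean accuracy and the half-width of an approximate 95% interval across folds."""
    m = sum(scores) / len(scores)
    var = sum((s - m) ** 2 for s in scores) / (len(scores) - 1)
    return m, z * math.sqrt(var / len(scores))


# Toy learner (assumes class 1 has the larger mean): threshold halfway between class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100
scores = cross_validate(fit_threshold, predict_threshold, xs, ys)
mean_acc, half_width = mean_with_ci(scores)
print(f"accuracy = {mean_acc:.3f} +/- {half_width:.3f}")
```

Fold scores are not fully independent because the training sets overlap, so the interval should be read as a rough guide rather than an exact 95% bound.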
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature construction
Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
11. Unlabeled set: If the number of labeled examples is small, a large data set of unlabeled data with the same distribution should be collected and made a gold set. This enables the discovery of new features using intermediate classifiers and active labeling.
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no

gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments - get feedback from users to serve as labels
• Mechanical Turk - pay people to label data to build a training set
• Ask Users to Label Data - report as spam, 'hot or not', review a product, observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can't get labeled Training Data?
Traditional Supervised Learning
• Promotion on a bookseller's web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?
Training Data (Attributes: Age, Income; Target/Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (from the Model)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low-density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
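The "cluster, then perform a majority vote" recipe can be sketched with plain k-means. The initial centers, the function names, and the toy points are assumptions for illustration; any clustering algorithm could take the place of k-means here.

```python
from collections import Counter


def kmeans(points, centers, n_iter=20):
    """Plain k-means (Lloyd's algorithm) from the given initial centers."""
    assign = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_then_label(points, labels, centers):
    """labels[i] is a class or None. Each cluster takes the majority label of its labeled members."""
    assign = kmeans(points, [list(c) for c in centers])
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    # Clusters without any labeled member stay unlabeled (None).
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]


# Hypothetical toy data: four labeled and four unlabeled instances.
points = [[1, 1], [1.5, 2], [2, 1], [8, 8], [9, 8.5], [8.5, 9], [1.2, 1.4], [8.8, 8.2]]
labels = ["+", "+", None, "-", None, "-", None, None]
predicted = cluster_then_label(points, labels, centers=[points[0], points[3]])
print(predicted)
```

The approach works only as well as the clustering assumption holds: if a class spans several clusters, or a cluster mixes classes, the majority vote assigns wrong labels.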
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters, given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of the cluster-assignment probabilities for each instance
• Might converge to a local optimum
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adapted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance of the decision boundary) to both labeled and unlabeled instances
• Parameter to fine-tune the influence of the unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of the SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send the data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1840s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Getting Data for the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define Objective -> Access and Understand the Data -> Pre-processing -> Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection -> Train/Test split -> Model training -> Model scoring -> Evaluation
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
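Step 5 can be sketched as follows: split the data into k folds, score the model on each held-out fold, and report the mean with an interval rather than a single number. This is a rough illustration, not a prescribed procedure. The helper names and the per-fold accuracies are hypothetical, and the normal-approximation interval is only approximate because fold scores are not fully independent.

```python
import math
import statistics

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n).

    Assumes the rows were shuffled beforehand.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def mean_with_interval(scores, z=1.96):
    """Mean of per-fold scores with a rough 95% normal-approximation interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

folds = list(k_fold_indices(n=10, k=3))

# Hypothetical per-fold accuracies; in practice each comes from training on
# train_idx and evaluating on test_idx.
fold_scores = [0.80, 0.84, 0.82, 0.78, 0.86]
mean, low, high = mean_with_interval(fold_scores)
print(f"accuracy = {mean:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

If a test metric lands far above this interval, suspect a target leak before celebrating.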
Deriving Knowledge from Data at Scale
Define Objective -> Access and Understand the Data -> Pre-processing -> Feature and/or Target construction -> Feature selection -> Train/Test split -> Model training -> Model scoring -> Evaluation
6. Iterate steps (2)-(5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data -> Pre-processing -> Feature construction -> Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course…
Deriving Knowledge from Data at Scale
Greatest Challenge in Machine Learning
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
Train ML Model
gender  age  smoker  eye color  lung cancer
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning: Lack of Labelled Training Data…
What to Do
• Controlled Experiments – get feedback from users to serve as labels
• Mechanical Turk – pay people to label data to build a training set
• Ask Users to Label Data – report as spam, ‘hot or not’, review a product; observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?
Training Data (Attributes: Age, Income; Target/Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
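The "cluster, then perform majority vote" recipe can be sketched as below. This is a minimal illustration, assuming plain k-means with hand-picked starting centers. The points, labels and function names are hypothetical.

```python
from collections import Counter

def kmeans(points, centers, iterations=10):
    """Plain k-means from the given starting centers; returns (centers, assignments)."""
    for _ in range(iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [min(range(len(centers)),
                      key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
                  for pt in points]
        new_centers = []
        for c in range(len(centers)):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        centers = new_centers
    return centers, assign

def cluster_then_label(points, labels, centers):
    """labels[i] is a class label, or None for an unlabeled point."""
    _, assign = kmeans(points, centers)
    votes = {}
    for a, lab in zip(assign, labels):
        if lab is not None:
            votes.setdefault(a, Counter())[lab] += 1
    # Every point in a cluster receives that cluster's majority label.
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]

# Hypothetical data: two well-separated groups, one labeled point in each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["-", None, None, "+", None, None]
predicted = cluster_then_label(points, labels, centers=[points[0], points[3]])
print(predicted)
```

The approach is only as good as the assumption: if the classes do not form distinct clusters, the majority vote spreads wrong labels.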
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
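The two EM steps can be sketched for the simplest case, a two-component mixture in one dimension. The data and starting means are hypothetical. A semi-supervised variant would, as one option, fix the assignment probabilities of labeled instances to their known class.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, means, iterations=50):
    """EM for a two-component 1-D Gaussian mixture, starting from the given means."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each component generated each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location, shape, and weight of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
            weights[k] = nk / len(xs)
    return means, variances, weights, resp

# Hypothetical data drawn around 1.0 and 5.0.
xs = [0.8, 1.0, 1.2, 0.9, 1.1, 4.8, 5.0, 5.2, 4.9, 5.1]
means, variances, weights, resp = em_two_gaussians(xs, means=[0.0, 6.0])
print(means, weights)
```

Different starting means can lead to different solutions, which is the local-optimum caveat on the slide.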
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
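The four self-training bullets map onto a short loop. The sketch below uses a nearest-centroid classifier as a stand-in for the model. That choice, the function names and the data are assumptions made for illustration.

```python
def centroids(points, labels):
    """Per-class mean of the points carrying each label."""
    sums, counts = {}, {}
    for pt, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0.0] * len(pt))
        for i, v in enumerate(pt):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, pt):
    """Label of the nearest class centroid."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(pt, model[lab])))

def self_train(labeled, labels, unlabeled, max_rounds=20):
    model = centroids(labeled, labels)                       # learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, pt) for pt in unlabeled]  # apply to unlabeled
        if new_pseudo == pseudo:                             # converged: labels stable
            break
        pseudo = new_pseudo
        model = centroids(labeled + unlabeled, labels + pseudo)  # relearn on all
    return model, pseudo

# Hypothetical data: two labeled points and four unlabeled ones.
labeled = [(1.0, 1.0), (5.0, 5.0)]
labels = ["-", "+"]
unlabeled = [(1.5, 1.2), (2.0, 2.5), (4.0, 4.5), (5.5, 4.8)]
model, pseudo = self_train(labeled, labels, unlabeled)
print(pseudo)
```

Self-training can reinforce its own early mistakes, so practical variants often add only the most confident pseudo-labels in each round.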
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize distance (margin) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
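A single-machine sketch of the stochastic gradient descent idea, under several stated assumptions: the classifier is linear through the origin (no bias term), labeled instances use a hinge loss, unlabeled instances inside the margin are pushed toward the side the classifier currently predicts, and the class-balance constraint is omitted. The function names, parameter values and data are hypothetical; this is not the Hadoop implementation from the slide.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, rng, lr0=0.1, lam=0.01, c_unlabeled=0.5):
    """One pass of SGD over labeled and unlabeled data in random order."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        lr = lr0 / (1 + t)                       # steps get smaller over time
        score = dot(w, x)
        if y is None:
            # Unlabeled: skip if no side is predicted yet or if outside the margin.
            if score == 0 or abs(score) >= 1:
                continue
            y, strength = (1 if score > 0 else -1), c_unlabeled
        else:
            if y * score >= 1:
                continue                         # labeled and outside the margin
            strength = 1.0
        # Shrink the weights (regularization), then move toward the instance's side.
        w = [(1 - lr * lam) * wi + lr * strength * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam=0.01, c_unlabeled=0.5):
    hinge = sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    hat = sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + c_unlabeled * hat

def best_of_runs(labeled, unlabeled, runs=20, seed=0):
    """Many random runs; keep the one with the lowest objective."""
    rng = random.Random(seed)
    candidates = [s3vm_sgd_pass(labeled, unlabeled, rng) for _ in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

# Hypothetical data: one labeled point per class, four unlabeled points.
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(1.5, 2.5), (2.5, 1.5), (-1.5, -2.5), (-2.5, -1.5)]
w = best_of_runs(labeled, unlabeled)
print(w)
```

Because the objective is non-convex, different random orders can end in different classifiers, which is why the slide suggests many runs.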
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: GET THE DATA
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
gender age smoker eye color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
yes
no
no
Train ML Model
Deriving Knowledge from Data at Scale
The greatest challenge in Machine LearningLack of Labelled Training Datahellip
What to Do
bull Controlled Experiments ndash get feedback from user to serve as labels
bull Mechanical Turk ndash pay people to label data to build training set
bull Ask Users to Label Data ndash report as spam lsquohot or notrsquo review a productobserve their click behavior (ad retargeting search results etc)
Deriving Knowledge from Data at Scale
What if you cant get labeled Training Data
Traditional Supervised Learning
bull Promotion on bookseller rsquos web page
bull Customers can rate books
bull Will a new customer like this book
bull Training set observations on previous customers
bull Test set new customers
Whathappensif only few customers rate a book
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Model
Test Data
Prediction
Training Data
Attributes
Target
Label
copy 2013 Datameer Inc All rights reserved
Age Income LikesBook
24 60K +
65 80K -
60 95K -
35 52K +
20 45K +
43 75K +
26 51K +
52 47K -
47 38K -
25 22K -
33 47K +
Age Income LikesBook
22 67K
39 41K
Age Income LikesBook
22 67K +
39 41K -
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
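A rough sketch of EM for a two-component, one-dimensional mixture of Gaussians may help make the two steps concrete. The data, the initialization at the minimum and maximum value, and the choice to fix the assignment of labeled instances are assumptions for illustration only.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, labels, iters=50):
    """labels[i] is 0, 1, or None (unlabeled). Returns the fitted parameters
    and the per-instance cluster assignment probabilities."""
    means = [min(xs), max(xs)]      # crude initialization (an assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: estimate assignment probabilities for each instance.
        resp = []
        for x, y in zip(xs, labels):
            if y is not None:
                r = [1.0 if k == y else 0.0 for k in (0, 1)]  # label is known
            else:
                dens = [weights[k] * normal_pdf(x, means[k], variances[k])
                        for k in (0, 1)]
                total = sum(dens)
                r = [d / total for d in dens]
            resp.append(r)
        # M-step: re-estimate location, shape and weight of each cluster.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)   # guard against collapse
            weights[k] = nk / len(xs)
    return means, variances, weights, resp

xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 4.9]
labels = [0, None, None, None, 1, None, None, None]
means, variances, weights, resp = em_two_gaussians(xs, labels)
print(means)
```

Because the result depends on the starting point, a different initialization can end in a different (local) optimum, which is the caveat in the last bullet.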
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
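The four self-training steps above can be sketched as a short loop. The nearest-centroid "model", the toy data and all function names are assumptions chosen to keep the example self-contained; any classifier could take their place.

```python
import math

def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(model, p):
    return min(model, key=lambda y: math.dist(p, model[y]))

def self_train(labeled, unlabeled, max_rounds=20):
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    model = fit_centroids(points, labels)            # learn on labeled only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]   # apply model
        if new_pseudo == pseudo:                     # nothing changed: converged
            break
        pseudo = new_pseudo
        # Learn a new model on all instances (true labels plus pseudo-labels).
        model = fit_centroids(points + unlabeled, labels + pseudo)
    return model, pseudo

labeled = [((1.0, 1.0), "+"), ((9.0, 9.0), "-")]
unlabeled = [(2.0, 1.5), (3.0, 3.0), (8.0, 8.5), (7.0, 7.0), (4.5, 4.0)]
model, pseudo = self_train(labeled, unlabeled)
print(pseudo)
```

Note that an early wrong pseudo-label is fed back into training, so mistakes can reinforce themselves; this sketch does nothing to guard against that.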
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
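One common way to write down a linear semi-supervised SVM is shown below. The notation is an assumption, not taken from the slides: l labeled instances (x_i, y_i), u unlabeled instances x_j, C for the labeled loss and C* for the influence of the unlabeled instances.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j)\rvert\bigr),
\qquad f(x) = w^{\top}x + b

\text{subject to}\quad
\frac{1}{u}\sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i
```

The third term penalizes unlabeled instances that fall inside the margin on either side, which pushes the boundary into a low-density region; the absolute value in that term is what makes the problem non-convex. The constraint is a relaxed form of the class-balance requirement.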
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
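A single-machine sketch of one such SGD pass is given below. It is an illustration under stated assumptions, not the Hadoop implementation described on the slide: the step size 1/sqrt(t), the parameter names `c_unlabeled` and `lam`, the rule of skipping unlabeled instances while the score is exactly zero, and the omission of the class-balance constraint are all simplifications made here.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm_pass(labeled, unlabeled, c_unlabeled=0.5, lam=0.01, seed=0):
    """One SGD pass for a linear classifier f(x) = w.x + b.

    labeled: list of (x, y) with y in {+1, -1}; unlabeled: list of x.
    c_unlabeled scales the influence of unlabeled instances.
    """
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)              # one run in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data, start=1):
        eta = 1.0 / math.sqrt(t)                   # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            if score == 0.0:
                continue                           # no predicted side yet
            # Use the currently predicted side as a pseudo-label.
            y, weight = (1 if score > 0 else -1), c_unlabeled
        else:
            weight = 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]   # shrink w (regularization)
        if y * score < 1.0:                        # misclassified or in margin
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
            b += eta * weight * y
    return w, b

labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [(3.0, 2.5), (2.5, 3.0), (0.3, 0.4), (-3.0, -2.0), (-2.0, -3.0)]
w, b = sgd_s3vm_pass(labeled, unlabeled)
predictions = [1 if dot(w, x) + b > 0 else -1 for x in unlabeled]
print(predictions)
```

Since the objective is non-convex, the outcome depends on the random order; running the pass with several seeds and keeping the best classifier mirrors the "many random runs" bullet.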
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected
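Building such a similarity graph can be sketched as follows. Euclidean distance as the (inverse) similarity score, the symmetric treatment of edges and the toy points are assumptions for illustration.

```python
import math

def knn_graph(points, k):
    """Connect each point to its k nearest neighbors (edges are undirected)."""
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        # Rank all other points by distance to point i.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in ranked[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
graph = knn_graph(points, k=1)
print(graph)
```

In this toy example the graph splits into two connected groups, the three points on the left and the two on the right, which is the structure a graph-based learner would then exploit.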
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
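A minimal sketch of iterative label propagation on a small graph is shown below. The clamping of labeled nodes, the averaging update and the five-node chain are assumptions picked for illustration; real implementations usually work with a weighted similarity matrix.

```python
def propagate_labels(neighbors, seeds, iters=100, tol=1e-9):
    """neighbors: {node: set of nodes}; seeds: {node: +1.0 or -1.0}.
    Returns a score per node; the sign is the predicted class."""
    scores = {v: seeds.get(v, 0.0) for v in neighbors}
    for _ in range(iters):
        new = {}
        for v, nbrs in neighbors.items():
            if v in seeds:
                new[v] = seeds[v]                  # labeled nodes stay fixed
            elif nbrs:
                # Unlabeled nodes take the average score of their neighbors.
                new[v] = sum(scores[u] for u in nbrs) / len(nbrs)
            else:
                new[v] = 0.0
        change = max(abs(new[v] - scores[v]) for v in neighbors)
        scores = new
        if change < tol:                           # repeat until convergence
            break
    return scores

# A chain 0 - 1 - 2 - 3 - 4 with one labeled node at each end.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
scores = propagate_labels(chain, seeds={0: 1.0, 4: -1.0})
print(scores)
```

The fixed point of this update is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is "equivalent to matrix inversion".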
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator    Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
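As one example of such a statistical test, a two-sided two-proportion z-test on conversion counts can be sketched as below. The counts are made up for illustration, and the normal approximation is an assumption that only holds for reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and treatment (B).
    Returns the z statistic and the two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 200 of 10,000 users convert in A, 260 of 10,000 in B.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(round(z, 2), round(p_value, 4))

# Identical groups, as in an A/A test, give z = 0 and p = 1.
z_aa, p_aa = two_proportion_z_test(200, 10_000, 200, 10_000)
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it is the random assignment of users to A and B that lets the difference be read as causal.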
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same
A    B
Deriving Knowledge from Data at Scale
• A was 8.5% better
Deriving Knowledge from Data at Scale
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, “popular searches”
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A    B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw!
Examples
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
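A quick sanity check for the first two examples is to look for single values that account for an implausibly large share of a field. The sketch below is only an illustration: the 5% threshold, the function name and the synthetic birth dates are assumptions.

```python
from collections import Counter

def suspicious_values(values, max_share=0.05):
    """Return every value whose share of the field exceeds max_share."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > max_share}

# Synthetic field: 30 placeholder entries plus 70 distinct, plausible dates.
birth_dates = ["01/01/01"] * 30 + [
    f"{(i % 12) + 1:02d}/{(i % 28) + 1:02d}/{1950 + i}" for i in range(70)
]
flagged = suspicious_values(birth_dates)
print(flagged)
```

A flagged value is not proof of a data problem, only a prompt to go and find the flaw before trusting a statistic built on that field.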
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly when he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Cultural stages: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Workflow stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. Target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Workflow stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
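How cross-validation yields a metric with an interval around it can be sketched as follows. Everything here is an assumption for illustration: the threshold "model", the toy data, and the normal-approximation interval (with only a handful of folds, a t-based interval would be more defensible).

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return mean accuracy over k folds and an approximate 95% half-width."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, so the held-out fold stays unseen.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(k)
    return mean, half_width

# Toy model: a threshold halfway between the two class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict_threshold(threshold, x):
    return x > threshold

xs = list(range(10)) + list(range(20, 30))     # two well-separated groups
ys = [x >= 20 for x in xs]
mean_acc, half_width = cross_validate(xs, ys, fit_threshold, predict_threshold)
print(mean_acc, half_width)
```

On this deliberately easy data every fold is perfect; on real data, a result this clean is exactly the kind of "unbelievably good" metric that should trigger a check for target leaks.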
Deriving Knowledge from Data at Scale
Workflow stages: Define Objective; Access and Understand the data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2) – (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Scoring stages: Access Data; Pre-processing; Feature construction; Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course…
Deriving Knowledge from Data at Scale
The greatest challenge in Machine Learning? Lack of Labelled Training Data…
What to Do?
• Controlled Experiments – get feedback from user to serve as labels
• Mechanical Turk – pay people to label data to build training set
• Ask Users to Label Data – report as spam, ‘hot or not?’, review a product, observe their click behavior (ad retargeting, search results, etc.)
Deriving Knowledge from Data at Scale
What if you can’t get labeled Training Data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only few customers rate a book?
Training Data (Attributes: Age, Income; Target / Label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +
Test Data
Age  Income  LikesBook
22   67K
39   41K
Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -
© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B

• A was 85 better

A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data, prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
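Placeholder values like the dummy birth dates above can be surfaced with a simple frequency check. This is a minimal sketch under assumptions: the field is available as a list of strings, and the helper name `suspicious_values`, the 5% threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs for values that account for an unusually
    large share of a field, which often signals a default or dummy entry."""
    counts = Counter(values)
    total = len(values)
    return [(value, count / total)
            for value, count in counts.most_common()
            if count / total >= min_share]

# Made-up sample: 50 distinct plausible dates plus two over-represented dummies
birth_dates = ["01/01/01"] * 30 + ["11/11/11"] * 20 + [
    f"{month:02d}/15/{year:02d}" for month in range(1, 11) for year in range(70, 75)
]
flagged = suspicious_values(birth_dates)
print(flagged)
```

The threshold is a judgment call: a legitimately common value (a popular country, say) will also be flagged, so the output is a list of things to inspect, not a list of errors.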
Deriving Knowledge from Data at Scale
Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936; 44(2)

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: for feature selection, a large set of machine learning algorithms

[Figure: experiment canvas with callouts: Getting Data For the Experiment | Splitting to Training and Testing Datasets | Using classification algorithms | Evaluating the model]
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
[Figure: workflow stages: Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction]
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

[Figure: workflow stages: Feature selection | Model training | Model scoring | Evaluation | Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
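Step 5 mentions cross-validation and confidence intervals. The sketch below shows one simple way to produce k-fold splits and summarize per-fold scores with a normal-approximation interval. The helper names `k_fold_indices` and `mean_with_interval`, and the example scores, are illustrative assumptions; because folds share training data, the interval is a rough guide rather than an exact one.

```python
import math
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx

def mean_with_interval(scores, z=1.96):
    """Mean of per-fold scores with an approximate 95% interval (z = 1.96)."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width

# Hypothetical per-fold accuracies; in practice, train on train_idx and
# score on test_idx inside the loop over k_fold_indices(...)
fold_scores = [0.7, 0.8, 0.9]
mean, low, high = mean_with_interval(fold_scores)
print(f"accuracy = {mean:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

Any feature that is computed using the target, or using data from the test fold, must be rebuilt inside each fold; otherwise the split hides a target leak instead of exposing it.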
Deriving Knowledge from Data at Scale
[Figure: workflow stages: Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split]
6. Iterate steps (2) – (5) until the test metrics are satisfactory

[Figure: Access Data | Pre-processing | Feature construction | Model scoring]

Book Recommendation

That’s all for our course…
Deriving Knowledge from Data at Scale
What if you can’t get labeled training data?
Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers
What happens if only a few customers rate a book?

Training Data (attributes: Age, Income; target label: LikesBook)
Age  Income  LikesBook
24   60K     +
65   80K     -
60   95K     -
35   52K     +
20   45K     +
43   75K     +
26   51K     +
52   47K     -
47   38K     -
25   22K     -
33   47K     +

Test Data
Age  Income  LikesBook
22   67K
39   41K

Prediction (Model applied to Test Data)
Age  Income  LikesBook
22   67K     +
39   41K     -

© 2013 Datameer, Inc. All rights reserved.
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data?
In theory: no
... but we can make assumptions
Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
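The "cluster, then perform majority vote" recipe can be sketched in a few lines. This is a minimal illustration, assuming a plain k-means with caller-supplied initial centers; the function names and the toy points are assumptions, not taken from the slides.

```python
from collections import Counter

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def k_means(points, centers, n_iter=20):
    """Plain Lloyd iterations; returns the final cluster index per point."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        assignment = assign(points, centers)
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return assign(points, centers)

def cluster_then_vote(points, labels, initial_centers):
    """labels[i] is a class label, or None for an unlabeled instance.
    Every instance receives the majority label of its cluster."""
    assignment = k_means(points, initial_centers)
    votes = {}
    for cluster, label in zip(assignment, labels):
        if label is not None:
            votes.setdefault(cluster, Counter())[label] += 1
    return [votes[c].most_common(1)[0][0] if c in votes else None for c in assignment]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["+", None, None, "-", None, None]
predicted = cluster_then_vote(points, labels, initial_centers=[(1.0, 1.0), (5.0, 5.0)])
print(predicted)
```

A cluster with no labeled member gets no prediction (`None`), which is one visible way the approach fails when the clustering assumption does not hold.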
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
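The two EM steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration under assumptions: labeled instances have their assignment probabilities clamped to their known component (one common semi-supervised variant), and the function names, starting values, and data are made up.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, labels, means, variances, weights, n_iter=50):
    """labels[i] is a component index for labeled instances, None otherwise.
    Returns (means, variances, weights, assignment probabilities)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(n_iter):
        # E-step: assignment probabilities for each instance
        resp = []
        for x, label in zip(data, labels):
            if label is not None:
                resp.append([1.0 if j == label else 0.0 for j in range(k)])
                continue
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate location, shape and weight of each cluster
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var_j, 1e-6)  # floor avoids a collapsed cluster
            weights[j] = n_j / len(data)
    return means, variances, weights, resp

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
labels = [0, None, None, 1, None, None]
means, variances, weights, resp = em_gmm_1d(
    data, labels, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(means)
```

The result depends on the starting values, which is where the local-optimum caveat shows up in practice; a common remedy is to run from several initializations and keep the best likelihood.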
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
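The four self-training steps map directly onto a loop. The sketch below uses a nearest-centroid classifier as a stand-in base model; that choice, the function names, and the toy data are assumptions for illustration only.

```python
def fit_centroids(points, labels):
    """Base model: one centroid per class."""
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {label: [sum(coord) / len(members) for coord in zip(*members)]
            for label, members in grouped.items()}

def predict(centroids, point):
    """Class of the nearest centroid."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], point)))

def self_train(labeled_points, labeled_y, unlabeled_points, max_rounds=20):
    # 1. Learn model on labeled instances only
    model = fit_centroids(labeled_points, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # 2. Apply model to unlabeled instances
        new_pseudo = [predict(model, p) for p in unlabeled_points]
        # 4. Stop once the pseudo-labels no longer change
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # 3. Learn new model on all instances
        model = fit_centroids(labeled_points + unlabeled_points, labeled_y + pseudo)
    return model, pseudo

labeled_points = [(0.0, 0.0), (10.0, 10.0)]
labeled_y = ["+", "-"]
unlabeled_points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (8.0, 9.0)]
model, pseudo_labels = self_train(labeled_points, labeled_y, unlabeled_points)
print(pseudo_labels)
```

Self-training can reinforce its own early mistakes; a frequent refinement is to add only the most confidently predicted instances in each round.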
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize margin to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
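One common way to write the semi-supervised SVM objective is shown below. It is a standard textbook formulation, not necessarily the exact one behind these slides; the symbols ($L$ and $U$ for the labeled and unlabeled index sets, $C$ and $C^{*}$ for the two trade-off parameters) are notation assumed here.

```latex
\min_{w,\,b}\quad \frac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i \in L} \max\bigl(0,\ 1 - y_i\, f(x_i)\bigr)
  + C^{*} \sum_{j \in U} \max\bigl(0,\ 1 - \lvert f(x_j)\rvert\bigr)
\qquad \text{subject to} \qquad
\frac{1}{|U|}\sum_{j \in U} f(x_j) = \frac{1}{|L|}\sum_{i \in L} y_i,
\qquad f(x) = w^{\top}x + b
```

The first sum is the usual hinge loss on labeled instances. The second penalizes unlabeled instances that fall inside the margin on either side, with $C^{*}$ controlling their influence. The constraint keeps the predicted class balance close to the labeled one. The term $\lvert f(x_j)\rvert$ is what makes the problem non-convex.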
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find best one
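A single-machine version of one such pass can be sketched as follows. This is a minimal illustration under assumptions, not the Hadoop implementation: the classifier has no bias term, the step size decays as lr0 / (1 + t), unlabeled instances with a score of exactly zero are skipped, and all names, parameter values, and data points are made up.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm_pass(labeled, unlabeled, c_unlabeled=0.5, lr0=0.1, reg=0.01, seed=0):
    """One pass over the data in random order.
    labeled: list of (x, y) with y in {+1, -1}; unlabeled: list of x."""
    examples = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(examples)
    w = [0.0] * len(examples[0][0])
    for t, (x, y) in enumerate(examples):
        lr = lr0 / (1 + t)          # steps get smaller over time
        score = dot(w, x)
        w = [wi * (1 - lr * reg) for wi in w]   # regularization shrinkage
        if y is not None:
            if y * score < 1:       # labeled instance inside margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        elif 0 < abs(score) < 1:    # unlabeled instance inside the margin
            guess = 1 if score > 0 else -1      # push it away from the boundary
            w = [wi + lr * c_unlabeled * guess * xi for wi, xi in zip(w, x)]
    return w

labeled = [((2.0, 0.5), 1), ((-2.0, 0.5), -1)]
unlabeled = [(2.0, -0.5), (3.0, 0.0), (-2.0, -0.5), (-3.0, 0.0)]
weights = sgd_s3vm_pass(labeled, unlabeled)
print(weights)
```

Because the objective is non-convex, different random orders can end in different classifiers; running several seeds and keeping the run with the lowest objective value mirrors the "many random runs" bullet.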
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
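A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch assumes Euclidean distance and a Gaussian similarity score; the function name `knn_similarity_graph`, the `bandwidth` parameter, and the sample points are illustrative assumptions.

```python
import math

def knn_similarity_graph(points, k, bandwidth=1.0):
    """Symmetric weight matrix connecting each instance to its k nearest
    neighbors, weighted by exp(-distance^2 / (2 * bandwidth^2))."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((math.dist(points[i], points[j]), j)
                           for j in range(n) if j != i)
        for distance, j in neighbors[:k]:
            similarity = math.exp(-distance ** 2 / (2 * bandwidth ** 2))
            weights[i][j] = similarity
            weights[j][i] = similarity  # keep the graph undirected
    return weights

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
graph = knn_similarity_graph(points, k=1)
print(graph[0][1], graph[1][2])
```

The brute-force search is quadratic in the number of instances, so it only suits small data; larger sets usually need an approximate nearest-neighbor index.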
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
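The propagate-and-repeat idea can be sketched as an iterative averaging over a similarity graph. This is a minimal illustration, assuming two classes encoded as +1 and -1, labeled instances held fixed, and a dense weight matrix; the function name and the four-node chain graph are made up.

```python
def propagate_labels(weights, labels, max_iter=1000, tol=1e-9):
    """weights: symmetric n x n similarity matrix (list of lists).
    labels: +1 / -1 for labeled instances, None for unlabeled ones.
    Returns one score per instance; its sign is the predicted class."""
    n = len(weights)
    scores = [0.0 if y is None else float(y) for y in labels]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            degree = sum(weights[i])
            if labels[i] is not None or degree == 0:
                updated.append(scores[i])  # labeled or isolated: keep as is
            else:
                # Weighted average of the neighbors' current scores
                updated.append(sum(w * s for w, s in zip(weights[i], scores)) / degree)
        change = max(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:  # converged
            break
    return scores

# Chain graph 0 - 1 - 2 - 3 with the two end points labeled
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
scores = propagate_labels(chain, [1, None, None, -1])
print(scores)
```

At convergence every unlabeled score equals the weighted average of its neighbors, which is a linear system in the unlabeled scores. Solving that system directly, instead of iterating, is the matrix-inversion view mentioned under Theory.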
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
Semi-Supervised Learning
Can we make use of the unlabeled data
In theory no
but we can make assumptions
PopularAssumptions
bull Clustering assumption
bull Low density assumption
bull Manifold assumption
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
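The four self-training steps above can be sketched as a short loop. The nearest-centroid "model" and the names `fit_centroids`, `predict`, and `self_train` are stand-ins chosen for brevity, not the method used in the course; any classifier with a fit/predict pair would slot in the same way.

```python
def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(model, p):
    """Class whose centroid is closest to p (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def self_train(points, labels, max_rounds=20):
    """labels[i] is a class or None. Returns (model, labels for all instances)."""
    known = [i for i, y in enumerate(labels) if y is not None]
    # Step 1: learn a model on labeled instances only.
    model = fit_centroids([points[i] for i in known], [labels[i] for i in known])
    current = None
    for _ in range(max_rounds):
        # Step 2: apply the model to unlabeled instances (given labels stay fixed).
        guessed = [y if y is not None else predict(model, p)
                   for p, y in zip(points, labels)]
        # Step 4: stop once the guessed labels no longer change.
        if guessed == current:
            break
        current = guessed
        # Step 3: learn a new model on all instances.
        model = fit_centroids(points, guessed)
    return model, current

points = [(0.0,), (1.0,), (2.0,), (8.0,), (9.0,), (10.0,)]
labels = ["a", None, None, None, None, "b"]
model, final_labels = self_train(points, labels)
print(final_labels)
```

A common refinement, not shown here, is to add only the most confident predictions in each round so that early mistakes are less likely to reinforce themselves.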
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
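One common way to write the semi-supervised SVM objective described above is shown here. The notation is an assumption, not taken from the slides: l labeled pairs (x_i, y_i) with y_i in {-1, +1}, u unlabeled instances x_j, a linear scorer f, a regularization weight λ, and a second weight λ' that tunes the influence of the unlabeled instances.

```latex
\min_{w,\,b}\;\;
  \frac{\lambda}{2}\,\lVert w \rVert^{2}
  + \frac{1}{l}\sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  + \frac{\lambda'}{u}\sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr),
\qquad f(x) = w^{\top} x + b
\\[4pt]
\text{subject to}\quad
  \frac{1}{u}\sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i
```

The second sum penalizes unlabeled instances that fall inside the margin on either side, which pushes the boundary into a low-density region. Because of the absolute value that term is not convex, which is why the problem as a whole is non-convex. The constraint is one way to keep the class balance on the unlabeled data close to that of the labeled data.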
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
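The single-pass update and the "many random runs" selection can be sketched on one machine as follows. This is not the Hadoop implementation from the slides: the names `sgd_pass`, `objective`, and `best_of_runs`, the 1/t step size, and the default weights are all assumptions for illustration, and the class-balance constraint is left out for brevity.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_pass(labeled, unlabeled, lam=0.01, unl_weight=0.5, eta0=1.0, seed=0):
    """One run over the data in random order. labeled: [(x, y)], y in {-1, +1}."""
    rng = random.Random(seed)
    stream = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(stream)
    w, b = [0.0] * len(stream[0][0]), 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / t  # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            if score == 0.0:
                continue  # no side to push toward yet
            # Unlabeled: push away from the boundary, toward the current side.
            y, scale = (1 if score > 0 else -1), unl_weight
        else:
            scale = 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]  # regularization shrink
        if y * score < 1.0:  # misclassified or inside the margin
            w = [wi + eta * scale * y * xi for wi, xi in zip(w, x)]
            b += eta * scale * y
    return w, b

def objective(w, b, labeled, unlabeled, lam=0.01, unl_weight=0.5):
    """Regularizer + hinge loss on labeled + weighted margin loss on unlabeled."""
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in labeled) / len(labeled)
    hat = sum(max(0.0, 1.0 - abs(dot(w, x) + b)) for x in unlabeled) / len(unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + unl_weight * hat

def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the classifier with the lowest objective."""
    candidates = [sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled))

labeled = [((2.0, 1.0), 1), ((-1.0, -2.0), -1)]
unlabeled = [(1.0, 2.0), (3.0, 2.0), (2.0, 2.0), (-2.0, -1.0), (-3.0, -2.0), (-2.0, -2.0)]
w, b = best_of_runs(labeled, unlabeled)
print(w, b)
```

In a MapReduce setting the shuffle would correspond to the mapper emitting instances under random keys and `sgd_pass` to the reducer; each independent run can execute in parallel because the runs share no state.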
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
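A small sketch of building such a similarity graph. The Gaussian similarity exp(-distance²), the symmetrization of edges, and the function name `knn_graph` are assumptions; the slides do not fix a particular similarity measure.

```python
import math

def knn_graph(points, k=2):
    """Return {i: {j: similarity}} linking each point to its k nearest neighbors."""
    def sim(p, q):
        # Assumption: Gaussian similarity on squared Euclidean distance.
        return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)))

    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sim(p, points[j]), reverse=True)
        for j in others[:k]:
            s = sim(p, points[j])
            graph[i][j] = s
            graph[j][i] = s  # keep the graph undirected
    return graph

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
graph = knn_graph(points, k=1)
for node, edges in graph.items():
    print(node, sorted(edges))
```

This brute-force version compares every pair of instances, so it is quadratic in the number of instances; at scale an approximate nearest-neighbor index would normally replace the inner sort.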
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
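The propagation loop can be sketched as below. The graph format ({node: {neighbor: weight}}), the clamping of labeled nodes, and the function name are assumptions made for this illustration.

```python
def label_propagation(weights, labels, classes, iters=1000, tol=1e-9):
    """weights: {i: {j: w_ij}} undirected graph; labels: {i: class} for labeled nodes."""
    nodes = sorted(weights)
    scores = {i: {c: (1.0 if labels.get(i) == c else 0.0) for c in classes}
              for i in nodes}
    for _ in range(iters):
        change = 0.0
        new_scores = {}
        for i in nodes:
            total = sum(weights[i].values())
            if i in labels or total == 0:
                new_scores[i] = scores[i]  # labeled nodes stay clamped
                continue
            # Each unlabeled node takes the weighted average of its neighbors.
            new = {c: sum(w * scores[j][c] for j, w in weights[i].items()) / total
                   for c in classes}
            change = max(change, max(abs(new[c] - scores[i][c]) for c in classes))
            new_scores[i] = new
        scores = new_scores
        if change < tol:  # repeat until convergence
            break
    predicted = {i: max(classes, key=lambda c: scores[i][c]) for i in nodes}
    return predicted, scores

# Chain graph 0 - 1 - 2 - 3 with the two end points labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}, 3: {2: 1.0}}
predicted, scores = label_propagation(chain, {0: "a", 3: "b"}, ["a", "b"])
print(predicted)
```

The "matrix inversion" remark corresponds to the fixed point of this loop: with P the row-normalized weight matrix split into labeled (L) and unlabeled (U) blocks, the converged scores satisfy F_U = (I - P_UU)^(-1) P_UL Y_L, so the iteration can be replaced by solving one linear system when the graph is small enough.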
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator / Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden’s blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
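As an illustration of the "statistical tests" bullet, here is a sketch of a two-sided two-proportion z-test comparing conversion rates in control (A) and treatment (B). The function name and the traffic numbers are made up for the example; real experiments often use other tests depending on the metric.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null hypothesis of equal rates.
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 10,000 users per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same; it is the random assignment of users to A and B, not the test itself, that licenses the causal reading.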
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime was effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936; 44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
[Figure: experiment graph with callouts: getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model]
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection → Train/Test split → Model training → Model scoring → Evaluation
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
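To make the "confidence intervals (cross-validation helps here)" remark concrete, here is a minimal k-fold sketch. The helper names, the toy threshold classifier, and the normal-approximation interval are assumptions for illustration; with only a handful of folds the interval is rough.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return (mean accuracy, approximate 95% interval) over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # The model never sees the held-out fold during training.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = sum(scores) / k
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    half_width = 1.96 * sd / math.sqrt(k)  # rough normal approximation
    return mean, (mean - half_width, mean + half_width)

# Hypothetical one-feature classifier: threshold halfway between the classes.
def fit_threshold(train_x, train_y):
    lo = max(x for x, y in zip(train_x, train_y) if y == 0)
    hi = min(x for x, y in zip(train_x, train_y) if y == 1)
    return (lo + hi) / 2.0

def predict_threshold(threshold, x):
    return 1 if x >= threshold else 0

xs = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
ys = [0] * 10 + [1] * 10
print(cross_validate(xs, ys, fit_threshold, predict_threshold, k=5))
```

A target leak would show up here as near-perfect accuracy on every fold of a problem that should be hard; the toy data above is perfectly separable by construction, so it illustrates the mechanics only.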
Deriving Knowledge from Data at Scale
Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction → Feature selection → Train/Test split → Model training → Model scoring → Evaluation
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data → Pre-processing → Feature construction → Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course…
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumption
bull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
BeyondMixtures of Gaussians
Expectation-Maximization
bull Can be adjusted to all kinds of mixture models
bull Eg use Naive Bayes as mixture model for text classification
Self-Training
bull Learn model on labeled instances only
bull Apply model to unlabeled instances
bull Learn new model on all instances
bull Repeat until convergence
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
The ManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifoldbull One can perform learning in a more meaningful
low-dimensional spacebull Avoids curse of dimensionality
Similarity Graphs
bull Idea compute similarity scores between instances
bull Create network where the nearest
neighbors are connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more meaningful
low-dimensional space
bull Avoids curse of dimensionality
Similarity Graphsbull Idea compute similarity scores between instances
bull Create a network where the nearest neighbors are
connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more
meaningful low-dimensional space
bull Avoids curse of dimensionality
SimilarityGraphs
bull Idea compute similarity scores between instances
bullCreate network where the nearest neighbors
are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learningbull Only few training instances have labels
bull Unlabeled instances can still provide valuable signal
Different assumptions lead to different approachesbull Cluster assumption generative models
bull Low density assumption semi-supervised support vector machines
bull Manifold assumption label propagation
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and
storks when you were three is still not right,
despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Getting data for the experiment → Splitting into training and testing datasets → Using classification algorithms → Evaluating the model
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection, Model training, Model scoring, Evaluation, Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
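The evaluation advice in step 5 can be sketched in a few lines of plain Python. This is an illustration only, not the course's own code: the helper names (`k_fold_indices`, `cross_val_accuracy`, `mean_with_ci`), the toy threshold classifier, and the synthetic data are all assumptions made for the example. The interval is a rough normal approximation over fold scores, which are not fully independent, so it should be read as indicative rather than exact.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(xs, ys, fit, predict, k=5, seed=0):
    """Return one accuracy per fold; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        # The model only ever sees training rows, which guards against leaks.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

def mean_with_ci(scores, z=1.96):
    """Mean and approximate 95% interval (normal approximation over folds)."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

# Hypothetical toy model: a threshold halfway between the two class means.
def fit(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def predict(threshold, x):
    return 1 if x > threshold else 0

# Synthetic data: class 0 centered at 0, class 1 centered at 2.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_val_accuracy(xs, ys, fit, predict, k=5)
mean, low, high = mean_with_ci(scores)
print(f"accuracy = {mean:.3f}, approx. 95% CI [{low:.3f}, {high:.3f}]")
```

If a result like this ever came back near 100% on a hard problem, the target-leak warning above would be the first thing to check.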
Deriving Knowledge from Data at Scale
Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data, Pre-processing, Feature construction, Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform a majority vote
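The "cluster, then perform a majority vote" recipe can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`kmeans`, `cluster_then_label`), the plain k-means implementation, and the two synthetic blobs are all made up for the example, and unlabeled instances are marked with `None`.

```python
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of floats; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign

def cluster_then_label(points, labels, k=2, seed=0):
    """labels[i] is a class or None (unlabeled). Every instance receives the
    majority label of the labeled members of its cluster (None if there are none)."""
    _, assign = kmeans(points, k, seed=seed)
    votes = {c: Counter() for c in range(k)}
    for c, y in zip(assign, labels):
        if y is not None:
            votes[c][y] += 1
    cluster_label = {c: (v.most_common(1)[0][0] if v else None) for c, v in votes.items()}
    return [cluster_label[c] for c in assign]

# Synthetic data: two well-separated blobs, only three labels kept per class.
rng = random.Random(0)
truth = ["a"] * 50 + ["b"] * 50
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)] + \
         [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
labels = [t if i % 50 < 3 else None for i, t in enumerate(truth)]

predicted = cluster_then_label(points, labels, k=2)
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
print(f"accuracy with 6 labels out of 100 instances: {accuracy:.2f}")
```

The approach only works as well as the clustering assumption holds: if the classes do not form distinct clusters, the majority vote spreads the wrong labels.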
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
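The two EM steps can be made concrete with a small one-dimensional mixture of Gaussians. This is a sketch under stated assumptions, not a production implementation: the function name `em_gmm_1d`, the quantile-based initialization, the variance floor, and the synthetic data are choices made for the example.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with EM.
    Returns (weights, means, sigmas, responsibilities)."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: evenly spaced quantiles, one shared spread.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    sigmas = [statistics.pstdev(xs) or 1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(p) or 1e-300
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weight, location, and shape of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing cluster
    return weights, means, sigmas, resp

# Synthetic data: two clusters centered at 0 and 6, unit spread.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 1.0) for _ in range(150)]

weights, means, sigmas, resp = em_gmm_1d(xs, k=2)
print("weights:", [round(w, 2) for w in weights])
print("means:  ", [round(m, 2) for m in means])
print("sigmas: ", [round(s, 2) for s in sigmas])
```

A different initialization can end in a different fixed point, which is the local-optimum caveat on the slide; restarting from several initial guesses is a common workaround.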
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on labeled instances only
• Apply the model to unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
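The four self-training steps can be sketched as a short loop. The base learner here is a nearest-centroid classifier chosen only because it fits in a few lines; the slide does not prescribe one. The names `fit_centroids`, `predict`, and `self_train`, and the synthetic data, are assumptions for illustration.

```python
import random

def fit_centroids(points, labels):
    """Nearest-centroid model: one mean vector per class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}

def predict(model, p):
    """Class whose centroid is closest to p."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def self_train(labeled, labeled_y, unlabeled, max_rounds=20):
    model = fit_centroids(labeled, labeled_y)                 # 1. labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, p) for p in unlabeled]   # 2. apply to unlabeled
        if new_pseudo == pseudo:                              # 4. stop when nothing changes
            break
        pseudo = new_pseudo
        model = fit_centroids(labeled + unlabeled, labeled_y + pseudo)  # 3. retrain on all
    return model, pseudo

# Synthetic data: one labeled instance per class, 80 unlabeled instances.
rng = random.Random(1)
unlabeled = [(rng.gauss(0, 0.7), rng.gauss(0, 0.7)) for _ in range(40)] + \
            [(rng.gauss(5, 0.7), rng.gauss(5, 0.7)) for _ in range(40)]
truth = ["a"] * 40 + ["b"] * 40

model, pseudo = self_train([(1.0, 1.0), (4.0, 4.0)], ["a", "b"], unlabeled)
accuracy = sum(p == t for p, t in zip(pseudo, truth)) / len(truth)
print(f"pseudo-label accuracy: {accuracy:.2f}")
print("centroids:", {y: tuple(round(v, 2) for v in c) for y, c in model.items()})
```

Self-training can also reinforce its own early mistakes, so practical variants often add only the most confident pseudo-labels in each round.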
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
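A single-machine sketch of the procedure above may help. It is not the Hadoop implementation: the shuffle stands in for the mapper, the update loop for the reducer, and `best_of_runs` for the repeated random runs. Several simplifications are assumptions of this example: there is no bias term (the data is assumed centered), the class-balance constraint is omitted, labeled instances use the hinge loss and unlabeled instances a symmetric "hat" loss max(0, 1 - |f(x)|), and all names and default parameters are hypothetical.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, lam=0.01, c_unlabeled=0.5, eta0=1.0, seed=0):
    """One pass of SGD over labeled (x, y) pairs, y in {-1, +1}, and
    unlabeled x (stored with y = None), visited in random order."""
    data = list(labeled) + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)
    w = [0.0] * len(data[0][0])
    for t, (x, y) in enumerate(data):
        eta = eta0 / (1 + t)                       # steps get smaller over time
        f = dot(w, x)
        if y is None:
            # Unlabeled: if the point lies inside the margin, push it further
            # toward the side it currently falls on; skip it when f is exactly 0.
            if f == 0 or abs(f) >= 1:
                direction, scale = 0, 0.0
            else:
                direction, scale = (1 if f > 0 else -1), c_unlabeled
        else:
            # Labeled: hinge loss, active when the margin y * f is below 1.
            direction, scale = (y, 1.0) if y * f < 1 else (0, 0.0)
        w = [(1 - eta * lam) * wi + eta * scale * direction * xi
             for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam=0.01, c_unlabeled=0.5):
    """Regularized loss, used here only to rank the random runs."""
    hinge = sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled) / len(labeled)
    hat = sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled) / len(unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + c_unlabeled * hat

def best_of_runs(labeled, unlabeled, runs=10):
    candidates = [s3vm_sgd_pass(labeled, unlabeled, seed=s) for s in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

# Synthetic, centered data: two blobs, four labeled instances in total.
rng = random.Random(0)
truth = [1] * 100 + [-1] * 100
unlabeled = [(rng.gauss(2 * y, 0.5), rng.gauss(2 * y, 0.5)) for y in truth]
labeled = [((2.0, 2.0), 1), ((1.5, 2.5), 1), ((-2.0, -2.0), -1), ((-2.5, -1.5), -1)]

w = best_of_runs(labeled, unlabeled)
accuracy = sum((1 if dot(w, x) > 0 else -1) == y
               for x, y in zip(unlabeled, truth)) / len(truth)
print("w =", [round(wi, 3) for wi in w], "accuracy =", round(accuracy, 2))
```

Because the problem is non-convex, individual runs can end in different places; ranking them by the objective is one simple way to pick among them.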
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
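A similarity graph of this kind can be built in a few lines. The sketch below assumes a Gaussian similarity exp(-gamma * squared distance) and symmetric edges; both are common choices rather than anything the slide specifies, and the name `knn_graph` and the six sample points are made up for the example.

```python
import math

def knn_graph(points, k=2, gamma=1.0):
    """Return {i: {j: similarity}} linking each point to its k nearest
    neighbors, with similarity exp(-gamma * squared distance).
    An edge is added in both directions so the graph is symmetric."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        by_distance = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for dist2, j in by_distance[:k]:
            s = math.exp(-gamma * dist2)
            graph[i][j] = s
            graph[j][i] = s
    return graph

# Two small groups of points, far apart from each other.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
for node, nbrs in graph.items():
    print(node, {j: round(s, 3) for j, s in sorted(nbrs.items())})
```

With k = 2 the two groups end up in separate connected components, which is the structure graph-based methods rely on.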
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
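The propagation loop can be sketched as follows for a two-class problem. The conventions are assumptions of this example: classes are encoded as +1 and -1, labeled nodes are clamped to their label, and each unlabeled node repeatedly takes the weighted average of its neighbors' scores. The function name and the five-node chain are hypothetical.

```python
def label_propagation(graph, seeds, tol=1e-6, max_iters=1000):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}.
    Returns {node: score in [-1, 1]}; the sign is the predicted class."""
    scores = {v: seeds.get(v, 0.0) for v in graph}
    for _ in range(max_iters):
        change = 0.0
        new_scores = {}
        for v, nbrs in graph.items():
            if v in seeds or not nbrs:
                new_scores[v] = scores[v]          # labeled nodes stay clamped
                continue
            total = sum(nbrs.values())
            new_scores[v] = sum(w * scores[u] for u, w in nbrs.items()) / total
            change = max(change, abs(new_scores[v] - scores[v]))
        scores = new_scores
        if change < tol:                           # repeat until convergence
            break
    return scores

# A chain 0 - 1 - 2 - 3 - 4 with unit weights; only the two ends are labeled.
chain = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
scores = label_propagation(chain, seeds={0: 1.0, 4: -1.0})
print({v: round(s, 3) for v, s in scores.items()})
```

At the fixed point every unlabeled score equals the weighted average of its neighbors, which is a linear system in the unlabeled scores; solving that system directly is the matrix-inversion view mentioned above.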
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
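One common way to check that a difference between A and B is not due to chance is a two-sided z-test on the two conversion rates. The sketch below is an illustration, not the course's procedure: the function name and the visitor and conversion counts are made up, and the normal approximation assumes reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.
    Returns (z, p_value); z > 0 means B converted at a higher rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 10,000 users per variant, 500 vs. 560 conversions.
z, p_value = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The same function applied to an A/A split (two groups that get the identical experience) should report a significant difference only about as often as the chosen significance level, which makes it a useful sanity check on the experiment setup.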
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon,
"popular searches"
B has a big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
The ClusteringAssumption
Clustering
bull Partition instances into groups (clusters) of similar
instances
bull Many different algorithms k-Means EM etc
Clustering Assumptionbull The two classification targets are distinct clusters
bull Simple semi-supervised learning cluster then
perform majority vote
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
BeyondMixtures of Gaussians
Expectation-Maximization
bull Can be adjusted to all kinds of mixture models
bull Eg use Naive Bayes as mixture model for text classification
Self-Training
bull Learn model on labeled instances only
bull Apply model to unlabeled instances
bull Learn new model on all instances
bull Repeat until convergence
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
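
A minimal sketch of the idea, assuming binary labels encoded as +1/-1, a symmetric k-nearest-neighbor graph, and plain neighbor averaging as the propagation rule. The function names and the tolerance are illustrative choices, not taken from the slides. Labeled nodes stay clamped to their labels. The fixed point of the loop is the solution of a linear system in the unlabeled scores, which is where the "equivalent to matrix inversion" remark comes from.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as a list of adjacency sets."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def propagate(nbrs, labels, tol=1e-6, max_iter=1000):
    """labels maps node index -> +1 or -1 for labeled nodes only.

    Returns one score per node; the sign is the predicted class.
    """
    n = len(nbrs)
    f = [float(labels.get(i, 0.0)) for i in range(n)]
    for _ in range(max_iter):
        new = list(f)
        for i in range(n):
            if i not in labels and nbrs[i]:
                # an unlabeled node takes the average score of its neighbors
                new[i] = sum(f[j] for j in nbrs[i]) / len(nbrs[i])
        change = max(abs(a - b) for a, b in zip(new, f))
        f = new
        if change < tol:          # repeat until convergence
            break
    return f
```

For example, with points `[[0], [1], [2], [10], [11], [12]]`, `k=1`, and labels `{0: 1, 5: -1}`, the first three nodes end up with positive scores and the last three with negative scores.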

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
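
One common way to run such a test for a conversion-style metric is a two-proportion z-test. The sketch below is an illustration only: the function name and the choice of test are assumptions, and the slides do not say which test was used in the examples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    conv_*: number of converting users; n_*: number of users in the variant.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value
```

A small p-value says the observed difference would be unlikely if A and B truly performed the same. It says nothing about whether the difference is large enough to matter for the OEC.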

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B

• A was 85 better

A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
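
A quick check for the first two examples is to look for single values that take up a suspiciously large share of a field. The helper below is a hypothetical sketch; the name and the 5% threshold are arbitrary assumptions and would need tuning per field.

```python
from collections import Counter

def suspicious_values(values, share_threshold=0.05):
    """Return (value, count) pairs whose share of the field exceeds the threshold."""
    counts = Counter(values)
    total = len(values)
    return [(v, c) for v, c in counts.most_common() if c / total >= share_threshold]
```

Run on a birth-date column, a placeholder date that users type to get past a mandatory field would show up at the top of the list.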

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris -> Measure and Control -> Accept Results (avoid Semmelweis Reflex) -> Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40-page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
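
Step 5 can be illustrated with a small k-fold cross-validation helper that reports a metric together with a rough interval. This is a sketch under stated assumptions: the helper names are hypothetical, the majority-class "model" is only a stand-in baseline, and the interval uses a normal approximation over fold scores, which are not fully independent, so it should be read as indicative rather than exact.

```python
import random
import statistics

def k_fold_scores(rows, k, train, score, seed=0):
    """Shuffle rows, split into k folds, and score a model trained on the rest."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(train_rows)                 # never sees the held-out fold
        scores.append(score(model, folds[i]))
    return scores

def mean_with_interval(scores, z=1.96):
    """Mean score with an approximate 95% interval across folds."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, mean - half, mean + half

# Stand-in baseline: always predict the most frequent training label.
def train_majority(rows):
    labels = [y for _, y in rows]
    return max(set(labels), key=labels.count)

def accuracy(model, rows):
    return sum(1 for _, y in rows if y == model) / len(rows)
```

Comparing a real model against a baseline like this is also a cheap guard against target leaks: a test metric far above what the features could plausibly support is a reason to re-check how the features were built.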

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course…

The Clustering Assumption
Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, etc.
Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote
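
The "cluster, then perform majority vote" recipe can be sketched as follows. The k-means variant here (farthest-point initialization, fixed iteration count) and all function names are assumptions made to keep the example short and deterministic; any clustering algorithm could take its place.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain k-means; returns the cluster index of every point."""
    centers = [points[0]]
    while len(centers) < k:                       # farthest-point initialization
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def cluster_then_vote(points, labels, k=2):
    """labels[i] is a class label, or None for an unlabeled instance."""
    assign = kmeans(points, k)
    predicted = list(labels)
    for c in range(k):
        votes = [labels[i] for i, a in enumerate(assign)
                 if a == c and labels[i] is not None]
        if not votes:
            continue                              # no labeled instance in this cluster
        majority = max(set(votes), key=votes.count)
        for i, a in enumerate(assign):
            if a == c and labels[i] is None:
                predicted[i] = majority
    return predicted
```

With two well-separated groups and a single labeled instance in each, every unlabeled instance inherits the label of its cluster.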

Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
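
A minimal sketch of EM for a two-component, one-dimensional mixture of Gaussians. The function names, the initialization at the minimum and maximum of the data, and the variance floor are assumptions for illustration. This version is fully unsupervised; a semi-supervised variant would additionally fix the assignment probabilities of labeled instances.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    """Fit weights, means, and variances of a 2-component 1-D Gaussian mixture."""
    means = [min(xs), max(xs)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: cluster assignment probabilities for each instance
        resp = []
        for x in xs:
            dens = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate location and shape of each cluster
        for c in range(2):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / len(xs)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / mass
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / mass
            variances[c] = max(var, 1e-6)         # floor keeps the density finite
    return weights, means, variances, resp
```

The result depends on the starting point, which is the practical meaning of "might converge to a local optimum"; restarting from several initializations is a common precaution.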

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on labeled instances only
• Apply the model to unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
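
The four self-training steps can be sketched with any base learner. The example below uses a nearest-centroid classifier purely because it fits in a few lines; the classifier choice and the function names are assumptions, not something the slides prescribe.

```python
def centroids(points, labels):
    """Nearest-centroid 'model': one mean vector per class."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: [sum(col) / len(ps) for col in zip(*ps)] for y, ps in groups.items()}

def predict(model, p):
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def self_train(labeled_pts, labels, unlabeled_pts, max_rounds=20):
    model = centroids(labeled_pts, labels)                 # learn on labeled only
    guesses = None
    for _ in range(max_rounds):
        new_guesses = [predict(model, p) for p in unlabeled_pts]   # apply to unlabeled
        if new_guesses == guesses:                         # predictions stopped changing
            break
        guesses = new_guesses
        # learn a new model on all instances, using the guessed labels
        model = centroids(labeled_pts + unlabeled_pts, labels + guesses)
    return model, guesses
```

Self-training can reinforce its own early mistakes, so in practice it is common to add only the most confident predictions in each round; that refinement is omitted here.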

The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
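
One common way to write this objective, given here as an illustration rather than as the formulation used in the lecture, is shown below. The notation is an assumption: $l$ labeled instances $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $u$ unlabeled instances $x_j$, and a linear classifier $f(x) = w^\top x + b$.

```latex
\min_{w,\,b}\;
  \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert w^\top x_j + b \rvert\bigr)
\quad \text{subject to} \quad
  \frac{1}{u} \sum_{j=l+1}^{l+u} (w^\top x_j + b) \;=\; \frac{1}{l} \sum_{i=1}^{l} y_i
```

The first two terms are the ordinary SVM. The third term penalizes unlabeled instances that fall inside the margin on either side, $C^{*}$ is the parameter that tunes their influence, and the constraint keeps the predicted class balance close to the one seen in the labeled data. The absolute value in the third term is what makes the problem non-convex.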
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
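
The self-training loop above can be sketched as follows. The nearest-centroid model, the helper names, and the toy points are assumptions chosen to keep the example short; any classifier could take the model's place.

```python
def centroid_fit(points, labels):
    """Nearest-centroid 'model': the mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}


def centroid_predict(model, point):
    """Assign the class whose centroid is closest to the point."""
    return min(
        model,
        key=lambda c: (point[0] - model[c][0]) ** 2 + (point[1] - model[c][1]) ** 2,
    )


def self_train(labeled, labels, unlabeled, max_rounds=20):
    # Learn a model on the labeled instances only.
    model = centroid_fit(labeled, labels)
    pseudo = None
    for _ in range(max_rounds):
        # Apply the model to the unlabeled instances.
        new_pseudo = [centroid_predict(model, p) for p in unlabeled]
        # Convergence: the pseudo-labels no longer change.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # Learn a new model on all instances (labeled plus pseudo-labeled).
        model = centroid_fit(labeled + unlabeled, labels + pseudo)
    return model, pseudo


labeled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["a", "b"]
unlabeled = [(0.5, 0.2), (1.0, 0.8), (3.5, 4.2), (3.0, 3.1)]
model, pseudo = self_train(labeled, labels, unlabeled)
```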

The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of SVM
• But a non-convex optimization problem
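
One common way to write the semi-supervised SVM objective is shown below. The notation is an assumption, since the slides give no formula: L and U are the labeled and unlabeled index sets, C and C* weight the two loss terms (C* being the parameter that tunes the influence of unlabeled instances), and the constraint is the class-balance condition.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^2
  + C \sum_{i \in L} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)
  + C^{*} \sum_{j \in U} \max\bigl(0,\; 1 - \lvert w^\top x_j + b\rvert\bigr)
\quad \text{s.t.} \quad
\frac{1}{|U|} \sum_{j \in U} \operatorname{sign}(w^\top x_j + b)
  \;\approx\; \frac{1}{|L|} \sum_{i \in L} y_i
```

The absolute value in the unlabeled term is what makes the problem non-convex: an unlabeled point is penalized for lying inside the margin on either side of the boundary.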

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: sends data to the reducer in random order
• Reducer: updates the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
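
A single-machine sketch of the stochastic gradient descent procedure above follows. It is an illustration under stated assumptions, not the Hadoop implementation: the bias term and the class-balance constraint are omitted for brevity, unlabeled points are pseudo-labeled with the current prediction, and all names, parameters, and data are hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_run(labeled, unlabeled, c_u=0.5, reg=0.01, eta0=0.5, seed=0):
    """One pass of SGD over the shuffled data for a linear classifier sign(w.x)."""
    rng = random.Random(seed)
    stream = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(stream)  # one run over the data in random order
    w = [0.0] * len(stream[0][0])
    for t, (x, y) in enumerate(stream):
        eta = eta0 / (1.0 + t)  # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:
                continue  # no pseudo-label available yet
            y, weight = (1.0 if score > 0 else -1.0), c_u  # pseudo-label
        else:
            weight = 1.0
        w = [wi * (1.0 - eta * reg) for wi in w]  # L2 shrinkage
        if y * score < 1.0:  # misclassified or inside the margin
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
    return w


def objective(w, labeled, unlabeled, c_u=0.5, reg=0.01):
    """Regularizer + hinge loss on labeled + weighted hinge on |score| for unlabeled."""
    loss = 0.5 * reg * dot(w, w)
    loss += sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    loss += c_u * sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return loss


def best_of_runs(labeled, unlabeled, n_runs=10):
    """Many random runs; keep the classifier with the lowest objective."""
    runs = [sgd_run(labeled, unlabeled, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda w: objective(w, labeled, unlabeled))


labeled = [([2.0, 2.0], 1.0), ([-2.0, -2.0], -1.0)]
unlabeled = [[1.5, 2.5], [2.5, 1.5], [-1.5, -2.5], [-2.5, -1.5]]
w = best_of_runs(labeled, unlabeled)
```

In a MapReduce setting, `sgd_run` roughly corresponds to the reducer's update loop and the shuffle to the mapper's random ordering; picking the best of several runs is one way to cope with the non-convex objective.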

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
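
A similarity graph of the kind described above can be built in a few lines. This is a brute-force sketch; the Gaussian similarity, the choice of k, and the toy points are assumptions for illustration.

```python
import math


def knn_graph(points, k=2):
    """Connect each instance to its k nearest neighbors (undirected, weighted)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            weight = math.exp(-d * d)  # similarity score in (0, 1]
            graph[i][j] = weight
            graph[j][i] = weight  # make the edge undirected
    return graph


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
```

With these points the graph splits into two components, one per cluster. The all-pairs distance computation is quadratic in the number of instances, so large data sets would need an approximate nearest-neighbor method instead.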

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
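
The propagation loop can be sketched as below. This is one common variant, in which labeled nodes are clamped to their known value and every other node takes the weighted average of its neighbors; the graph, the ±1 label encoding, and the function name are assumptions for the example.

```python
def propagate_labels(graph, seeds, n_iter=100, tol=1e-6):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(n_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            if node in seeds:
                new_scores[node] = seeds[node]  # clamp known labels
            elif nbrs:
                total = sum(nbrs.values())
                new_scores[node] = sum(w * scores[m] for m, w in nbrs.items()) / total
            else:
                new_scores[node] = 0.0  # isolated node: no information
        change = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:  # repeat until convergence
            break
    return scores


# Hypothetical chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights.
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
scores = propagate_labels(graph, seeds={0: 1.0, 4: -1.0})
```

The fixed point of this iteration is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.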

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
[Images: Left Elevator, Right Elevator]

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden’s blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
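
As an illustration of the statistical test mentioned above, a two-sided z-test for the difference between two conversion rates can be sketched as follows. The counts are hypothetical, and the slides do not say which test the course uses; this is simply one standard choice for comparing proportions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
z_same, p_same = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=205, n_b=10000)
```

In the first comparison the p-value falls below 0.05, so the difference is unlikely to be due to chance alone; in the second it does not, even though B's raw conversion count is higher.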

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A | B

• A was 85 better

A | B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A | B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
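
A quick way to catch the default-value artifacts described above is to look for values that are implausibly over-represented. The sketch below is a hypothetical check: the 20% threshold and the sample data are assumptions, and a real threshold would depend on the field.

```python
from collections import Counter


def suspicious_defaults(values, max_share=0.2):
    """Return values whose share of all entries exceeds max_share."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > max_share}


# Hypothetical form data: many users typed a throwaway birth date.
birth_dates = (
    ["01/01/01"] * 40
    + ["11/11/11"] * 25
    + [f"03/{day:02d}/85" for day in range(1, 29)]
    + ["07/04/76"] * 7
)
flagged = suspicious_defaults(birth_dates)
```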

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

[Screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment]

http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Pipeline diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction]
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Pipeline diagram: Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split]
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
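
The evaluation step can be illustrated with a small k-fold cross-validation that reports a metric together with an interval. Everything here is an assumption for the sketch: the toy threshold model, the synthetic data, and the normal-approximation interval, which is only rough for a handful of folds.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy 1-D model: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2.0


def cross_validate(xs, ys, k=5):
    """Return mean accuracy and a rough 95% interval across k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        # Fit on the training folds only, so nothing from the test fold leaks in.
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in held_out)
        accuracies.append(correct / len(held_out))
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / len(accuracies) ** 0.5
    return mean, (mean - half_width, mean + half_width)


# Synthetic data: class 0 centered at 0, class 1 centered at 3.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100
mean_acc, (low, high) = cross_validate(xs, ys)
```

Any preprocessing that looks at the target, such as feature selection or scaling fitted on all rows, belongs inside the loop for the same reason the model fit does: computing it on the full data set is one way a target leak creeps in.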

[Pipeline diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target Construction; Feature Selection; Model Training; Model Scoring; Evaluation; Train/Test Split]
6. Iterate steps (2)–(5) until the test metrics are satisfactory.

[Scoring pipeline diagram: Access Data; Pre-processing; Feature Construction; Model Scoring]

Book Recommendation

That’s all for our course…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
[Figure callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment]
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Pipeline stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
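Step 5 can be sketched as follows. This is only an illustration under stated assumptions, not code from the course: the toy two-blob dataset, the nearest-centroid classifier, and the helper names `kfold_accuracy` and `mean_ci` are all hypothetical. The interval uses a normal approximation over fold scores, which is rough for five folds.

```python
import math
import random
import statistics

def fit_centroids(X, y):
    """Hypothetical stand-in model: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

def kfold_accuracy(X, y, k=5, seed=0):
    """Return one held-out accuracy per fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = fit_centroids([X[i] for i in train_idx],
                              [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores

def mean_ci(scores, z=1.96):
    """Mean and approximate 95% interval across folds."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

# Toy data (assumption): two Gaussian blobs, 100 points each.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

scores = kfold_accuracy(X, y)
mean, low, high = mean_ci(scores)
print(f"accuracy = {mean:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

If a metric like this comes out near-perfect on real data, that is the cue to look for a target leak before celebrating.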
Deriving Knowledge from Data at Scale
Pipeline stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split
6. Iterate steps (2)–(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Pipeline stages: Access Data; Pre-processing; Feature construction; Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
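The two EM steps can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration, not the slide author's implementation: the function name `em_gmm_1d`, the synthetic data, and the initialisation at the data extremes are assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns fitted parameters and responsibilities."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    resp = []
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate location, shape, and weight of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against collapse
    return mus, sigmas, weights, resp

# Synthetic data (assumption): two clusters centred at 0 and 5.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(200)] + [rng.gauss(5, 1) for _ in range(200)]

mus, sigmas, weights, resp = em_gmm_1d(
    data, mus=[min(data), max(data)], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print("means:", mus, "std devs:", sigmas, "weights:", weights)
```

A different starting point can end in a different solution, which is the local-optimum caveat from the slide; restarting from several initialisations is a common remedy.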
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
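The self-training loop above can be sketched as follows. The nearest-centroid base learner, the function names, and the tiny dataset are assumptions chosen to keep the example short; any classifier could take their place.

```python
import math

def fit_centroids(X, y):
    """Hypothetical base learner: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

def self_train(X_lab, y_lab, X_unlab, max_rounds=20):
    model = fit_centroids(X_lab, y_lab)              # learn on labeled instances only
    pseudo = None
    for _ in range(max_rounds):
        new_pseudo = [predict(model, x) for x in X_unlab]   # apply to unlabeled
        if new_pseudo == pseudo:                     # converged: labels stopped changing
            break
        pseudo = new_pseudo
        model = fit_centroids(X_lab + X_unlab, y_lab + pseudo)  # learn on all instances
    return model, pseudo

# Illustrative data (assumption): two labeled points, five unlabeled ones.
X_lab = [[0.0, 0.0], [4.0, 4.0]]
y_lab = ["a", "b"]
X_unlab = [[0.5, 0.2], [1.0, 0.8], [3.5, 3.9], [4.2, 3.1], [0.2, 1.1]]

model, pseudo = self_train(X_lab, y_lab, X_unlab)
print(pseudo)
```

Early mistakes on unlabeled instances get reinforced by this scheme, so in practice it is often restricted to high-confidence predictions.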
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
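A single-machine sketch of the stochastic gradient pass described above. It is an assumption-laden simplification rather than the Hadoop implementation: the function name `s3vm_sgd_pass`, the step-size schedule, the regularisation constant `lam`, and `unlabeled_weight` are all illustrative, and the class-balance constraint is left out.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd_pass(labeled, unlabeled, dim, lam=0.01, unlabeled_weight=0.2, seed=0):
    """One pass over the data in random order.

    labeled: list of (x, y) with y in {-1.0, +1.0}; unlabeled: list of x.
    """
    rng = random.Random(seed)
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(data)                         # random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(data, start=1):
        eta = 1.0 / (1.0 + 0.1 * t)           # steps get smaller over time
        score = dot(w, x)
        if y is None:
            # Unlabeled: push the instance away from the boundary, on the
            # side it currently falls on, if it lies inside the margin.
            if score == 0.0 or abs(score) >= 1.0:
                continue
            y, scale = (1.0 if score > 0 else -1.0), unlabeled_weight
        else:
            # Labeled: update only if misclassified or inside the margin.
            if y * score >= 1.0:
                continue
            scale = 1.0
        w = [(1.0 - eta * lam) * wi + eta * scale * y * xi for wi, xi in zip(w, x)]
    return w

# Illustrative data (assumption): two labeled points, sixty unlabeled ones.
rng = random.Random(1)
labeled = [([2.0, 1.5], 1.0), ([-1.5, -2.0], -1.0)]
unlabeled = [[rng.gauss(2, 0.3), rng.gauss(2, 0.3)] for _ in range(30)]
unlabeled += [[rng.gauss(-2, 0.3), rng.gauss(-2, 0.3)] for _ in range(30)]

w = s3vm_sgd_pass(labeled, unlabeled, dim=2)
print("weights:", w)
```

Because the objective is non-convex, the result depends on the visiting order; calling the function with several `seed` values and keeping the best run mirrors the "many random runs" bullet.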
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
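A similarity graph of this kind can be sketched with a brute-force k-nearest-neighbor search. Using Euclidean distance as the (inverse) similarity score, the function name `knn_graph`, and the six sample points are assumptions for illustration.

```python
import math

def knn_graph(points, k=2):
    """Return {index: set of neighbor indices}; edges are made symmetric."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:      # connect i to its k closest instances
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Illustrative points (assumption): two well-separated groups of three.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
graph = knn_graph(points, k=2)
for node, nbrs in graph.items():
    print(node, sorted(nbrs))
```

The brute-force search is quadratic in the number of instances, so at scale an approximate nearest-neighbor index would normally replace it.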
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
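The iteration can be sketched on a small hand-written graph. This is one simple variant (labeled nodes are clamped, unlabeled nodes take the average of their neighbors); the function name `propagate_labels`, the +1/-1 class encoding, and the chain graph are assumptions for illustration.

```python
def propagate_labels(graph, seeds, n_iter=200, tol=1e-6):
    """graph: {node: iterable of neighbors}; seeds: {node: +1.0 or -1.0}.

    Returns {node: score in [-1, 1]}; the sign is the predicted class.
    """
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(n_iter):
        new = {}
        for n, nbrs in graph.items():
            if n in seeds:
                new[n] = seeds[n]             # keep known labels fixed
            else:
                nbrs = list(nbrs)
                # Take the average score of the neighboring instances.
                new[n] = sum(scores[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
        delta = max(abs(new[n] - scores[n]) for n in graph)
        scores = new
        if delta < tol:                       # repeat until convergence
            break
    return scores

# Illustrative chain graph 0 - 1 - 2 - 3 - 4 with the two ends labeled.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
seeds = {0: 1.0, 4: -1.0}
scores = propagate_labels(graph, seeds)
print(scores)
```

The fixed point of this update is the solution of a linear system over the unlabeled nodes, which is why the slide describes the method as equivalent to a matrix inversion; the iteration avoids forming that inverse explicitly.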
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: GET THE DATA
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
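One common way to run such a test on conversion counts is a two-proportion z-test, sketched below. The slides do not say which test was used, so the choice of test, the function name `two_proportion_z_test`, and the counts are assumptions made up for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability
    return z, p_value

# Made-up counts: 200 of 10,000 users convert under A, 260 of 10,000 under B.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")

# Identical groups give no evidence of a difference.
z_same, p_same = two_proportion_z_test(200, 10_000, 200, 10_000)
```

A small p-value says the difference is unlikely to be due to chance alone; the causal reading comes from the random assignment of users to A and B, not from the test itself.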
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies on cadavers each morning
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussiansbull Assumption the data in each cluster is generated
by a normal distribution
bull Find most probable location and shape of clusters
given data
Expectation-Maximization
bull Two step optimization procedure
bull Keeps estimates of cluster assignment probabilities
for each instance
bull Might converge to local optimum
Deriving Knowledge from Data at Scale
BeyondMixtures of Gaussians
Expectation-Maximization
bull Can be adjusted to all kinds of mixture models
bull Eg use Naive Bayes as mixture model for text classification
Self-Training
bull Learn model on labeled instances only
bull Apply model to unlabeled instances
bull Learn new model on all instances
bull Repeat until convergence
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
The ManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifoldbull One can perform learning in a more meaningful
low-dimensional spacebull Avoids curse of dimensionality
Similarity Graphs
bull Idea compute similarity scores between instances
bull Create network where the nearest
neighbors are connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more meaningful
low-dimensional space
bull Avoids curse of dimensionality
Similarity Graphsbull Idea compute similarity scores between instances
bull Create a network where the nearest neighbors are
connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more
meaningful low-dimensional space
bull Avoids curse of dimensionality
SimilarityGraphs
bull Idea compute similarity scores between instances
bullCreate network where the nearest neighbors
are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learningbull Only few training instances have labels
bull Unlabeled instances can still provide valuable signal
Different assumptions lead to different approachesbull Cluster assumption generative models
bull Low density assumption semi-supervised support vector machines
bull Manifold assumption label propagation
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris | Measure and Control | Accept Results (avoid Semmelweis Reflex) | Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version...
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project...
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
[Figure: sample experiment with annotated stages: getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model]
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection | Model training | Model scoring | Evaluation | Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
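As a rough illustration of reporting a metric with an interval, the sketch below runs k-fold cross-validation on synthetic data and summarizes the fold accuracies. The toy threshold classifier, the data, and all function names are assumptions made for this example. Fold scores are not fully independent, so treat the interval as approximate.

```python
import math
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means.
    Assumes class 1 has the larger mean."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2

def cross_val_accuracy(xs, ys, k=10):
    """Return the held-out accuracy of each of the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit on the training folds only, so the held-out fold stays unseen.
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] > t) == (ys[i] == 1) for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Synthetic data: class 0 ~ N(0, 1), class 1 ~ N(2, 1).
rng = random.Random(42)
ys = [i % 2 for i in range(400)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

scores = cross_val_accuracy(xs, ys, k=10)
mean = statistics.mean(scores)
# Approximate 95% interval from the fold-to-fold spread (normal approximation).
half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

If the reported accuracy looks implausibly high for the problem, that is a cue to look for a target leak before celebrating.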
Deriving Knowledge from Data at Scale
Define Objective | Access and Understand the Data | Pre-processing | Feature and/or Target construction | Feature selection | Model training | Model scoring | Evaluation | Train/Test split
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data | Pre-processing | Feature construction | Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course...
Deriving Knowledge from Data at Scale
Generative Models
Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of clusters given the data
Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to a local optimum
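The two-step procedure can be sketched for a one-dimensional mixture of Gaussians as follows. This is a minimal illustration on synthetic data: the function names, the starting guesses, and the iteration count are all assumptions, and a different initialization may land in a different local optimum.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """Fit a 1-D mixture of Gaussians with EM, starting from the given guesses."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: cluster assignment probabilities for each instance.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, location, and shape of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against a collapsing cluster
    return mus, sigmas, weights

# Synthetic data drawn from two normal distributions (means 0 and 5).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(300)]

mus, sigmas, weights = em_gmm_1d(xs, mus=[1.0, 4.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, sigmas, weights)
```

In the semi-supervised setting, labeled instances would have their assignment probabilities fixed to the known class, while unlabeled ones are updated in the E-step.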
Deriving Knowledge from Data at Scale
Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as mixture model for text classification
Self-Training
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
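The self-training loop above might look like the following sketch. The base model here is a toy one-dimensional nearest-centroid classifier chosen only to keep the example short; the function names and data are hypothetical, and any classifier could take its place.

```python
import statistics

def fit_centroids(xs, ys):
    """Toy model: one centroid (mean) per class."""
    return {c: statistics.mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def self_train(labeled_x, labeled_y, unlabeled_x, max_rounds=20):
    # Learn a model on the labeled instances only.
    model = fit_centroids(labeled_x, labeled_y)
    pseudo = None
    for _ in range(max_rounds):
        # Apply the current model to the unlabeled instances.
        new_pseudo = [predict(model, x) for x in unlabeled_x]
        # Converged: the pseudo-labels no longer change.
        if new_pseudo == pseudo:
            break
        pseudo = new_pseudo
        # Learn a new model on all instances (true labels plus pseudo-labels).
        model = fit_centroids(labeled_x + unlabeled_x, labeled_y + pseudo)
    return model

labeled_x, labeled_y = [0.0, 4.0], [0, 1]
unlabeled_x = [-1.0, -0.5, 0.5, 1.0, 1.5, 4.5, 5.0, 5.5, 6.0]
model = self_train(labeled_x, labeled_y, unlabeled_x)
print(model)
```

The unlabeled points pull each centroid toward the bulk of its cluster, which is the signal the two labeled points alone cannot provide.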
Deriving Knowledge from Data at Scale
The Low Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low Density Assumption
Semi-Supervised SVM
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
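For reference, one commonly used textbook formulation of a semi-supervised (transductive) SVM is written out below. It is offered as an assumption about what the slide summarizes, not as the lecturer's exact objective. The symbols are introduced here: l labeled instances (x_i, y_i), u unlabeled instances x_j, and weights C and C* for the two loss terms.

```latex
\min_{w,\,b}\;\; \frac{1}{2}\lVert w\rVert^{2}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  \;+\; C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j)\rvert\bigr),
\qquad f(x) = w^{\top} x + b

\text{subject to}\quad
  \frac{1}{u}\sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l}\sum_{i=1}^{l} y_i
```

The first two terms are the ordinary soft-margin SVM. The third term penalizes unlabeled instances that fall inside the margin on either side, with C* controlling their influence. The absolute value in that term is what makes the problem non-convex, and the constraint is one way to express the class-balance requirement.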
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
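A single-machine sketch of the stochastic gradient descent pass is shown below; it does not attempt the Hadoop mapper/reducer split. Every name and constant is an assumption for illustration (`c_unlabeled`, `lam`, `eta0`, the 1/sqrt(t) step schedule), and so is the choice to skip unlabeled points whose score is exactly zero, since they give no direction to move in.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, c_unlabeled=0.5, lam=0.01, eta0=0.5, seed=0):
    """One SGD pass over labeled (x, y) pairs (y is +1.0 or -1.0) and unlabeled x."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)  # one run over the data in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data, start=1):
        eta = eta0 / math.sqrt(t)  # steps get smaller over time
        score = dot(w, x) + b
        if y is None:
            if score == 0.0:
                continue  # no side to push towards yet
            # Unlabeled: treat the currently predicted side as the label.
            y, weight = (1.0 if score > 0 else -1.0), c_unlabeled
        else:
            weight = 1.0
        # Regularization shrink, then a hinge-loss step if inside the margin.
        w = [wi * (1.0 - eta * lam) for wi in w]
        if y * score < 1.0:
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
            b += eta * weight * y
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Synthetic, well-separated clusters with one labeled point each.
rng = random.Random(1)
pos = [(rng.uniform(1.5, 2.5), rng.uniform(1.5, 2.5)) for _ in range(20)]
neg = [(-rng.uniform(1.5, 2.5), -rng.uniform(1.5, 2.5)) for _ in range(20)]
labeled = [((2.0, 2.0), 1.0), ((-2.0, -2.0), -1.0)]

w, b = sgd_s3vm(labeled, pos + neg)
print(w, b, predict(w, b, (2.0, 2.0)), predict(w, b, (-2.0, -2.0)))
```

On harder data a single pass can settle on a poor boundary, which is why the slide suggests many random runs and keeping the best one.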
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
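The sketch below builds a nearest-neighbor similarity graph over one-dimensional points and then propagates labels by repeated neighbor averaging, with labeled nodes clamped. The function names, the value of k, the scores of -1/+1, and the data are all assumptions for illustration.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as a dict: node -> set of neighbors."""
    n = len(points)
    nbrs = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(points[i] - points[j]))
        for j in others[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def label_propagation(nbrs, seeds, n_iter=200, tol=1e-9):
    """seeds maps labeled node -> score (-1.0 or +1.0); returns a score per node."""
    f = {i: seeds.get(i, 0.0) for i in nbrs}
    for _ in range(n_iter):
        new_f = {}
        for i in nbrs:
            if i in seeds:
                new_f[i] = seeds[i]  # labeled nodes keep their label
            else:
                # Unlabeled nodes take the average score of their neighbors.
                new_f[i] = sum(f[j] for j in nbrs[i]) / len(nbrs[i])
        delta = max(abs(new_f[i] - f[i]) for i in nbrs)
        f = new_f
        if delta < tol:  # repeat until convergence
            break
    return f

points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
graph = knn_graph(points, k=2)
scores = label_propagation(graph, seeds={0: -1.0, 7: 1.0})
print({i: round(s, 3) for i, s in scores.items()})
```

The fixed point of this iteration can also be written in closed form as f_u = (I - P_uu)^{-1} P_ul y_l, where P is the row-normalized adjacency matrix split into unlabeled (u) and labeled (l) blocks. That closed form is the matrix-inversion view mentioned on the slide.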
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break...
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A | B

• A was 85 better

A | B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

A | B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
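A quick sanity check for the first two examples is to look for values that take up a suspiciously large share of a column. This is a minimal sketch; the function name, the threshold, and the sample column are all made up for illustration.

```python
from collections import Counter

def dominant_values(values, threshold=0.2):
    """Return (value, share) pairs whose share of the column exceeds threshold."""
    counts = Counter(values)
    total = len(values)
    return [(v, c / total) for v, c in counts.most_common() if c / total > threshold]

# Hypothetical birth-date column; the data is invented for illustration.
birth_dates = ["01/01/01", "03/14/82", "01/01/01", "11/11/11",
               "07/30/75", "01/01/01", "11/11/11", "01/01/01"]
print(dominant_values(birth_dates))
```

Values flagged this way are candidates for default or placeholder entries and are worth inspecting before they feed a metric.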

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

“It is difficult to get a man to understand something when his salary depends upon his not understanding it.”
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936; 44(2)
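To see how two unrelated quantities can correlate strongly, consider two series that both simply grow over time. The sketch below uses invented numbers (not the Oldenburg data): the raw series correlate highly, but the year-over-year changes do not.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def diffs(xs):
    """Year-over-year changes, which remove the shared upward trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

# Hypothetical yearly counts (storks; humans in thousands). Both trend upward.
storks = [130, 130, 150, 150, 170, 170, 190, 190]
humans = [55, 65, 65, 75, 95, 105, 105, 115]

print("raw correlation:    ", round(pearson(storks, humans), 3))
print("changes correlation:", round(pearson(diffs(storks), diffs(humans)), 3))
```

The shared trend (time) acts as a confounder here; a controlled experiment removes such confounders by randomizing the treatment.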

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

“If you don’t know where you are going, any road will take you there.”
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
• Eight (8) page conference paper
• 40-page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms

[Figure: Azure ML experiment with callouts: Getting data for the experiment; Splitting into training and testing datasets; Using classification algorithms; Evaluating the model]
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Diagram: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
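Step 5 can be sketched as k-fold cross-validation that reports a mean accuracy with a rough interval. This is a minimal illustration under stated assumptions: the toy 1-D nearest-centroid classifier, the function names, and the data are hypothetical, and the normal-approximation interval is only a rough guide because folds share training data.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy 1-D nearest-centroid classifier: one mean per class."""
    cents = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        cents[label] = sum(pts) / len(pts)
    return cents

def predict(cents, x):
    return min(cents, key=lambda label: abs(x - cents[label]))

def cross_validate(xs, ys, k=5, seed=0):
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        # Fit on the training folds only, so the held-out fold stays unseen.
        cents = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(cents, xs[i]) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    mean = sum(scores) / k
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (k - 1))
    half_width = 1.96 * std / math.sqrt(k)  # rough normal-approximation interval
    return mean, (mean - half_width, mean + half_width)

# Hypothetical, well-separated data: class 0 in 0..9, class 1 in 20..29.
xs = [float(v) for v in range(10)] + [float(v) for v in range(20, 30)]
ys = [0] * 10 + [1] * 10
mean, (lo, hi) = cross_validate(xs, ys, k=5)
print(f"accuracy = {mean:.2f}, approx. 95% CI = ({lo:.2f}, {hi:.2f})")
```

If a real problem produced a perfect score like this toy data does, that would be a reason to look for a target leak rather than to celebrate.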

[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2) through (5) until the test metrics are satisfactory.

[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That’s all for our course…

Beyond Mixtures of Gaussians
Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g., use Naive Bayes as the mixture model for text classification
Self-Training
• Learn a model on labeled instances only
• Apply the model to unlabeled instances
• Learn a new model on all instances
• Repeat until convergence
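The self-training loop above can be sketched as follows. The base learner here is a toy 1-D nearest-centroid model chosen only to keep the example short; the function names and data are hypothetical, and any classifier could take its place.

```python
def fit_centroids(points):
    """points: list of (x, label). Returns {label: mean x}, a toy 1-D model."""
    sums, counts = {}, {}
    for x, label in points:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

def self_train(labeled, unlabeled, max_rounds=20):
    model = fit_centroids(labeled)                        # learn on labeled only
    previous = None
    for _ in range(max_rounds):
        guesses = [predict(model, x) for x in unlabeled]  # apply to unlabeled
        if guesses == previous:                           # guesses stable: converged
            break
        # Learn a new model on labeled plus pseudo-labeled instances.
        model = fit_centroids(labeled + list(zip(unlabeled, guesses)))
        previous = guesses
    return model

# Hypothetical data: two labeled points and five unlabeled ones.
labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 3.0, 8.0, 9.0]
print(self_train(labeled, unlabeled))
```

Note that early wrong guesses can reinforce themselves, so in practice one often adds only the most confident pseudo-labels in each round.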

The Low-Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances

The Low-Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
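One common way to write the semi-supervised SVM objective adds a "hat" loss for unlabeled points to the usual hinge loss. The sketch below assumes that formulation; the names `C` and `C_star` (the parameter controlling the influence of unlabeled instances) are illustrative, and the class-balance constraint is omitted for brevity.

```python
def s3vm_objective(w, b, labeled, unlabeled, C=1.0, C_star=0.1):
    """labeled: list of (x, y) with y in {-1, +1}; unlabeled: list of x."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    margin_term = 0.5 * sum(wi * wi for wi in w)
    # Standard hinge loss on labeled instances.
    labeled_loss = sum(max(0.0, 1.0 - y * f(x)) for x, y in labeled)
    # Hat loss: an unlabeled point is penalized for lying inside the margin,
    # whichever side it is on. This term is what makes the problem non-convex.
    unlabeled_loss = sum(max(0.0, 1.0 - abs(f(x))) for x in unlabeled)
    return margin_term + C * labeled_loss + C_star * unlabeled_loss

# Hypothetical 2-D example: one unlabeled point sits inside the margin.
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(0.5, 0.0)]
print(s3vm_objective([1.0, 0.0], 0.0, labeled, unlabeled))
```

Setting `C_star` to zero recovers the ordinary supervised SVM objective, which is one way to see why this is described as a simple extension.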

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
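The procedure above can be sketched on a single machine as follows. This is not the Hadoop implementation from the slides: it is a minimal illustration that assumes a linear classifier without a bias term, a 1/t step-size schedule, and hypothetical parameter names (`lam`, `c_star`, `lr0`); the class-balance constraint is left out.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_s3vm(labeled, unlabeled, dim, seed, lam=0.01, c_star=0.5, lr0=0.1):
    """One pass of SGD for a linear semi-supervised SVM (no bias term)."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)            # one run in random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(data, start=1):
        lr = lr0 / t                             # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:
                continue                         # no side to push toward yet
            # Unlabeled: use the current prediction as a pseudo-label.
            y, weight = (1 if score > 0 else -1), c_star
        else:
            weight = 1.0
        w = [wi * (1.0 - lr * lam) for wi in w]  # regularization shrink
        if y * score < 1.0:                      # misclassified or inside margin
            w = [wi + lr * weight * y * xi for wi, xi in zip(w, x)]
    return w

def objective(w, labeled, unlabeled, lam=0.01, c_star=0.5):
    reg = 0.5 * lam * dot(w, w)
    lab = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in labeled)
    unl = sum(max(0.0, 1.0 - abs(dot(w, x))) for x in unlabeled)
    return reg + lab + c_star * unl

def best_of_runs(labeled, unlabeled, dim, runs=10):
    """Many random runs; keep the one with the lowest objective."""
    candidates = [sgd_s3vm(labeled, unlabeled, dim, seed) for seed in range(runs)]
    return min(candidates, key=lambda w: objective(w, labeled, unlabeled))

# Hypothetical 2-D data: two labeled points and four unlabeled ones.
labeled = [((2.0, 0.0), 1), ((-2.0, 0.0), -1)]
unlabeled = [(2.5, 0.5), (3.0, -0.5), (-2.5, 0.5), (-3.0, -0.5)]
w = best_of_runs(labeled, unlabeled, dim=2)
print(w)
```

Because the objective is non-convex, different random orders can end in different solutions, which is why several runs are compared.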

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
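A similarity graph of this kind can be sketched as a k-nearest-neighbor graph. The example below assumes Euclidean distance as the (inverse) similarity score and makes edges symmetric; both choices, and the function name, are illustrative.

```python
import math

def knn_graph(points, k):
    """Return {i: set of neighbor indices}; edges are made symmetric."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Smaller Euclidean distance means higher similarity.
        others.sort(key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Hypothetical points forming two groups along a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(knn_graph(points, k=1))
```

This brute-force version compares every pair of points, which is quadratic in the number of instances; at scale one would typically use an approximate nearest-neighbor method instead.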

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
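The iteration can be sketched for a two-class problem as repeated neighbor averaging with the known labels held fixed. This is one simple variant, assumed here for illustration; the function name, the score encoding in [-1, +1], and the stopping tolerance are hypothetical.

```python
def propagate_labels(graph, seeds, max_iters=1000, tol=1e-9):
    """graph: {node: set of neighbors}; seeds: {node: score in [-1, +1]}."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(max_iters):
        new_scores = {}
        for node, neighbors in graph.items():
            if node in seeds:
                new_scores[node] = seeds[node]   # clamp known labels
            elif neighbors:
                # Unlabeled node: take the average score of its neighbors.
                new_scores[node] = sum(scores[n] for n in neighbors) / len(neighbors)
            else:
                new_scores[node] = scores[node]  # isolated node keeps its score
        change = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:                         # repeat until convergence
            break
    return scores

# Hypothetical chain 0-1-2-3-4 with labeled endpoints (+1 and -1).
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(propagate_labels(chain, seeds={0: 1.0, 4: -1.0}))
```

The fixed point of this averaging satisfies a linear system in the unlabeled scores, so it could also be obtained by solving that system directly rather than iterating.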
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Assumption
bull The area between the two classes has low density
bull Does not assume any specific form of cluster
Support Vector Machine
bull Decision boundary is linear
bull Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
The Low DensityAssumption
Semi-Supervised SVMbull Minimize distance to labeled and
unlabeled instancesbull Parameter to fine-tune influence of
unlabeled instancesbull Additional constraint keep class balance correct
Implementationbull Simple extension of SVM
bull But non-convex optimization problem
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
The ManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifoldbull One can perform learning in a more meaningful
low-dimensional spacebull Avoids curse of dimensionality
Similarity Graphs
bull Idea compute similarity scores between instances
bull Create network where the nearest
neighbors are connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more meaningful
low-dimensional space
bull Avoids curse of dimensionality
Similarity Graphsbull Idea compute similarity scores between instances
bull Create a network where the nearest neighbors are
connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more
meaningful low-dimensional space
bull Avoids curse of dimensionality
SimilarityGraphs
bull Idea compute similarity scores between instances
bullCreate network where the nearest neighbors
are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learningbull Only few training instances have labels
bull Unlabeled instances can still provide valuable signal
Different assumptions lead to different approachesbull Cluster assumption generative models
bull Low density assumption semi-supervised support vector machines
bull Manifold assumption label propagation
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
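A quick frequency check often exposes the placeholder values described in these examples. The following is a rough sketch only: the helper name `suspicious_values`, the 5% threshold, and the sample birth dates are illustrative assumptions, not part of the course material.

```python
from collections import Counter


def suspicious_values(values, max_share=0.05):
    """Return values whose share of the field exceeds max_share.

    In a field that should be spread out (birth dates, free-form job titles),
    a single value holding a large share is usually a default or a
    placeholder rather than real data.
    """
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > max_share}


# Hypothetical sample: 30 placeholder dates hidden among 70 distinct real ones.
birth_dates = ["01/01/01"] * 30 + [
    f"{day:02d}/{month:02d}/80" for month in range(1, 8) for day in range(1, 11)
]
flagged = suspicious_values(birth_dates)
print(flagged)  # only "01/01/01" is flagged
```

The threshold depends on the field. For a column with only a handful of legitimate values, the check needs a baseline of what shares are expected.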
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1840s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime was effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
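Step 5 can be illustrated with a small k-fold cross-validation loop that reports a mean accuracy together with a rough interval. This is a sketch under assumptions: the helper names, the toy one-dimensional "midpoint" classifier, and the synthetic data are invented for the example, and the normal-approximation interval over fold scores is only a crude indication because folds share training data.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return mean accuracy and an approximate 95% interval over k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training part so the held-out fold stays unseen.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = statistics.mean(scores)
    # Rough normal approximation; the upper bound may exceed 1.0.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width)


def fit_midpoint(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2


def predict_midpoint(threshold, x):
    """Predict class 1 above the threshold (assumes class 1 has the larger mean)."""
    return 1 if x > threshold else 0


# Synthetic, well-separated data: class 0 around 0.0, class 1 around 5.0.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(5.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50
mean_acc, (low, high) = cross_validate(xs, ys, fit_midpoint, predict_midpoint)
print(f"accuracy = {mean_acc:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

Fitting inside the loop, on training indices only, is what guards against the simplest kind of leak. Any preprocessing that looks at the target, such as feature selection, belongs inside the loop for the same reason.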
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2) to (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature construction
Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That’s all for our course…
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster
Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
Deriving Knowledge from Data at Scale
The Low-Density Assumption
Semi-Supervised SVM
• Maximize the margin to both labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem
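For reference, one commonly used way to write the semi-supervised SVM objective is given below. The notation is an assumption made for this sketch and does not come from the slides: $l$ labeled pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, $u$ unlabeled points $x_j$, a linear score $f(x) = w^\top x + b$, and trade-off parameters $C$ and $C^{*}$, where $C^{*}$ is the parameter that tunes the influence of the unlabeled instances.

```latex
\min_{w,\,b} \;\; \frac{1}{2}\lVert w \rVert^{2}
  \;+\; C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i\, f(x_i)\bigr)
  \;+\; C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr)
\qquad \text{subject to} \qquad
\frac{1}{u} \sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l} \sum_{i=1}^{l} y_i
```

The first sum is the usual hinge loss on labeled points. The second penalizes unlabeled points that fall inside the margin on either side, and the absolute value in it is what makes the problem non-convex. The constraint is one way of expressing the class-balance requirement.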
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
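The single-pass update described on this slide can be sketched on one machine as follows. It is a simplified illustration, not the Hadoop implementation: the function name `sgd_s3vm`, the 1/sqrt(t) step size, the default parameter values, and the toy data are assumptions, and the sketch omits both the bias term and the class-balance constraint to stay short.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_s3vm(labeled, unlabeled, dim, lam=0.01, unlabeled_weight=0.1, seed=0):
    """One stochastic-gradient pass for a linear semi-supervised SVM.

    labeled:   list of (x, y) pairs with y in {-1, +1}
    unlabeled: list of feature vectors x
    Returns the weight vector w of a classifier sign(w . x) through the origin.
    """
    rng = random.Random(seed)
    data = list(labeled) + [(x, None) for x in unlabeled]
    rng.shuffle(data)  # one run over the data in random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(data, start=1):
        step = 1.0 / math.sqrt(t)  # steps get smaller over time
        score = dot(w, x)
        if y is None:
            # Unlabeled: use the current prediction as a tentative label,
            # which pushes the point away from the decision boundary.
            target = 1.0 if score >= 0 else -1.0
            weight = unlabeled_weight
        else:
            target, weight = float(y), 1.0
        # Shrink w (regularization), then move it if the point is
        # misclassified or lies inside the margin.
        w = [(1.0 - step * lam) * wi for wi in w]
        if target * score < 1.0:
            w = [wi + step * weight * target * xi for wi, xi in zip(w, x)]
    return w


# Toy data: two clusters, a few labeled points and a few unlabeled ones.
labeled_pts = [([2.0, 2.0], 1), ([3.0, 2.0], 1), ([-2.0, -2.0], -1), ([-3.0, -2.0], -1)]
unlabeled_pts = [[2.0, 3.0], [3.0, 3.0], [-2.0, -3.0], [-3.0, -3.0]]
w_learned = sgd_s3vm(labeled_pts, unlabeled_pts, dim=2)
print([1 if dot(w_learned, x) >= 0 else -1 for x, _ in labeled_pts])
```

Because the objective is non-convex, different random orders can end in different classifiers. The "many random runs" bullet corresponds to calling this with several seeds and keeping the best result.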
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
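A similarity graph of this kind can be built with a brute-force nearest-neighbor search. The sketch below is illustrative only: the function name `knn_graph`, the Gaussian similarity with bandwidth `sigma`, and the sample points are assumptions, and the quadratic search would need to be replaced by an approximate or distributed method at scale.

```python
import math


def knn_graph(points, k=2, sigma=1.0):
    """Build a symmetric k-nearest-neighbor similarity matrix.

    Each point is connected to its k nearest neighbors (Euclidean distance),
    and an edge carries the Gaussian similarity exp(-d^2 / (2 * sigma^2)).
    """
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            sim = math.exp(-d * d / (2 * sigma ** 2))
            # Store the edge in both directions to keep the graph undirected.
            weights[i][j] = sim
            weights[j][i] = sim
    return weights


# Two small, well-separated groups of points.
sample_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_graph(sample_points, k=1)
print(graph[0])  # point 0 is linked only to its neighbor, point 1
```

The resulting matrix is the kind of input a graph-based method such as label propagation works on.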
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
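
One way to look for the flaw in the first two examples is to check whether a few values account for an implausible share of a field. This is a sketch under assumptions: the 5% threshold, the function name, and the sample data are hypothetical, not taken from the slides.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs that cover at least min_share of all rows."""
    counts = Counter(values)
    total = len(values)
    return [(v, c / total) for v, c in counts.most_common() if c / total >= min_share]


# Hypothetical birth dates: two lazy entries dominate, real dates are spread out.
birth_dates = (
    ["11/11/11"] * 30
    + ["01/01/01"] * 20
    + [f"03/{day:02d}/85" for day in range(1, 29)] * 5
)

for value, share in suspicious_values(birth_dates):
    print(f"{value}: {share:.1%} of rows")
```

The same check on a drop-down field would surface a default such as "Astronaut" as an outsized share.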

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%

Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
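
A small simulation of how a strong correlation can arise with no causal link, assuming a hidden third factor drives both series. All numbers are synthetic and the variable names are hypothetical; this is not the Oldenburg data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Hypothetical hidden factor (for example, how much a region grew) behind both series.
growth = [rng.uniform(0, 10) for _ in range(200)]
storks = [5 * g + rng.gauss(0, 3) for g in growth]
people = [1000 * g + rng.gauss(0, 800) for g in growth]

r_raw = pearson(storks, people)

# Remove the hidden factor's contribution (known here because the data is simulated).
storks_resid = [s - 5 * g for s, g in zip(storks, growth)]
people_resid = [p - 1000 * g for p, g in zip(people, growth)]
r_resid = pearson(storks_resid, people_resid)

print(f"raw correlation: {r_raw:.2f}, after removing the hidden factor: {r_resid:.2f}")
```

In real data the hidden factor is usually unknown, which is why a randomized controlled experiment is needed to establish causality.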

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection | Model training | Model scoring | Evaluation | Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
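
A sketch of reporting a metric with an interval from cross-validation, as step 5 suggests. The nearest-centroid model, the synthetic data, and all function names are assumptions made to keep the example self-contained; the interval is a rough normal approximation over fold scores, which are not fully independent.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def cross_validate(X, y, k=5):
    """Return mean accuracy, a rough 95% half-width, and the per-fold scores."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train only on the other folds so the test fold stays unseen.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_centroids([X[j] for j in train_idx], [y[j] for j in train_idx])
        hits = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width, scores


# Hypothetical data: two overlapping Gaussian classes in two dimensions.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(100)]
y = [0] * 100 + [1] * 100

mean_acc, half_width, fold_scores = cross_validate(X, y, k=5)
print(f"accuracy = {mean_acc:.3f} +/- {half_width:.3f}")
```

If a metric comes back near perfect on data like this, that is the cue from step 5 to look for a target leak before celebrating.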

Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
Feature selection | Model training | Model scoring | Evaluation | Train/Test split
6. Iterate steps (2) – (5) until the test metrics are satisfactory

Access Data → Pre-processing → Feature construction → Model scoring

Book Recommendation

That’s all for our course…

The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance from the decision boundary) to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct
Implementation
• Simple extension of SVM
• But non-convex optimization problem

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
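
The update described on this slide can be sketched on a single machine as follows. This is an illustration under assumptions, not the Hadoop implementation: it uses hinge loss for labeled instances and a "hat" loss for unlabeled ones, omits the class-balance constraint, and the names `s3vm_sgd`, `best_of_runs`, `lam`, `lam_u`, and `eta0` are hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def objective(w, b, labeled, unlabeled, lam, lam_u):
    """L2 regularizer + hinge loss (labeled) + weighted hat loss (unlabeled)."""
    hinge = sum(max(0.0, 1 - y * (dot(w, x) + b)) for x, y in labeled)
    hat = sum(max(0.0, 1 - abs(dot(w, x) + b)) for x in unlabeled)
    return 0.5 * lam * dot(w, w) + hinge + lam_u * hat


def s3vm_sgd(labeled, unlabeled, lam=0.01, lam_u=0.5, eta0=0.5, seed=0):
    """One pass over labeled (x, y) pairs, y in {-1, +1}, and unlabeled x vectors."""
    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    random.Random(seed).shuffle(data)            # one run in random order
    w, b = [0.0] * len(data[0][0]), 0.0
    for t, (x, y) in enumerate(data, start=1):
        eta = eta0 / t ** 0.5                    # steps get smaller over time
        f = dot(w, x) + b
        w = [wi * (1 - eta * lam) for wi in w]   # shrinkage from the regularizer
        if y is not None:
            if y * f >= 1:                       # labeled and outside the margin
                continue
            direction, weight = y, 1.0
        else:
            if abs(f) >= 1 or f == 0:            # outside the margin, or no side yet
                continue
            direction, weight = (1 if f > 0 else -1), lam_u
        # Move the classifier a bit toward putting x outside the margin.
        w = [wi + eta * weight * direction * xi for wi, xi in zip(w, x)]
        b += eta * weight * direction
    return w, b


def best_of_runs(labeled, unlabeled, n_runs=5, lam=0.01, lam_u=0.5):
    """The objective is non-convex, so keep the run with the lowest objective."""
    runs = [s3vm_sgd(labeled, unlabeled, lam=lam, lam_u=lam_u, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda wb: objective(wb[0], wb[1], labeled, unlabeled, lam, lam_u))


# Hypothetical toy data: two well-separated blobs, 3 labels per class.
rng = random.Random(42)


def blob(center, n):
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(n)]


labeled = [(x, +1) for x in blob(3, 3)] + [(x, -1) for x in blob(-3, 3)]
pos_unlabeled, neg_unlabeled = blob(3, 100), blob(-3, 100)
w, b = best_of_runs(labeled, pos_unlabeled + neg_unlabeled)
correct = sum(dot(w, x) + b > 0 for x in pos_unlabeled)
correct += sum(dot(w, x) + b < 0 for x in neg_unlabeled)
print(f"accuracy on the unlabeled points: {correct / 200:.2f}")
```

Here `lam_u` plays the role of the parameter that tunes the influence of unlabeled instances, and `best_of_runs` mirrors the "many random runs" bullet.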

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
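
A minimal sketch of the similarity-graph idea, assuming Euclidean distance, a Gaussian similarity with bandwidth 1, and a brute-force neighbor search. The slides do not specify any of these choices, and the function name is hypothetical.

```python
import math


def knn_graph(points, k=2):
    """Return {i: {j: similarity}} linking each point to its k nearest neighbors."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        # Brute-force search; fine for a toy example, too slow at scale.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            similarity = math.exp(-d * d)   # assumed Gaussian similarity score
            graph[i][j] = similarity
            graph[j][i] = similarity        # keep the graph symmetric
    return graph


# Hypothetical points forming two small groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, sorted(neighbors))
```

With these points the two groups end up as separate connected components, which is the structure that graph-based methods exploit.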

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
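
A sketch of the iteration described above, assuming the variant in which labeled nodes are clamped and every other node takes the weighted average of its neighbors' scores. The graph, the seed labels, and the function name are hypothetical. The fixed point of this iteration solves a linear system, which is one way to read the "matrix inversion" remark.

```python
def propagate_labels(graph, seeds, n_classes, tol=1e-6, max_iter=1000):
    """graph: {node: {neighbor: weight}}; seeds: {node: class index}."""
    scores = {i: [0.0] * n_classes for i in graph}
    for node, label in seeds.items():
        scores[node][label] = 1.0
    for _ in range(max_iter):
        new_scores = {}
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            if node in seeds or total == 0:
                new_scores[node] = scores[node]   # clamp seeds, skip isolated nodes
                continue
            # Weighted average of the neighbors' current class scores.
            new_scores[node] = [
                sum(w * scores[j][c] for j, w in neighbors.items()) / total
                for c in range(n_classes)
            ]
        change = max(
            abs(a - b) for i in graph for a, b in zip(new_scores[i], scores[i])
        )
        scores = new_scores
        if change < tol:                          # repeat until convergence
            break
    return {i: max(range(n_classes), key=lambda c: scores[i][c]) for i in graph}


# Hypothetical chain 0-1-2-3-4-5 with a weak link between nodes 2 and 3.
graph = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 0.1},
    3: {2: 0.1, 4: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {4: 1.0},
}
labels = propagate_labels(graph, seeds={0: 0, 5: 1}, n_classes=2)
print(labels)
```

Only the two end nodes carry labels; the weak link is where the propagated labels change from one class to the other.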

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Source: Ornithologische Monatsberichte 1936; 44(2)

Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
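The palm-size slide describes a correlation produced by a lurking variable (sex) rather than by a causal link. A minimal simulation sketch of that effect follows. The palm widths, lifespans, and spreads are made-up illustrative numbers, and the helper names `pearson` and `simulate` are hypothetical; only the 6-year gap comes from the slide.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def simulate(n=2000, seed=0):
    """Draw (female, palm_cm, lifespan_years); palm and lifespan depend only on sex."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        female = rng.random() < 0.5
        palm = rng.gauss(7.5 if female else 8.5, 0.4)        # assumed palm widths
        lifespan = rng.gauss(81.0 if female else 75.0, 8.0)  # 6-year gap, as on the slide
        rows.append((female, palm, lifespan))
    return rows

rows = simulate()
women = [r for r in rows if r[0]]
men = [r for r in rows if not r[0]]

# Pooled data: palm size looks "predictive" of lifespan.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])
# Within each sex the relationship disappears, because sex drives both variables.
within_women = pearson([r[1] for r in women], [r[2] for r in women])
within_men = pearson([r[1] for r in men], [r[2] for r in men])

print(f"overall r = {overall:.2f}")
print(f"within women r = {within_women:.2f}, within men r = {within_men:.2f}")
```

In the pooled sample the correlation is clearly negative, while inside each group it is close to zero: changing palm size would change nothing, which is why only a controlled experiment can establish causality.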

"If you don't know where you are going, any road will take you there."
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...

Out of Class Reading
Eight (8) page conference paper
40 page journal version...

Course Project: Due Oct 25th

Open Discussion on Course Project...

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
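Step 5 can be sketched as k-fold cross-validation that reports a metric together with an interval instead of a single number. This is a minimal illustration, not the course's method: the synthetic data, the nearest-centroid classifier, and the names `make_data`, `fit_centroids`, `predict`, and `cross_validate` are all assumptions. The interval treats the five fold scores as independent, which is only approximately true.

```python
import random
from statistics import mean, stdev

def make_data(n=200, seed=0):
    """Synthetic two-class data: class 1 centered at (1, 1), class 0 at (-1, -1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def fit_centroids(train):
    """Train step: compute the mean feature vector of each class."""
    cents = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return cents

def predict(cents, x):
    """Score step: assign the class whose centroid is closest."""
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, cents[lab])))

def cross_validate(data, k=5, seed=0):
    """Return one accuracy per fold; each fold is held out exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        cents = fit_centroids(train)  # fitted on training folds only, so no leak
        scores.append(mean(1.0 if predict(cents, x) == y else 0.0 for x, y in test))
    return scores

scores = cross_validate(make_data())
T_CRIT = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom
half_width = T_CRIT * stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {mean(scores):.3f} +/- {half_width:.3f}")
```

Anything learned from the data (here, the centroids) is fitted inside the loop on the training folds only; computing it on the full dataset first is one common way a target leak produces unbelievably good test metrics.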

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course...

The Low Density Assumption
Semi-Supervised SVM
• Maximize the margin (distance of the decision boundary) to labeled and unlabeled instances
• Parameter to fine-tune the influence of unlabeled instances
• Additional constraint: keep the class balance correct
Implementation
• Simple extension of the SVM
• But a non-convex optimization problem
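For reference, one common way to write this objective is the standard S3VM formulation from the literature; the deck itself does not give a formula, so the notation here is an assumption. With $l$ labeled pairs $(x_i, y_i)$, $u$ unlabeled points $x_j$, and a linear scorer $f(x) = w^\top x + b$:

```latex
\min_{w,\,b}\;
  \frac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i=1}^{l} \max\bigl(0,\; 1 - y_i f(x_i)\bigr)
  + C^{*} \sum_{j=l+1}^{l+u} \max\bigl(0,\; 1 - \lvert f(x_j) \rvert\bigr)
\quad \text{subject to} \quad
  \frac{1}{u} \sum_{j=l+1}^{l+u} f(x_j) \;=\; \frac{1}{l} \sum_{i=1}^{l} y_i .
```

The second sum is the usual hinge loss on labeled instances. The third sum penalizes unlabeled instances that fall inside the margin, $C^{*}$ is the parameter that tunes their influence, and the constraint keeps the predicted class balance close to the labeled one. The term $\lvert f(x_j) \rvert$ is what makes the problem non-convex.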

Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to the reducer in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
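The bullets above can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the Hadoop implementation: the function names, the step-size schedule, the constants `eta0`, `lam`, and `c_u`, and the toy data are all hypothetical, and the class-balance constraint is left out for brevity.

```python
import random

def s3vm_sgd(labeled, unlabeled, seed, eta0=0.1, lam=0.01, c_u=0.5):
    """One SGD pass over the shuffled data; returns (w, b) of a linear classifier."""
    rng = random.Random(seed)
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(stream)                               # one run, random order
    w, b = [0.0] * len(stream[0][0]), 0.0
    for t, (x, y) in enumerate(stream):
        eta = eta0 / (1.0 + 0.01 * t)                 # steps get smaller over time
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y is None:                                 # unlabeled instance
            if score == 0.0:
                continue                              # classifier has no guess yet
            y, weight = (1 if score > 0 else -1), c_u  # use the current guess as label
        else:
            weight = 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]      # L2 shrinkage
        if y * score < 1.0:                           # misclassified or inside the margin
            w = [wi + eta * weight * y * xi for wi, xi in zip(w, x)]
            b += eta * weight * y
    return w, b

def objective(w, b, labeled, unlabeled, lam=0.01, c_u=0.5):
    """Regularizer + hinge loss on labeled data + margin penalty on unlabeled data."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    lab = sum(max(0.0, 1.0 - y * f(x)) for x, y in labeled) / len(labeled)
    unl = sum(max(0.0, 1.0 - abs(f(x))) for x in unlabeled) / len(unlabeled)
    return reg + lab + c_u * unl

def best_of_runs(labeled, unlabeled, runs=10):
    """Many random runs; keep the one with the lowest objective."""
    models = [s3vm_sgd(labeled, unlabeled, seed) for seed in range(runs)]
    return min(models, key=lambda m: objective(m[0], m[1], labeled, unlabeled))

# Toy data: two well-separated blobs, only two labeled points per class.
rng = random.Random(42)
def blob(cx, cy, n):
    return [(rng.gauss(cx, 0.7), rng.gauss(cy, 0.7)) for _ in range(n)]

pos, neg = blob(2.0, 2.0, 102), blob(-2.0, -2.0, 102)
labeled = [(x, 1) for x in pos[:2]] + [(x, -1) for x in neg[:2]]
unlabeled = pos[2:] + neg[2:]
truth = [1] * 100 + [-1] * 100

w, b = best_of_runs(labeled, unlabeled)
pred = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in unlabeled]
accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
print(f"accuracy on the unlabeled points: {accuracy:.2f}")
```

In the MapReduce version described on the slide, the shuffle would correspond to the mapper emitting instances in random order and the loop body to the reducer's update; because the objective is non-convex, different random orders can end in different solutions, hence the restarts.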

The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
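A similarity graph of this kind can be sketched in a few lines. The Gaussian similarity, the parameters `k` and `sigma`, and the function name `knn_graph` are illustrative assumptions; the slide does not say which similarity score is used.

```python
from math import exp

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: similarity}}, linking each point to its k nearest neighbors."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        for j in others[:k]:
            sim = exp(-sq_dist(p, points[j]) / (2.0 * sigma ** 2))
            graph[i][j] = sim   # store the edge in both directions
            graph[j][i] = sim   # so the graph stays symmetric
    return graph

# Two small clusters: edges should stay inside each cluster.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
graph = knn_graph(points, k=2)
for node, nbrs in graph.items():
    print(node, sorted(nbrs))
```

This brute-force version compares every pair of points, so it only suits small datasets; at scale the neighbor search is usually approximated.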

Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
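The iteration can be sketched as follows, assuming the common "clamped" variant in which labeled nodes keep their labels and every other node repeatedly takes the weighted average of its neighbors. The function name, the +1/-1 encoding, and the chain graph are illustrative assumptions.

```python
def propagate_labels(graph, seeds, tol=1e-6, max_iter=1000):
    """graph: {node: {neighbor: weight}}; seeds: {node: +1.0 or -1.0}.

    Returns {node: score}; the sign of the score is the predicted class.
    """
    scores = {n: seeds.get(n, 0.0) for n in graph}
    for _ in range(max_iter):
        change = 0.0
        new_scores = {}
        for node, nbrs in graph.items():
            if node in seeds:                       # labeled nodes stay clamped
                new_scores[node] = seeds[node]
                continue
            total = sum(nbrs.values())
            value = (sum(w * scores[m] for m, w in nbrs.items()) / total
                     if total else 0.0)             # weighted neighbor average
            change = max(change, abs(value - scores[node]))
            new_scores[node] = value
        scores = new_scores
        if change < tol:                            # repeat until convergence
            break
    return scores

# Chain 0 - 1 - 2 - 3 - 4 with unit weights; the two ends are labeled.
chain = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0},
         3: {2: 1.0, 4: 1.0}, 4: {3: 1.0}}
scores = propagate_labels(chain, seeds={0: 1.0, 4: -1.0})
print(scores)
```

On the chain the scores settle at a linear interpolation between the two labeled ends. The fixed point of this update is the solution of a linear system over the unlabeled nodes, which is the sense in which the procedure is equivalent to a matrix inversion.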

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break...

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
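As an illustration of the first bullet, a two-proportion z-test is one standard way to check whether a difference in conversion rate between A and B could be due to chance. The slides do not name a specific test, and the visitor and conversion counts below are made up.

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: A and B share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function, then the two-sided tail.
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p_value

# Hypothetical experiment: 10,000 users per variant.
z, p = two_proportion_z_test(conv_a=500, n_a=10000, conv_b=570, n_b=10000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value says the observed difference would be unlikely if both variants performed the same; it is the random assignment of users to A and B that allows the difference to be read as causal.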

• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B

• A was 8.5% better

A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why)
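The birth-date and drop-down examples suggest a simple sanity check: look for values that occur far more often than a real population would produce. A minimal sketch follows; the threshold, the function name `suspicious_values`, and the sample column are made up.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return {value: share} for values making up at least `min_share` of the column."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}

# Hypothetical birth-date column from a mandatory form field.
birth_dates = ["1984-03-12", "1990-07-30", "1911-11-11", "1911-11-11", "1975-01-02",
               "1911-11-11", "2001-01-01", "1968-09-23", "1911-11-11", "1988-12-05"]
flagged = suspicious_values(birth_dates, min_share=0.3)
print(flagged)
```

A value such as a default drop-down entry or a placeholder date shows up as a spike; it should be treated as missing data rather than as a finding.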

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
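A minimal sketch of step 5, assuming a toy one-dimensional dataset and a toy threshold classifier (both are illustrative stand-ins, not part of the course material). It shows k-fold cross-validation producing one score per fold and a rough 95% interval from the fold scores using a normal approximation; with only a handful of folds that interval is approximate at best.

```python
import math
import random
import statistics

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k near-equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Toy classifier: threshold at the midpoint of the two class means."""
    m0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    m1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (m0 + m1) / 2.0

def accuracy(threshold, xs, ys):
    """Fraction of points where 'x above threshold' agrees with the label."""
    hits = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for held_out in kfold_indices(len(xs), k, seed):
        test = set(held_out)
        train = [i for i in range(len(xs)) if i not in test]
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        scores.append(accuracy(t, [xs[i] for i in held_out],
                               [ys[i] for i in held_out]))
    return scores

def mean_ci(scores, z=1.96):
    """Mean with a normal-approximation interval across fold scores."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half

# Synthetic data (assumption): class 1 is shifted upward by 1.5.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]

scores = cross_validate(xs, ys, k=5)
mean_acc, lo, hi = mean_ci(scores)
print(f"accuracy = {mean_acc:.3f} (approx. 95% CI {lo:.3f} to {hi:.3f})")
```

If the reported accuracy looks unbelievably good on real data, that is the cue to check for a target leak before trusting the interval.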
[Diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2)–(5) until the test metrics are satisfactory.
[Diagram: Access Data; Pre-processing; Feature construction; Model scoring]
Book Recommendation
That’s all for our course…
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: sends data to the reducer in random order
• Reducer: updates the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
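The bullets above can be sketched in a single process as follows. This is an illustration under stated assumptions, not the implementation from the slides: the update rule (hinge-style update for labeled points, a down-weighted pseudo-label update for unlabeled points inside the margin), the step-size schedule, the choice to skip unlabeled points until the classifier is non-zero, and selecting the best run by accuracy on the labeled points are all choices made here. The shuffle plays the role of the mapper and the update loop the role of the reducer.

```python
import math
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def s3vm_sgd(data, dim, eta0=0.5, lam=0.01, unlabeled_weight=0.1, seed=0):
    """One pass in random order. data: list of (x, y) with y in {+1, -1, None}."""
    order = list(data)
    random.Random(seed).shuffle(order)      # "mapper": random order
    w = [0.0] * dim
    for t, (x, y) in enumerate(order, start=1):
        eta = eta0 / math.sqrt(t)           # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if not any(w):
                continue                    # no classifier yet, nothing to reinforce
            target = 1.0 if score >= 0 else -1.0   # pseudo-label from current model
            weight = unlabeled_weight
        else:
            target, weight = float(y), 1.0
        w = [wi * (1.0 - eta * lam) for wi in w]   # mild L2 shrinkage
        if target * score < 1.0:            # misclassified or inside the margin
            w = [wi + eta * weight * target * xi for wi, xi in zip(w, x)]
    return w

def labeled_accuracy(w, data):
    labeled = [(x, y) for x, y in data if y is not None]
    hits = sum((1 if dot(w, x) >= 0 else -1) == y for x, y in labeled)
    return hits / len(labeled)

def best_of_runs(data, dim, runs=5):
    """Many random runs; keep the one that does best on the labeled points."""
    candidates = [s3vm_sgd(data, dim, seed=s) for s in range(runs)]
    return max(candidates, key=lambda w: labeled_accuracy(w, data))

# Hypothetical toy data: two mirrored clusters, only every fifth point labeled.
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5]
points = [(2.0 + a, 2.0 + b) for a in offsets for b in offsets]
data = []
for i, (px, py) in enumerate(points):
    labeled = (i % 5 == 0)
    data.append(((px, py), 1 if labeled else None))
    data.append(((-px, -py), -1 if labeled else None))

w_best = best_of_runs(data, dim=2)
print("weights:", w_best, "labeled accuracy:", labeled_accuracy(w_best, data))
```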
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
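A minimal sketch of the similarity-graph idea, assuming a Gaussian (RBF) similarity and a symmetrized k-nearest-neighbor rule; the slides do not say which similarity or which k is used, so both are assumptions here.

```python
import math

def gaussian_similarity(a, b, sigma=1.0):
    """Similarity in (0, 1]: 1 for identical points, near 0 for distant ones."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: weight}}: each point is linked to its k most similar points.

    Edges are stored in both directions, so the graph is undirected.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(gaussian_similarity(points[i], points[j], sigma), j)
                for j in range(n) if j != i]
        sims.sort(reverse=True)
        for s, j in sims[:k]:
            graph[i][j] = s
            graph[j][i] = s
    return graph

# Two small, well-separated groups of points (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, sorted(neighbors))
```

With k = 2 no edge crosses between the two groups, which is the property that lets labels spread within a group without leaking across it.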
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
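A minimal sketch of the iteration, assuming a weighted graph given as a dictionary, real-valued scores in [-1, 1] standing in for two classes, and labeled nodes that are clamped to their known value on every pass. The function and variable names are illustrative.

```python
def propagate_labels(graph, seeds, max_iters=1000, tol=1e-9):
    """graph: {node: {neighbor: weight}}; seeds: {node: known score in [-1, 1]}.

    Each unlabeled node repeatedly takes the weighted average of its
    neighbors' scores; labeled nodes keep their seed value.
    """
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(max_iters):
        largest_change = 0.0
        updated = {}
        for node, neighbors in graph.items():
            if node in seeds:
                updated[node] = seeds[node]       # clamp labeled nodes
                continue
            total = sum(neighbors.values())
            new = (sum(w * scores[j] for j, w in neighbors.items()) / total
                   if total else 0.0)
            largest_change = max(largest_change, abs(new - scores[node]))
            updated[node] = new
        scores = updated
        if largest_change < tol:                  # converged
            break
    return scores

# Chain 0 - 1 - 2 - 3 - 4 with unit weights; only the two ends are labeled.
chain = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
scores = propagate_labels(chain, seeds={0: 1.0, 4: -1.0})
print(scores)
```

One way to read the "matrix inversion" bullet, in notation introduced here rather than taken from the slides: with P the row-normalized weight matrix, L the labeled nodes and U the unlabeled ones, the fixed point satisfies f_U = P_UU f_U + P_UL y_L, so f_U = (I - P_UU)^(-1) P_UL y_L. On the chain above that gives scores interpolating linearly between the two labeled ends.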
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
10 Minute Break…
Controlled Experiments
• A
• B
OEC
Overall Evaluation Criterion
Picking a good OEC is key
• Lesson 2: Get the data
Lesson 3: Prepare to be humbled
[Figure: Left Elevator, Right Elevator]
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden’s blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
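As one illustration of "statistical tests", here is a sketch of a two-sided two-proportion z-test on conversion counts. The choice of test and the visitor and conversion counts are assumptions made for the example; the slides do not specify which test is used.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between A and B.

    Returns (z, p_value). Uses the pooled rate under the null hypothesis
    that both variants convert at the same rate.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Hypothetical counts: 10,000 users per variant.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                   conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

In this made-up example B converts 12% better in relative terms, yet the p-value stays just above the conventional 0.05 threshold, so the difference could still plausibly be due to chance.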
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if you think they’re about the same
[Figure: variants A and B]
• A was 85 better
[Figure: variants A and B]
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, and has “popular searches”
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
[Figure: variants A and B]
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don’t raise your hand if they are about the same
Get the data; prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is “amazing,” find the flaw
Examples
If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
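One simple way to look for the kind of flaw in the first two examples is to check whether a single value accounts for a suspiciously large share of a column. The sketch below assumes a made-up birth-date column and an arbitrary 5% cut-off; both are illustrative only.

```python
from collections import Counter
from datetime import date, timedelta

def suspicious_defaults(values, min_share=0.05):
    """Return (value, share) pairs that make up at least min_share of the records."""
    total = len(values)
    counts = Counter(values)
    return [(v, c / total) for v, c in counts.most_common()
            if c / total >= min_share]

# Hypothetical column: 88 distinct real-looking dates plus 12 filler entries.
real = [(date(1970, 1, 1) + timedelta(days=37 * i)).isoformat() for i in range(88)]
birth_dates = real + ["01/01/01"] * 12

flagged = suspicious_defaults(birth_dates)
print(flagged)
```

A flagged value is only a prompt to investigate: a legitimate category can also be common, so the threshold needs to be set with the field's expected distribution in mind.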
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you’re the decision maker
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime were effective: the death rate fell from 18% to 1%
Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands
causal
If you don’t know where you are going, any road will take you there
-- Lewis Carroll
before
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Course Project: Due Oct 25th
Open Discussion on Course Project…
Gallery of Experiments
Contributed by the community
Azure Machine Learning Studio
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descentbull One run over the data in random order
bull Each misclassified or unlabeled instance moves
classifier a bit
bull Steps get smaller over time
Implementation on Hadoopbull Mapper send data to reducer in random order
bull Reducer update linear classifier for unlabeled
or misclassified instances
bull Many random runs to find best one
Deriving Knowledge from Data at Scale
The ManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifoldbull One can perform learning in a more meaningful
low-dimensional spacebull Avoids curse of dimensionality
Similarity Graphs
bull Idea compute similarity scores between instances
bull Create network where the nearest
neighbors are connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more meaningful
low-dimensional space
bull Avoids curse of dimensionality
Similarity Graphsbull Idea compute similarity scores between instances
bull Create a network where the nearest neighbors are
connected
Deriving Knowledge from Data at Scale
TheManifoldAssumption
The Assumption
bull Training data is (roughly) contained in a low
dimensional manifold
bull One can perform learning in a more
meaningful low-dimensional space
bull Avoids curse of dimensionality
SimilarityGraphs
bull Idea compute similarity scores between instances
bullCreate network where the nearest neighbors
are connected
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Label Propagation
Main Ideabull Propagate label information to neighboring instances
bull Then repeat until convergence
bull Similar to PageRank
Theorybull Known to converge under weak conditions
bull Equivalent to matrix inversion
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learningbull Only few training instances have labels
bull Unlabeled instances can still provide valuable signal
Different assumptions lead to different approachesbull Cluster assumption generative models
bull Low density assumption semi-supervised support vector machines
bull Manifold assumption label propagation
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
[Figure: variants A and B]
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
[Figure: variants A and B]
Differences: A has a taller search box (overall size is the same), has a magnifying glass icon, "popular searches"
B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
[Figure: variants A and B]
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing," find the flaw!
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
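A quick check for this kind of artifact is to look for single values that account for an implausibly large share of a field. The sketch below uses a made-up list of birth dates and an arbitrary threshold; both are assumptions for illustration.

```python
from collections import Counter


def overrepresented_values(values, threshold=0.05):
    """Return {value: share} for values whose share of all records exceeds threshold."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total > threshold}


# Made-up records: a spike on one date is more likely a form artifact than a real pattern.
birth_dates = [
    "1980-03-14", "1975-07-02", "1911-11-11", "1911-11-11",
    "1992-12-30", "1911-11-11", "1988-01-19", "1969-05-23",
]
suspects = overrepresented_values(birth_dates, threshold=0.25)
```

Anything the check flags is a prompt to look at how the field was collected, not proof of an error.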
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
[Diagram: Hubris → Measure and Control → Accept Results (avoid Semmelweis Reflex) → Fundamental Understanding]
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms.
Deriving Knowledge from Data at Scale
[Figure callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data for the Experiment]
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
[Diagram stages: Define Objective, Access and Understand the Data, Pre-processing, Feature and/or Target construction]
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
[Diagram stages: Feature selection, Model training, Model scoring, Evaluation, Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
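As an illustration of reporting a metric with an interval rather than a single number, the sketch below runs k-fold cross-validation on synthetic data with a toy threshold classifier. The data, the classifier, and the t-based interval are all assumptions for illustration; fold scores are not truly independent, so the interval is only a rough guide.

```python
import random
import statistics


def k_fold_accuracy(xs, ys, train_fn, k=5, seed=0):
    """Return the list of per-fold accuracies from k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = train_fn([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(xs[i]) == ys[i] for i in held_out)
        accs.append(correct / len(held_out))
    return accs


def train_threshold(xs, ys):
    """Toy 1-D classifier: cut halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    cut = (mean0 + mean1) / 2

    def predict(x):
        return int(x > cut) if mean1 > mean0 else int(x < cut)

    return predict


# Synthetic data: class 0 centered at 0.0, class 1 centered at 2.0.
rng = random.Random(42)
ys = [i % 2 for i in range(200)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

accs = k_fold_accuracy(xs, ys, train_threshold)
mean_acc = statistics.mean(accs)
# 2.776 is the two-sided 95% t critical value for 4 degrees of freedom (5 folds).
half_width = 2.776 * statistics.stdev(accs) / len(accs) ** 0.5
```

Here the classifier is fit only on the training part of each fold, and the same discipline applies to every preprocessing step; fitting anything on the held-out rows is one route by which target leaks inflate test metrics.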
Deriving Knowledge from Data at Scale
[Diagram stages: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split]
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
[Diagram stages: Access Data, Pre-processing, Feature construction, Model scoring]
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
Semi-Supervised SVM
Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time
Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one
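The bullets above can be sketched in a single process, without Hadoop, roughly as follows. This is a simplified illustration under stated assumptions: a linear classifier with no bias term, hinge-style updates, pseudo-labels taken from the current classifier for unlabeled instances, and a made-up toy dataset. The names `s3vm_sgd_pass` and `objective` are hypothetical, and the shuffled stream stands in for the random order a mapper would produce.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def s3vm_sgd_pass(labeled, unlabeled, dim, seed, lr0=0.1, lam=0.01):
    """One pass over the data in random order; returns linear weights."""
    rng = random.Random(seed)
    stream = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    rng.shuffle(stream)
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream):
        lr = lr0 / (1 + t)                    # steps get smaller over time
        score = dot(w, x)
        if y is None:
            if score == 0.0:                  # classifier has no opinion yet
                continue
            y = 1 if score > 0 else -1        # pseudo-label from current classifier
        violated = y * score < 1              # misclassified or inside the margin
        w = [
            (1 - lr * lam) * wi + (lr * y * xi if violated else 0.0)
            for wi, xi in zip(w, x)
        ]
    return w


def objective(w, labeled, unlabeled, lam=0.01):
    """Hinge loss on labeled data plus a margin penalty on unlabeled data."""
    loss = sum(max(0.0, 1 - y * dot(w, x)) for x, y in labeled)
    loss += sum(max(0.0, 1 - abs(dot(w, x))) for x in unlabeled)
    return loss + 0.5 * lam * dot(w, w)


# Made-up toy data: two labeled points and two unlabeled clusters.
data_rng = random.Random(0)
labeled = [((2.0, 2.0), 1), ((-2.0, -2.0), -1)]
unlabeled = [
    (s * 2 + data_rng.uniform(-0.5, 0.5), s * 2 + data_rng.uniform(-0.5, 0.5))
    for s in (1, -1)
    for _ in range(20)
]

# Many random runs; keep the one with the lowest objective.
runs = [s3vm_sgd_pass(labeled, unlabeled, dim=2, seed=s) for s in range(10)]
best_w = min(runs, key=lambda w: objective(w, labeled, unlabeled))


def predict(x):
    return 1 if dot(best_w, x) > 0 else -1
```

Each run depends only on its own shuffle, which is what makes the "many random runs" step easy to parallelize.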
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
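A similarity graph of this kind could be built as sketched below. The Gaussian similarity with bandwidth 1, the choice of k, and the sample points are illustrative assumptions; the slides do not fix a similarity measure.

```python
import math


def knn_graph(points, k=2):
    """Connect each point to its k nearest neighbors, weighted by similarity."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            w = math.exp(-d * d)      # similarity score; bandwidth 1 is arbitrary
            graph[i][j] = w
            graph[j][i] = w           # keep the graph symmetric
    return graph


# Made-up points forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(points, k=2)
```

With these points every edge stays inside its own group, so labels spread over such a graph would not leak between the two groups.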
Deriving Knowledge from Data at Scale
10 Minute Breakhellip
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
[Scoring pipeline diagram stages: Access Data, Pre-processing, Feature construction, Model scoring]
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course...
Deriving Knowledge from Data at Scale
The Manifold Assumption
The Assumption
• Training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality
Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected
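A minimal sketch of the similarity-graph idea, assuming a Gaussian similarity on Euclidean distance and a k-nearest-neighbor rule. The function names, the `sigma` parameter, and the sample points are hypothetical and chosen only for illustration; the slides do not prescribe a particular similarity score.

```python
import math


def gaussian_similarity(a, b, sigma=1.0):
    """Similarity in (0, 1]: 1 for identical points, decaying with distance."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))


def knn_graph(points, k=2, sigma=1.0):
    """Return {i: {j: weight}} connecting each point to its k most similar points.

    Edges are made symmetric, so i-j is present if either endpoint
    selected the other as a nearest neighbor.
    """
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        sims = [(gaussian_similarity(p, q, sigma), j)
                for j, q in enumerate(points) if j != i]
        sims.sort(reverse=True)
        for weight, j in sims[:k]:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


# Hypothetical 2-D instances forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 5.2), (5.2, 5.0)]
graph = knn_graph(points, k=2)
for node, neighbors in graph.items():
    print(node, sorted(neighbors))
```

With these points, every edge stays inside one of the two groups, which is the structure a graph-based learner can then exploit.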
Deriving Knowledge from Data at Scale
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
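The propagation loop can be sketched as follows. This is one simple variant (labeled nodes are clamped, unlabeled nodes take the weighted average of their neighbors), not necessarily the exact formulation behind the slide. The function name, the chain graph, and the 0-to-1 score encoding are assumptions for illustration, and every unlabeled node is assumed to have at least one neighbor.

```python
def propagate_labels(graph, labeled, max_iter=1000, tol=1e-9):
    """Iteratively average neighbor scores, keeping labeled nodes clamped.

    graph:   {node: {neighbor: weight}}
    labeled: {node: score in [0, 1]} for the few labeled instances
    Returns {node: score} after convergence (or max_iter sweeps).
    """
    scores = {node: labeled.get(node, 0.5) for node in graph}
    for _ in range(max_iter):
        largest_change = 0.0
        new_scores = {}
        for node, neighbors in graph.items():
            if node in labeled:
                new_scores[node] = labeled[node]  # clamp known labels
                continue
            total = sum(neighbors.values())
            value = sum(w * scores[j] for j, w in neighbors.items()) / total
            largest_change = max(largest_change, abs(value - scores[node]))
            new_scores[node] = value
        scores = new_scores
        if largest_change < tol:
            break
    return scores


# Hypothetical chain graph 0 - 1 - 2 - 3 - 4 with unit edge weights.
chain = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
# Only the two end points are labeled: 1.0 = positive class, 0.0 = negative.
result = propagate_labels(chain, labeled={0: 1.0, 4: 0.0})
print({node: round(score, 3) for node, score in result.items()})
```

On this chain the scores of the unlabeled nodes interpolate linearly between the two labeled ends (0.75, 0.5, 0.25). The fixed point of the loop is the solution of a linear system over the unlabeled nodes, which is the sense in which propagation is equivalent to a matrix inversion.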
Deriving Knowledge from Data at Scale
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation
Deriving Knowledge from Data at Scale
10 Minute Break...
Deriving Knowledge from Data at Scale
Controlled Experiments
Deriving Knowledge from Data at Scale
• A
• B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
[Image labels: Left Elevator, Right Elevator]
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO (Highest Paid Person's Opinion): stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
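One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion counts. The sketch below is a minimal version under the usual large-sample normal approximation; the function name and the counts are hypothetical and are not taken from any experiment in these slides.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Returns (z, p_value). Uses the pooled-proportion standard error and
    the normal approximation, so it assumes reasonably large samples.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 10,000 users per variant.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed gap would be unlikely if A and B truly converted at the same rate. Because users are randomly assigned to A or B, such a gap can be attributed to the treatment rather than to some other difference between the groups.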
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and "popular searches"; B has a big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes that a click maps to revenue. That seemed reasonable, but when the results look so extreme, find the flaw (the conversion rate is not the same; see why?)
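A quick check for the first two examples is to look for single values that hold an outsized share of a field. The sketch below is only an illustration: the function name, the 20% threshold, and the sample values are all hypothetical.

```python
from collections import Counter


def dominant_values(values, share_threshold=0.2):
    """Return [(value, share)] for values that make up an outsized share.

    A single value holding a large share of a free-form field (a birth
    date, a job title) often signals a default or a placeholder entry
    rather than a real pattern.
    """
    counts = Counter(values)
    total = len(values)
    return [(value, count / total)
            for value, count in counts.most_common()
            if count / total >= share_threshold]


# Hypothetical form submissions; the values are made up for illustration.
birth_dates = ["01/01/01", "03/14/82", "01/01/01", "11/11/11",
               "07/22/90", "01/01/01", "11/11/11", "12/05/75"]
flagged = dominant_values(birth_dates)
for value, share in flagged:
    print(f"suspicious: {value} appears in {share:.0%} of rows")
```

A flagged value is not proof of bad data, only a place to start looking for the flaw.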
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
• OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
[Diagram stages: Hubris, Measure and Control, Accept Results (avoid Semmelweis Reflex), Fundamental Understanding]
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936; 44(2)
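A small simulation can show how a strong correlation arises without any causal link. In the sketch below a hidden third factor drives both series. The data are synthetic and every name and number is hypothetical; none of it comes from the Oldenburg data.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Hypothetical confounder that drives both series; neither series has
# any effect on the other.
confounder = [rng.uniform(0.0, 1.0) for _ in range(500)]
storks = [100 * c + rng.gauss(0.0, 5.0) for c in confounder]
births = [2000 * c + rng.gauss(0.0, 100.0) for c in confounder]
r_observed = pearson(storks, births)
print(f"correlation(storks, births) = {r_observed:.2f}")

# Randomly assigning stork counts (as a controlled experiment would) breaks
# the link to the confounder, and the correlation largely disappears.
assigned_storks = [rng.uniform(0.0, 100.0) for _ in confounder]
r_assigned = pearson(assigned_storks, births)
print(f"correlation(assigned storks, births) = {r_assigned:.2f}")
```

The first correlation is high even though storks have no effect on births in the simulation. Randomizing the "treatment" removes it, which is why controlled experiments can support causal claims where observational correlations cannot.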
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you don't know where you are going, any road will take you there
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions...
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version...
Deriving Knowledge from Data at Scale
Course Project
Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project...
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment: feature selection and a large set of machine learning algorithms
Deriving Knowledge from Data at Scale
[Experiment screenshot callouts: Using classification algorithms; Evaluating the model; Splitting to Training and Testing Datasets; Getting Data For the Experiment]
Deriving Knowledge from Data at Scale
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
Feature selection
Model training
Model scoring
Evaluation
Train/Test split

5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
6. Iterate steps (2) – (5) until the test metrics are satisfactory.
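Step 5 can be sketched in a few lines. This is only an illustration under stated assumptions, not the course's own code: the synthetic two-class data, the nearest-centroid "model", and the function names (`make_data`, `fit_centroids`, `cross_validate`) are all hypothetical, chosen so the example needs nothing beyond the Python standard library.

```python
import math
import random
import statistics

def make_data(n, seed=0):
    """Synthetic 2-D points: class 0 centred at (0, 0), class 1 at (2, 2)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 2.0 * label
        data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    rng.shuffle(data)
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': the mean point of each class."""
    centroids = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return centroids

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

def cross_validate(data, k=10):
    """Return the per-fold accuracies of k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit_centroids(train)  # fit on the training folds only
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return scores

scores = cross_validate(make_data(400), k=10)
mean = statistics.mean(scores)
# Rough 95% interval from the fold-to-fold spread (normal approximation).
half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

The interval is approximate, since the folds share training data and are not independent. The point is only that a test metric is reported with its spread, and that the model is never fitted on the fold it is scored on.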

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That’s all for our course…
Label Propagation
Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank
Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
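The bullets above can be sketched as a small iteration. This is a minimal version under assumptions of my own: an unweighted graph given as adjacency lists, scores in [0, 1], and labelled nodes clamped to their known value on every pass. The function name `propagate_labels` and the four-node chain are hypothetical.

```python
def propagate_labels(neighbors, labels, tol=1e-9, max_iter=10_000):
    """Iteratively average neighbour scores; labelled nodes stay clamped.

    neighbors: dict node -> list of neighbouring nodes (undirected graph)
    labels:    dict node -> known score in [0, 1] for the labelled nodes
    """
    scores = {node: labels.get(node, 0.5) for node in neighbors}
    for _ in range(max_iter):
        max_change = 0.0
        new_scores = {}
        for node, nbrs in neighbors.items():
            if node in labels:      # clamp known labels
                new_scores[node] = labels[node]
            else:                   # propagate from the neighbours
                new_scores[node] = sum(scores[n] for n in nbrs) / len(nbrs)
            max_change = max(max_change, abs(new_scores[node] - scores[node]))
        scores = new_scores
        if max_change < tol:        # converged
            break
    return scores

# Chain graph a - b - c - d with only the two end points labelled.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = propagate_labels(chain, {"a": 1.0, "d": 0.0})
print(result)
```

At convergence each unlabelled score equals the average of its neighbours (here b approaches 2/3 and c approaches 1/3). That fixed point is the solution of a linear system, which is the sense in which the procedure is equivalent to a matrix inversion.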
Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…

Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator
Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
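One common way to run such a test on conversion counts is a two-proportion z-test. The sketch below is an illustration, not the procedure used in the experiments discussed here; the function name and the visitor and conversion counts are made up for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: control A and treatment B, 20,000 users each.
z, p = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed difference would be unlikely if A and B truly converted at the same rate. The causal reading comes from the random assignment of users to A and B, not from the test itself.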

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same
A B

• A was 85 better

A B
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw
Examples
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
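A quick way to catch the first two kinds of flaw is to look for a single value that takes up an implausibly large share of a field. The sketch below assumes a 5% threshold and uses invented birth dates; `suspicious_values` is a hypothetical helper, not part of any library.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return values whose share of all records is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}

# Invented data: 88 distinct birth dates plus 12 copies of a "lazy" entry.
dates = [f"{1 + i % 28:02d}/{1 + i % 12:02d}/{50 + i % 40}" for i in range(88)]
dates += ["01/01/01"] * 12

flagged = suspicious_values(dates)
print(flagged)
```

A flagged value is a prompt to investigate rather than proof of an error: a default drop-down entry or a placeholder date shows up this way, but so can a genuinely common value.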

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

Hubris
Measure and Control
Accept Results: avoid Semmelweis Reflex
Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
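A small simulation shows how a strong correlation can appear with no causal link at all. Everything here is synthetic and hypothetical: the slide does not say what drives the Oldenburg numbers, so the common cause (`size`) and the coefficients are assumptions made purely for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)
# Hypothetical common cause, e.g. the size of the area being observed.
size = [rng.uniform(0, 10) for _ in range(2000)]
storks = [3 * s + rng.gauss(0, 2) for s in size]   # depends on size only
births = [5 * s + rng.gauss(0, 4) for s in size]   # depends on size only

raw = pearson(storks, births)

# Subtract the effect of the common cause (known here by construction),
# then correlate what is left over.
stork_resid = [st - 3 * s for st, s in zip(storks, size)]
birth_resid = [b - 5 * s for b, s in zip(births, size)]
adjusted = pearson(stork_resid, birth_resid)

print(f"raw correlation = {raw:.2f}, after adjusting = {adjusted:.2f}")
```

Storks and births never influence each other in this simulation, yet the raw correlation is high. Once the common cause is accounted for, the correlation is close to zero. With real data the common cause is usually unknown, which is why a controlled experiment is needed.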

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course…

Conclusion
Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide valuable signal
Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low-density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

10 Minute Break…
Controlled Experiments

• A
• B

OEC
Overall Evaluation Criterion
Picking a good OEC is key

• Lesson 2: GET THE DATA
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
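One common way to check that a difference between A and B is not due to chance is a two-proportion z-test on conversion counts. The sketch below assumes a binary outcome per user and large samples. The counts are hypothetical and the function name is illustrative.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

With these made-up counts the treatment converts 12% better in relative terms, yet the p-value sits just above 0.05, which is why the raw lift alone is not enough to declare a winner.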
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B

• A was 85 better

A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

Get the data; prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples:
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
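The placeholder-value examples above can be checked mechanically by looking for values that take an implausibly large share of a column. A minimal sketch, assuming a hypothetical birth-date column and an arbitrary 5% threshold (both are illustrative):

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs whose share of all rows is at least min_share."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]


# Hypothetical birth-date column: 90 distinct dates plus two placeholder values.
birth_dates = [
    f"19{70 + i % 30:02d}-{1 + i % 12:02d}-{1 + i % 28:02d}" for i in range(90)
]
birth_dates += ["1911-11-11"] * 7 + ["2001-01-01"] * 3

for value, share in suspicious_values(birth_dates):
    print(f"{value}: {share:.0%} of rows")
```

The right threshold depends on the field: a 5% share is suspicious for a birth date but normal for a country or a job title, so a flagged value is a prompt to look for the flaw rather than proof of one.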
Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it.
-- Upton Sinclair
Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
• 15% in his ward, staffed by doctors and students
• 2% in the ward at the hospital attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
• Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
• Doctors were performing autopsies each morning on cadavers
• Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
• Chlorine and lime was effective: death rate fell from 18% to 1%

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding

Hubris
Measure and Control
Accept Results (avoid Semmelweis Reflex)
Fundamental Understanding

• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
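The stork example can be mimicked with a small simulation. The sketch below uses purely synthetic numbers (not the Oldenburg data) and assumes a hidden driver that moves both series. The raw correlation is strong, but it largely disappears once the driver is controlled for with a partial correlation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after linearly controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data: a hidden driver moves both series; neither causes the other.
rng = random.Random(0)
driver = [rng.gauss(0, 1) for _ in range(2000)]
storks = [d + rng.gauss(0, 0.3) for d in driver]
humans = [d + rng.gauss(0, 0.3) for d in driver]

print(f"raw correlation:        {pearson(storks, humans):.2f}")
print(f"controlling for driver: {partial_correlation(storks, humans, driver):.2f}")
```

In observational data the driver is usually unknown or unmeasured, which is why a controlled experiment, where assignment is randomized, is the more reliable route to a causal claim.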
Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Course Project
Due Oct 25th

Open Discussion on Course Project…
Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms.

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
bull A
bull B
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 GET THE DATA
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Lesson 2 Get the data
Deriving Knowledge from Data at Scale
Lesson 3 Prepare to be humbledLeft Elevator Right Elevator
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
Hubris
Measure and Control
Accept Results (avoid Semmelweis Reflex)
Fundamental Understanding
Deriving Knowledge from Data at Scale
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
Deriving Knowledge from Data at Scale
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and
storks when you were three is still not right,
despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
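A minimal sketch of how a shared trend can produce this kind of correlation. The numbers below are made up for illustration (they are not the Oldenburg data), and the shared-growth explanation is an assumption, not something the slide states. Both series simply grow over the years, so their levels correlate strongly while their year-over-year changes do not.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def diffs(values):
    """Year-over-year changes, which remove a shared linear trend."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical yearly counts: both grow over time for unrelated reasons.
storks = [130 + 15 * i + (3 if i % 2 else -3) for i in range(7)]
people = [55000 + 3000 * i + (400 if i % 3 == 0 else -200) for i in range(7)]

r_levels = pearson(storks, people)                 # dominated by the shared trend
r_changes = pearson(diffs(storks), diffs(people))  # trend removed

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

A high correlation of levels that disappears once the trend is removed is a hint that a third factor (here, time) drives both series. Only a controlled experiment can establish causality.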
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
But… don't try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
"If you don't know where you are going, any road will take you there."
-- Lewis Carroll
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
• Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40-page journal version…
Deriving Knowledge from Data at Scale
Course Project: Due Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Project…
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your experiment:
feature selection, a large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
Deriving Knowledge from Data at Scale
httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at Scale
Customer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints
if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data
acquisition (missing data, outliers due to errors in the data collection process,
more sophisticated biases due to the data collection procedure, etc.)
3. Frame the problem in terms of a machine learning problem (classification,
regression, ranking, clustering, forecasting, outlier detection, etc.); some
combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights,
targets, etc., which can be used for modeling. Feature construction can often
be improved with domain knowledge. The target must be identical to (or a very
good proxy for) the quantitative metric identified in step 1.
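Step 4 could look roughly like the sketch below. The record layout, the field names (`tenure_months`, `monthly_spend`, `cancelled`) and the choice of median imputation are all hypothetical, picked only to illustrate turning raw rows into a feature matrix and a target vector.

```python
import statistics

# Hypothetical raw customer records with gaps from data collection (step 2).
RAW = [
    {"tenure_months": 12, "monthly_spend": 40.0, "cancelled": "yes"},
    {"tenure_months": None, "monthly_spend": 55.0, "cancelled": "no"},
    {"tenure_months": 30, "monthly_spend": None, "cancelled": "no"},
    {"tenure_months": 3, "monthly_spend": 20.0, "cancelled": "yes"},
]
FEATURES = ["tenure_months", "monthly_spend"]

def build_modeling_dataset(rows, feature_names, target_name):
    """Return (X, y): feature rows with missing values imputed, and a 0/1 target."""
    # Median of the observed values per feature, used to fill missing entries.
    medians = {
        name: statistics.median(r[name] for r in rows if r[name] is not None)
        for name in feature_names
    }
    X, y = [], []
    for r in rows:
        X.append([r[name] if r[name] is not None else medians[name]
                  for name in feature_names])
        # The target encodes the quantity the metric from step 1 is defined on.
        y.append(1 if r[target_name] == "yes" else 0)
    return X, y

X, y = build_modeling_dataset(RAW, FEATURES, "cancelled")
print(X)
print(y)
```

In a real pipeline the imputation values would be computed on the training portion only, so that nothing from the test data leaks into the features.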
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test and evaluate, taking care to control
bias/variance and to ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here); be vigilant
against target leaks (which typically lead to
unbelievably good test metrics). This is the
ML-heavy step.
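One way to report a metric with an interval, as step 5 asks, is k-fold cross-validation. The sketch below uses synthetic one-feature data and a toy nearest-centroid classifier, both assumptions made only to keep the example self-contained. The interval is a rough normal approximation over fold scores, which are not fully independent, so treat it as indicative.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(xs, ys):
    """Toy model: the mean feature value of each class."""
    return {label: statistics.mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def cross_validate(xs, ys, k=5, seed=0):
    """Return mean accuracy, an approximate 95% interval, and per-fold scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half_width, mean + half_width), scores

# Synthetic data: two overlapping classes on a single feature.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

mean_acc, (ci_low, ci_high), fold_scores = cross_validate(xs, ys, k=5)
print(f"accuracy = {mean_acc:.3f}, approx. 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```

If a model scores far above what the class overlap allows, that is the "unbelievably good" signal that calls for a target-leak check.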
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature construction
Model scoring
Deriving Knowledge from Data at Scale
Book Recommendation
Deriving Knowledge from Data at Scale
That's all for our course…
Deriving Knowledge from Data at Scale
OEC
Overall Evaluation Criterion
Picking a good OEC is key
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data
Deriving Knowledge from Data at Scale
Lesson 3: Prepare to be humbled
Left Elevator, Right Elevator
Deriving Knowledge from Data at Scale
• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
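As an illustration of the statistical test mentioned above, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts. The counts are invented, and this particular test is one common choice for comparing conversion rates, not necessarily the one used in the experiments shown in the slides.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of converting users in control (A) and treatment (B)
    n_a, n_b: number of users exposed to each variant
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 5.0% vs 5.5% conversion, 20,000 users per variant.
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value says the observed difference would be unlikely if the variants were truly identical. It does not by itself say the difference is large enough to matter for the OEC.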
Deriving Knowledge from Data at Scale
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if you think they're about the same
A B
Deriving Knowledge from Data at Scale
• A was 8.5% better
Deriving Knowledge from Data at Scale
A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon,
"popular searches"
B has big search button
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same
Deriving Knowledge from Data at Scale
Get the data; prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples:
If you have a mandatory birth date field and people think it's
unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first
alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable, but when the results look so extreme, find
the flaw (conversion rate is not the same; see why)
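A quick check for the first two examples is to look for values that occur far more often than chance would suggest. The sketch below is only an illustration: the function name, the 20% threshold and the sample birth dates are all made up.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return (value, share) pairs for values making up at least min_share of records."""
    counts = Counter(values)
    total = len(values)
    return [(value, count / total)
            for value, count in counts.most_common()
            if count / total >= min_share]

# Hypothetical mandatory birth-date field filled in by reluctant users.
birth_dates = [
    "03/12/84", "11/11/11", "07/30/90", "11/11/11", "01/01/01",
    "10/02/75", "11/11/11", "01/01/01", "12/25/88", "05/17/69",
]

flagged = suspicious_values(birth_dates, min_share=0.2)
for value, share in flagged:
    print(f"{value}: {share:.0%} of records")
```

Real birth dates are spread thinly across many values, so any single date holding a large share of the records is more likely a placeholder or a form default than a real pattern.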
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
• Lesson 2: Get the data

Lesson 3: Prepare to be humbled
Left Elevator | Right Elevator

• Lesson 1
• Lesson 2
• Lesson 3
15 Bing
• HiPPO: stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk
• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
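One common way to check that a difference is "not due to chance" is a two-proportion z-test on conversion counts. The sketch below is an illustration only: the function name and the counts are hypothetical, and real experiments may call for a different test depending on the metric.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: A and B convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return z, p_value

# Hypothetical counts, invented for illustration only.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the observed gap would be unlikely if the two variants truly converted at the same rate; it is the random assignment of users to A and B that lets the gap be read causally.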
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same
[Figure: variants A and B]

• A was 85 better
[Figure: variants A and B]
Differences: A has a taller search box (overall size is the same), a magnifying glass icon, and “popular searches”; B has a big search button.
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same
[Figure: variants A and B]
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled
Any statistic that appears interesting is almost certainly a mistake.
If something is “amazing”, find the flaw!
Examples:
• If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. It seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
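A simple frequency check can surface placeholder values like the ones in these examples. The sketch below is a hedged illustration: `suspicious_values`, the 5% cutoff, and the sample records are all assumptions made for the example, not part of the original material.

```python
from collections import Counter

def suspicious_values(values, min_share=0.05):
    """Return {value: share} for values that make up at least min_share of records.

    In a field that should be spread out (such as birth dates), one value
    holding a large share usually signals a default or a placeholder.
    """
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}

# Invented records: 80 distinct plausible dates plus two placeholder values.
plausible = [f"{m}/{d}/80" for m in range(1, 5) for d in range(1, 21)]
birth_dates = ["11/11/11"] * 12 + ["01/01/01"] * 8 + plausible

flagged = suspicious_values(birth_dates)
print(flagged)
```

The same check applied to a drop-down field would flag the alphabetically first option if it were left as the default.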
Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you’re the decision maker
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna’s General Hospital, an important teaching/research hospital, in the 1840s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime was effective: the death rate fell from 18% to 1%
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding
[Diagram: Hubris; Measure and Control; Accept Results (avoid Semmelweis Reflex); Fundamental Understanding]
• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
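The stork example can be reproduced in miniature. The sketch below uses synthetic numbers, not the Oldenburg data: both series are generated from the same hidden time trend and never from each other, yet their Pearson correlation comes out high. The series names and coefficients are assumptions made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)
years = range(10)

# Synthetic series: each depends only on the year (the confounder) plus noise.
storks = [130 + 12 * t + rng.gauss(0, 5) for t in years]
people = [55_000 + 2_500 * t + rng.gauss(0, 1_000) for t in years]

r = pearson(storks, people)
print(f"correlation between storks and people: r = {r:.2f}")
```

Nothing in the generating code lets storks influence people, so the strong correlation reflects only the shared trend. A controlled experiment removes such confounders by randomizing the treatment.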
Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal
If you don’t know where you are going, any road will take you there
-- Lewis Carroll

before
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…
Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…
Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms
Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22
Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1 Define the objective and quantify it with a metric ndash optionally with constraints
if any This typically requires domain knowledge
2 Collect and understand the data deal with the vagaries and biases in the data
acquisition (missing data outliers due to errors in the data collection process
more sophisticated biases due to the data collection procedure etc
3 Frame the problem in terms of a machine learning problem ndash classification
regression ranking clustering forecasting outlier detection etc ndash some
combination of domain knowledge and ML knowledge is useful
4 Transform the raw data into a ldquomodeling datasetrdquo with features weights
targets etc which can be used for modeling Feature construction can often
be improved with domain knowledge Target must be identical (or a very
good proxy) of the quantitative metric identified step 1
Deriving Knowledge from Data at Scale
Feature selection
Model training
Model scoring
Evaluation
Train Test split
5 Train test and evaluate taking care to control
biasvariance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here) be vigilant
against target leaks (which typically leads to
unbelievably good test metrics) ndash this is the
ML heavy step
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand
the data
Pre-processing
Feature andor
Target
construction
Feature selection
Model training
Model scoring
Evaluation
Train Test split
6 Iterate steps (2) ndash (5) until the test metrics are satisfactory
Deriving Knowledge from Data at Scale
Access Data
Pre-processing
Feature
construction
Model scoring
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Book
Recommendation
Deriving Knowledge from Data at Scale
Thatrsquos all for our coursehellip
Deriving Knowledge from Data at Scale
bull Lesson 1
bull Lesson 2
bull Lesson 3
15 Bing
Deriving Knowledge from Data at Scale
bull HiPPO stop the project
From Greg Lindenrsquos Blog httpglindenblogspotcom200604early-amazon-shopping-carthtml
Deriving Knowledge from Data at Scale
TED talk
Deriving Knowledge from Data at Scale
bull Must run statistical tests to confirm differences are not due to chance
bull Best scientific way to prove causality ie the changes in metrics are caused by changes introduced in the treatment(s)
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project
Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment. For feature selection, large set of machine learning algorithms

Using classification algorithms
Evaluating the model
Splitting to Training and Testing Datasets
Getting Data For the Experiment

http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset", with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.

Feature selection
Model training
Model scoring
Evaluation
Train / Test split
5. Train, test and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here); be vigilant against target leaks (which typically lead to unbelievably good test metrics) – this is the ML-heavy step.
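Step 5 can be sketched in a few lines. This is a minimal illustration using only the standard library, not the course's tooling: the data are simulated, the classifier is a simple 1-D nearest-centroid threshold, and `make_data`, `fit_threshold` and `cross_validate` are hypothetical helper names. The interval is a rough t-based one over the 5 fold scores; fold scores are not fully independent, so treat it as approximate.

```python
import random
import statistics


def make_data(n=300, seed=1):
    """Simulated 1-D data: class 1 is shifted by 2 relative to class 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_threshold(train):
    """Nearest-centroid rule in 1-D: cut at the midpoint of the class means."""
    m0 = statistics.mean(x for x, y in train if y == 0)
    m1 = statistics.mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, test):
    """Share of test rows whose predicted class matches the label."""
    return sum((x > threshold) == bool(y) for x, y in test) / len(test)


def cross_validate(data, k=5, seed=0):
    """k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


scores = cross_validate(make_data())
mean_acc = statistics.mean(scores)
sem = statistics.stdev(scores) / len(scores) ** 0.5

T_CRIT_4DF = 2.776  # two-sided 95% t critical value for 5 folds (4 d.o.f.)
ci = (mean_acc - T_CRIT_4DF * sem, mean_acc + T_CRIT_4DF * sem)

print(f"accuracy {mean_acc:.3f}, approx. 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

Reporting the interval rather than a single number makes it harder to mistake noise for an improvement. A test score far above anything plausible for the problem is a cue to look for a target leak.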

Define Objective
Access and Understand the data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train / Test split
6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That's all for our course…

• HiPPO: stop the project
From Greg Linden's Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

TED talk

• Must run statistical tests to confirm differences are not due to chance
• Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
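One common way to run such a test on a conversion metric is a two-proportion z-test. The sketch below is a minimal illustration under stated assumptions, not a method prescribed by the slides: the function name and the visitor and conversion counts are hypothetical, and it uses the normal approximation with a pooled standard error.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of converting users in control / treatment.
    n_a, n_b: number of users exposed to control / treatment.
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B are the same.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 5.0% conversion in control, 5.5% in treatment.
z, p_value = two_proportion_z_test(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value says the observed difference would be unlikely if A and B were truly the same. It is random assignment of users to A and B that licenses the causal reading.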

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if you think they're about the same
A B

• A was 8.5% better

A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon, "popular searches"
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don't raise your hand if they are about the same

get the data, prepare to be humbled

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples
If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
If you have an optional drop down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why)
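The birth-date and drop-down examples suggest a cheap sanity check: look for values that occur far more often than they plausibly should. A minimal sketch, assuming a hypothetical helper `suspicious_values`, an arbitrary 5% threshold, and made-up records:

```python
from collections import Counter


def suspicious_values(values, min_share=0.05, top_n=3):
    """Return (value, share) for the most common values whose share of all
    records is at least min_share. The threshold is an illustrative choice."""
    counts = Counter(values)
    total = len(values)
    return [(value, count / total)
            for value, count in counts.most_common(top_n)
            if count / total >= min_share]


# Made-up records: 84 distinct dates plus two over-used placeholder entries.
ordinary = [f"{m:02d}/{d:02d}/{70 + (m * d) % 30}"
            for m in range(1, 13) for d in range(1, 8)]
birth_dates = ["01/01/01"] * 12 + ["11/11/11"] * 9 + ordinary

flagged = suspicious_values(birth_dates)
for value, share in flagged:
    print(f"{value}: {share:.1%} of records")
```

A flagged value is not proof of bad data, only a prompt to go and find the flaw before trusting any statistic built on that field.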

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital, attended by midwives

Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months and death rate fell significantly when he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experiments with cleansing agents
  • Chlorine and lime was effective: death rate fell from 18% to 1%
Deriving Knowledge from Data at Scale
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if you think theyrsquore about the same
A B
Deriving Knowledge from Data at Scale
bull A was 85 better
Deriving Knowledge from Data at Scale
A
B
Differences A has taller search box (overall size is the same) has magnifying glass icon
ldquopopular searchesrdquo
B has big search button
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
A B
bull Raise your right hand if you think A Wins
bull Raise your left hand if you think B Wins
bull Donrsquot raise your hand if they are the about the same
Deriving Knowledge from Data at Scale
get the data prepare to be humbled
Deriving Knowledge from Data at Scale
Any statistic that appears interesting is almost certainly a mistake
If something is ldquoamazingrdquo find the flaw
Examples
If you have a mandatory birth date field and people think itrsquos
unnecessary yoursquoll find lots of 111111 or 010101
If you have an optional drop down do not default to the first
alphabetical entry or yoursquoll have lots jobs = Astronaut
The previous Office example assumes click maps to revenue
Seemed reasonable but when the results look so extreme find
the flaw (conversion rate is not the same see why)
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1

Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States

Fundamental Understanding

[Diagram stages: Hubris; Measure and Control; Accept Results, avoid Semmelweis Reflex; Fundamental Understanding]

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and
storks when you were three is still not right,
despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)

Women have smaller palms and live 6 years longer
on average
But… don’t try to bandage your hands
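
The palm-size example can be reproduced with a small simulation. The sketch below is illustrative only: the means and spreads for palm width and lifespan are made-up numbers, not data from the slides. It assumes that sex drives both variables and that palm size has no effect on lifespan at all, then compares the pooled correlation with the correlation inside each group.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n=5000, seed=0):
    """Generate (is_female, palm_cm, lifespan_years) rows.

    All numbers are hypothetical. Sex is the only thing that
    influences palm size and lifespan; they are otherwise independent.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        female = rng.random() < 0.5
        palm = rng.gauss(17.0 if female else 19.0, 0.8)
        life = rng.gauss(81.0 if female else 75.0, 8.0)
        rows.append((female, palm, life))
    return rows


rows = simulate()
women = [r for r in rows if r[0]]
men = [r for r in rows if not r[0]]

pooled = pearson([r[1] for r in rows], [r[2] for r in rows])
within_women = pearson([r[1] for r in women], [r[2] for r in women])
within_men = pearson([r[1] for r in men], [r[2] for r in men])

print(f"pooled correlation:       {pooled:+.3f}")
print(f"within women correlation: {within_women:+.3f}")
print(f"within men correlation:   {within_men:+.3f}")
```

Under these assumptions the pooled data show a clear negative correlation between palm size and lifespan, while the correlation inside each group is close to zero. The association comes entirely from the confounder, which is why bandaging your hands would not help.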

If you don’t know where you are going, any road will take you there
-- Lewis Carroll

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

[Experiment canvas callouts: Getting Data For the Experiment; Splitting to Training and Testing Datasets; Using classification algorithms; Evaluating the model]

httpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22

Customer Churn Model

Deployed web service endpoints
that can be consumed by applications
and for batch processing

[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric, optionally with constraints,
if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data
acquisition (missing data, outliers due to errors in the data collection process,
more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification,
regression, ranking, clustering, forecasting, outlier detection, etc.); some
combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights,
targets, etc., which can be used for modeling. Feature construction can often
be improved with domain knowledge. The target must be identical to (or a very
good proxy for) the quantitative metric identified in step 1.

[Workflow diagram: Feature selection; Train/Test split; Model training; Model scoring; Evaluation]
5. Train, test and evaluate, taking care to control
bias/variance and ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here); be vigilant
against target leaks (which typically lead to
unbelievably good test metrics). This is the
ML-heavy step.
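
Step 5 asks for metrics reported with confidence intervals, with cross-validation as one way to get them. The sketch below is a minimal illustration on synthetic data, not the course's own tooling: the dataset, the one-feature threshold "model", and the function names are all assumptions made for the example. The interval uses a normal approximation over the fold scores, which is only rough because folds share training data.

```python
import random
import statistics


def make_data(n=400, seed=1):
    """Synthetic binary data: one feature, classes centred at -1 and +1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = int(rng.random() < 0.5)
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        data.append((x, y))
    return data


def fit_threshold(train):
    """'Train' a toy classifier: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2.0


def accuracy(threshold, test):
    """Fraction of test rows where (x > threshold) matches the label."""
    hits = sum(int(x > threshold) == y for x, y in test)
    return hits / len(test)


def cross_validate(data, k=10, seed=0):
    """k-fold cross-validation: each fold is held out exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


def mean_with_ci(scores, z=1.96):
    """Mean score with an approximate 95% interval (normal approximation)."""
    m = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / len(scores) ** 0.5
    return m, m - half_width, m + half_width


scores = cross_validate(make_data())
mean_acc, ci_low, ci_high = mean_with_ci(scores)
print(f"accuracy = {mean_acc:.3f} (approx. 95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The threshold is refit inside every fold using only that fold's training rows. Fitting anything on the full dataset before splitting is one common way a target leak creeps in, so a test accuracy far above what the problem plausibly allows is a reason to audit the features rather than celebrate.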

[Workflow diagram: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Train/Test split; Model training; Model scoring; Evaluation]
6. Iterate steps (2)–(5) until the test metrics are satisfactory.

[Workflow diagram: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That’s all for our course…

• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if you think they’re about the same
A B

• A was 85 better

A
B
Differences: A has taller search box (overall size is the same), has magnifying glass icon,
“popular searches”
B has big search button
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

A B
• Raise your right hand if you think A Wins
• Raise your left hand if you think B Wins
• Don’t raise your hand if they are about the same

Get the data; prepare to be humbled
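
Getting the data for an A/B comparison usually ends in a significance test on the two conversion rates. The following is a minimal sketch of a two-sided two-proportion z-test. The visitor and conversion counts are hypothetical and are not taken from the experiments on the slides; the function name is also an assumption for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_* are conversion counts, n_* are visitor counts.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 20,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=1085, n_a=20000, conv_b=1000, n_b=20000)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these made-up counts, variant A converts noticeably better in relative terms, yet the p-value lands just above the conventional 0.05 threshold. A show of hands would have called a winner; the test says the sample is not yet large enough to rule out chance.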

Any statistic that appears interesting is almost certainly a mistake
If something is “amazing”, find the flaw
Examples
If you have a mandatory birth date field and people think it’s
unnecessary, you’ll find lots of 11/11/11 or 01/01/01
If you have an optional drop-down, do not default to the first
alphabetical entry, or you’ll have lots of jobs = Astronaut
The previous Office example assumes click maps to revenue
It seemed reasonable, but when the results look so extreme, find
the flaw (conversion rate is not the same; see why)
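
The birth date and drop-down examples suggest a simple check before trusting a field: look for single values that account for an implausibly large share of the rows. The sketch below is one way to do that, assuming a hypothetical list of date strings and an arbitrary 5% threshold; both are illustrative choices, not values from the slides.

```python
from collections import Counter


def suspicious_values(values, min_share=0.05):
    """Return (value, count, share) for values that dominate a column.

    A value is flagged when it makes up at least `min_share` of all rows,
    which often signals a form default or a throwaway entry.
    """
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, count / total)
        for value, count in counts.most_common()
        if count / total >= min_share
    ]


# Hypothetical column: 80 distinct birth dates plus 20 throwaway entries.
birth_dates = [
    f"{(i % 12) + 1:02d}/{(i % 28) + 1:02d}/{60 + i % 40:02d}" for i in range(80)
]
birth_dates += ["01/01/01"] * 20

flagged = suspicious_values(birth_dates)
for value, count, share in flagged:
    print(f"{value}: {count} rows ({share:.0%})")
```

The right threshold depends on the field. For a birth date almost any repeated value above a fraction of a percent deserves a look, while for a low-cardinality field such as country a large share can be entirely legitimate.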
Deriving Knowledge from Data at Scale
Data Trumps Intuition
Deriving Knowledge from Data at Scale
Sir Ken Robinson
Deriving Knowledge from Data at Scale
bull OEC = Overall Evaluation Criterion
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
Deriving Knowledge from Data at Scale
It is difficult to get a man to understand something when his
salary depends upon his not understanding it
-- Upton Sinclair
Deriving Knowledge from Data at Scale
Hubris
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)

Women have smaller palms and live 6 years longer on average
But… don't try to bandage your hands

causal

If you don't know where you are going, any road will take you there
-- Lewis Carroll

before

• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
Get the data
• Prepare to be humbled
The less data, the stronger the opinions…

Out of Class Reading
Eight (8) page conference paper
40 page journal version…

Course Project: Due Oct 25th

Open Discussion on Course Project…

Gallery of Experiments
Contributed by the community

Azure Machine Learning Studio

Sample Experiments
To help you get started

Experiment
Tools that you can use in your experiment: feature selection, a large set of machine learning algorithms

[Figure: Azure Machine Learning Studio experiment, annotated: getting data for the experiment; splitting into training and testing datasets; using classification algorithms; evaluating the model]
http://gallery.azureml.net/browse/?tags=[%22Azure%20ML%20Book%22

Customer Churn Model

Deployed web service endpoints that can be consumed by applications and for batch processing

[Figure: workflow stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction]
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.); some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.

[Figure: workflow stages: Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
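
As a rough illustration of step 5, the sketch below runs k-fold cross-validation and reports accuracy with an approximate confidence interval. Everything in it is an assumption made for the example: the synthetic one-feature dataset, the toy threshold "model", and the normal-approximation interval (1.96 standard errors over the fold scores, which is crude for a small number of folds).

```python
from math import sqrt
from statistics import mean, stdev


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train_idx, test_idx)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means of one feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def cross_validate(xs, ys, k=5):
    """Return (mean accuracy, approximate 95% CI half-width) over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        # Fit on the training folds only, so the test fold stays unseen.
        threshold = fit_threshold([xs[i] for i in train_idx],
                                  [ys[i] for i in train_idx])
        correct = sum((xs[i] > threshold) == (ys[i] == 1) for i in test_idx)
        scores.append(correct / len(test_idx))
    half_width = 1.96 * stdev(scores) / sqrt(k)
    return mean(scores), half_width


# Hypothetical data: alternating labels, one noisy feature with class overlap.
labels = [i % 2 for i in range(50)]
feature = [2.0 * y + ((7 * i) % 11) / 3.0 - 1.5 for i, y in enumerate(labels)]

accuracy, half_width = cross_validate(feature, labels, k=5)
print(f"accuracy = {accuracy:.2f} +/- {half_width:.2f}")
```

If a feature were secretly derived from the target, the reported accuracy would sit near 1.0 with a tiny interval, which is the "unbelievably good" signature of a target leak.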

[Figure: workflow stages: Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction; Feature selection; Model training; Model scoring; Evaluation; Train/Test split]
6. Iterate steps (2)-(5) until the test metrics are satisfactory.

[Figure: workflow stages: Access Data; Pre-processing; Feature construction; Model scoring]

Book Recommendation

That's all for our course…

A B
• Raise your right hand if you think A wins
• Raise your left hand if you think B wins
• Don't raise your hand if they are about the same

Get the data, prepare to be humbled
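
One common way to "get the data" for an A/B comparison is a two-proportion z-test on conversion counts. The sketch below is a minimal version under stated assumptions: the counts are hypothetical, the function name `two_proportion_z_test` is invented for this example, and the pooled normal approximation is only reasonable for large samples.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: conversions and users in each variant.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone; it does not say the difference is large enough to matter for the business.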

Any statistic that appears interesting is almost certainly a mistake
If something is "amazing", find the flaw
Examples:
• If you have a mandatory birth date field and people think it's unnecessary, you'll find lots of 11/11/11 or 01/01/01
• If you have an optional drop-down, do not default to the first alphabetical entry, or you'll have lots of jobs = Astronaut
• The previous Office example assumes click maps to revenue. Seemed reasonable, but when the results look so extreme, find the flaw (conversion rate is not the same; see why?)
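
The birth-date and drop-down examples suggest a cheap sanity check: look for values that dominate a column far more than they plausibly should. The sketch below is one hypothetical way to do that; the `dominant_values` helper, the 15% cutoff, and the sample column are assumptions for illustration, not part of the original slides.

```python
from collections import Counter


def dominant_values(values, min_share=0.15):
    """Return (value, share) pairs covering at least min_share of the column,
    most common first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count / total)
            for value, count in counts.most_common()
            if count / total >= min_share]


# Hypothetical birth-date column with placeholder entries typed by users
# who did not want to fill in the mandatory field.
birth_dates = [
    "03/12/84", "11/11/11", "07/30/90", "11/11/11", "01/01/01",
    "09/02/75", "11/11/11", "01/01/01", "12/25/68", "11/11/11",
]

for value, share in dominant_values(birth_dates):
    print(f"suspicious value {value!r}: {share:.0%} of rows")
```

A flagged value is only a prompt to investigate; some columns legitimately have a dominant value.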

Data Trumps Intuition

Sir Ken Robinson

• OEC = Overall Evaluation Criterion

• Controlled Experiments in one slide
• Examples: you're the decision maker

It is difficult to get a man to understand something when his salary depends upon his not understanding it
-- Upton Sinclair

Hubris

Cultural Stage 2Insight through Measurement and Control
bull Semmelweis worked at Viennarsquos General Hospital animportant teachingresearch hospital in the 1830s-40s
bull In 19th-century Europe childbed fever killed more than a million women
bull Measurement the mortality rate for women giving birth was
bull 15 in his ward staffed by doctors and students
bull 2 in the ward at the hospital attended by midwives
Deriving Knowledge from Data at Scale
Cultural Stage 2Insight through Measurement and Control
bull He tried to control all differences
bull Birthing positions ventilation diet even the way laundry was done
bull He was away for 4 months and death rate fell significantly when he was away Could it be related to him
bull Insight
bull Doctors were performing autopsies each morning on cadavers
bull Conjecture particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
bull He experiments with cleansing agents
bull Chlorine and lime was effective death rate fell from 18 to 1
Deriving Knowledge from Data at Scale
Semmelweis Reflex
bull Semmelweis Reflex
2005 study inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90000 related deaths annually in the United States
Deriving Knowledge from Data at Scale
Fundamental Understanding
Deriving Knowledge from Data at Scale
HubrisMeasure and
Control
Accept Results
avoid
Semmelweis
Reflex
Fundamental
Understanding
Deriving Knowledge from Data at Scale
bull Controlled Experiments in one slide
bull Examples yoursquore the decision maker
bull Cultural evolution hubris insight through measurement Semmelweis reflex fundamental understanding
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Real Data for the city of Oldenburg
Germany
bull X-axis stork population
bull Y-axis human population
What your mother told you about babies and
storks when you were three is still not right
despite the strong correlational ldquoevidencerdquo
Ornitholigische Monatsberichte 193644(2)
Deriving Knowledge from Data at Scale
Women have smaller palms and live 6 years longer
on average
Buthellipdonrsquot try to bandage your hands
Deriving Knowledge from Data at Scale
causal
Deriving Knowledge from Data at Scale
If you dont know where you are going any road will take you there
mdashLewis Carroll
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
before
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints,
if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data
acquisition (missing data, outliers due to errors in the data collection process,
more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification,
regression, ranking, clustering, forecasting, outlier detection, etc.). Some
combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights,
targets, etc., which can be used for modeling. Feature construction can often
be improved with domain knowledge. The target must be identical to (or a very
good proxy for) the quantitative metric identified in step 1.
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control
bias/variance and to ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here). Be vigilant
against target leaks (which typically lead to
unbelievably good test metrics). This is the
ML-heavy step.
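Step 5 can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the synthetic data, the toy threshold classifier, and the helper names (make_data, fit_threshold, cross_validate, mean_with_interval) are all assumptions made for the example. It runs k-fold cross-validation and reports the mean accuracy with a rough normal-approximation interval. Fold scores are not fully independent, so the interval is only approximate.

```python
import math
import random
import statistics


def make_data(n, rng):
    """Synthetic two-class data (hypothetical): one feature, class 1 shifted up."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.5 * label, 1.0)
        rows.append((x, label))
    return rows


def fit_threshold(train):
    """Toy classifier: a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, test):
    """Fraction of held-out rows where 'x above threshold' matches label 1."""
    hits = sum(1 for x, y in test if (x > threshold) == (y == 1))
    return hits / len(test)


def cross_validate(rows, k, rng):
    """Shuffle, split into k folds, train on k-1 folds, score on the held-out fold."""
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


def mean_with_interval(scores, z=1.96):
    """Mean of the fold scores with a rough normal-approximation interval."""
    m = statistics.mean(scores)
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half, m + half


rng = random.Random(0)
scores = cross_validate(make_data(500, rng), k=10, rng=rng)
mean_acc, low, high = mean_with_interval(scores)
print(f"accuracy = {mean_acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

The threshold is fit only on the training folds, so nothing from the held-out fold reaches the model. A target leak breaks exactly this separation, which is why leaked targets produce test metrics that look too good.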
6. Iterate steps (2)–(5) until the test metrics are satisfactory.
Access Data
Pre-processing
Feature construction
Model scoring
Book Recommendation
That's all for our course…
Data Trumps Intuition
Sir Ken Robinson
• OEC = Overall Evaluation Criterion
• Controlled Experiments in one slide
• Examples: you're the decision maker
It is difficult to get a man to understand something when his
salary depends upon his not understanding it.
-- Upton Sinclair
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
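A gap like 15% versus 2% is the kind of difference a statistical test should confirm is not due to chance. The sketch below applies a two-proportion z-test. The slide gives only the rates, so the ward sizes (1,000 births each) are hypothetical, and the function name two_proportion_z is an assumption for this example.

```python
import math


def two_proportion_z(deaths_a, births_a, deaths_b, births_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = deaths_a / births_a
    p_b = deaths_b / births_b
    # Pooled rate under the null hypothesis that both wards share one rate.
    pooled = (deaths_a + deaths_b) / (births_a + births_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / births_a + 1.0 / births_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 1,000 births per ward, matching the 15% and 2% rates.
z, p_value = two_proportion_z(150, 1000, 20, 1000)
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
```

With samples of this assumed size, the difference is far too large to attribute to chance. The test says nothing about why the wards differ; that is what the controlled changes on the next slide address.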
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight:
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Semmelweis Reflex
• 2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris
Measure and Control
Accept Results (avoid Semmelweis Reflex)
Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and
storks when you were three is still not right,
despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
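A small simulation shows how a strong correlation can arise with no causal link. It is an illustrative sketch with made-up numbers, not the Oldenburg data. A hypothetical confounder (town size) drives both the stork count and the human population, and neither quantity enters the formula for the other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(42)

# Hypothetical confounder: the size of a growing town at each observation.
town_size = [rng.uniform(1.0, 10.0) for _ in range(200)]

# Both quantities depend on town size plus independent noise.
storks = [5.0 * s + rng.gauss(0.0, 3.0) for s in town_size]
people = [1000.0 * s + rng.gauss(0.0, 600.0) for s in town_size]

r_raw = pearson(storks, people)

# Subtract the confounder's known contribution and correlate what is left.
stork_resid = [st - 5.0 * s for st, s in zip(storks, town_size)]
people_resid = [p - 1000.0 * s for p, s in zip(people, town_size)]
r_controlled = pearson(stork_resid, people_resid)

print(f"raw correlation: {r_raw:.2f}")
print(f"after controlling for town size: {r_controlled:.2f}")
```

The raw correlation is high, yet it largely disappears once the confounder is accounted for. In real data the confounder is usually unknown, which is why a controlled experiment is the reliable way to establish causality.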
Women have smaller palms and live 6 years longer
on average
But… don't try to bandage your hands
If you don't know where you are going, any road will take you there
-- Lewis Carroll
• Hippos kill more humans than any other (non-human) mammal (really)
• OEC
  Get the data
• Prepare to be humbled
  The less data, the stronger the opinions…
Out of Class Reading
Eight (8) page conference paper
40 page journal version…
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
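As a rough illustration of steps 2 and 4, the sketch below turns a few raw records into a modeling dataset. The churn-style field names, the median fill for missing values, and the derived "calls per month" feature are all assumptions made for this example; they are not taken from the course material.

```python
from statistics import median

# Hypothetical raw records; None marks a value that was missing at collection time.
raw_records = [
    {"tenure_months": 3, "monthly_spend": 20.0, "support_calls": 4, "cancelled": True},
    {"tenure_months": 40, "monthly_spend": None, "support_calls": 0, "cancelled": False},
    {"tenure_months": 12, "monthly_spend": 55.0, "support_calls": 1, "cancelled": False},
    {"tenure_months": None, "monthly_spend": 30.0, "support_calls": 6, "cancelled": True},
]


def column_median(rows, key):
    """Median of the observed (non-missing) values of one column."""
    return median(r[key] for r in rows if r[key] is not None)


def build_modeling_dataset(rows):
    """Return (features, targets) with missing values filled by the column median."""
    fill = {k: column_median(rows, k) for k in ("tenure_months", "monthly_spend")}
    features, targets = [], []
    for r in rows:
        tenure = r["tenure_months"] if r["tenure_months"] is not None else fill["tenure_months"]
        spend = r["monthly_spend"] if r["monthly_spend"] is not None else fill["monthly_spend"]
        # Constructed feature: support calls per month of tenure (domain-knowledge guess).
        calls_per_month = r["support_calls"] / max(tenure, 1)
        features.append([tenure, spend, calls_per_month])
        # Target: 1 if the customer cancelled, matching the metric chosen in step 1.
        targets.append(1 if r["cancelled"] else 0)
    return features, targets


X, y = build_modeling_dataset(raw_records)
```

In a real project the fill values would be computed on the training split only, so that nothing from the test data leaks into the features.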
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
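One way to report a metric with an interval, as step 5 asks, is k-fold cross-validation. The sketch below is a minimal, standard-library-only illustration on simulated data; the one-feature threshold classifier, the data generator, and the t-based interval over fold scores are assumptions chosen for brevity. Fold scores are not fully independent, so the interval is only approximate.

```python
import random
from statistics import mean, stdev

T_CRIT_95_DF4 = 2.776  # two-sided 95% t critical value for 4 degrees of freedom (k = 5 folds)


def make_data(n, seed=0):
    """Simulated one-feature binary data: class 1 centered at +1, class 0 at -1."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data


def fit_threshold(train):
    """'Train' by placing the decision threshold midway between the class means."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2


def accuracy(threshold, test):
    """Fraction of test rows where 'x above threshold' agrees with the label."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in test)


def cross_validate(data, k=5, seed=0):
    """Return one held-out accuracy per fold; each row is tested exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_threshold(train), folds[i]))
    return scores


scores = cross_validate(make_data(500), k=5)
avg = mean(scores)
half_width = T_CRIT_95_DF4 * stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {avg:.3f} +/- {half_width:.3f} (approx. 95% interval over 5 folds)")
```

The threshold is fit only on the training folds. Fitting anything on the held-out fold, or including a feature derived from the label, is the kind of target leak that produces unbelievably good test metrics.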
Define Objective
Access and Understand the data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2)-(5) until the test metrics are satisfactory.
Access Data
Pre-processing
Feature construction
Model scoring
Book Recommendation
That's all for our course...
Hubris
Cultural Stage 2: Insight through Measurement and Control
• Semmelweis worked at Vienna's General Hospital, an important teaching/research hospital, in the 1830s-40s
• In 19th-century Europe, childbed fever killed more than a million women
• Measurement: the mortality rate for women giving birth was
  • 15% in his ward, staffed by doctors and students
  • 2% in the ward at the hospital attended by midwives
Cultural Stage 2: Insight through Measurement and Control
• He tried to control all differences
  • Birthing positions, ventilation, diet, even the way laundry was done
• He was away for 4 months, and the death rate fell significantly while he was away. Could it be related to him?
• Insight
  • Doctors were performing autopsies each morning on cadavers
  • Conjecture: particles (called germs today) were being transmitted to healthy patients on the hands of the physicians
• He experimented with cleansing agents
  • Chlorine and lime were effective: the death rate fell from 18% to 1%
Semmelweis Reflex
• Semmelweis Reflex
2005 study: inadequate hand washing is one of the prime contributors to the 2 million health-care-associated infections and 90,000 related deaths annually in the United States
Fundamental Understanding
Hubris
Measure and Control
Accept Results (avoid Semmelweis Reflex)
Fundamental Understanding
• Controlled Experiments in one slide
• Examples: you're the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding
• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational "evidence"
Ornithologische Monatsberichte 1936;44(2)
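The stork example can be mimicked with a small simulation. The sketch below uses entirely simulated numbers (not the Oldenburg data) and assumes a single hypothetical confounder, here called `growth`, that drives both series. It shows a strong raw correlation that largely disappears once the confounder's linear effect is removed from both variables.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, zs):
    """Remove the linear effect of zs from ys using a least-squares slope."""
    mz, my = mean(zs), mean(ys)
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum((z - mz) ** 2 for z in zs)
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]


rng = random.Random(42)
# Hypothetical confounder, e.g. how much the surrounding area has grown.
growth = [rng.uniform(0, 10) for _ in range(200)]
# Both quantities depend on growth, but neither depends on the other.
storks = [10 * g + rng.gauss(0, 5) for g in growth]
humans = [1000 * g + rng.gauss(0, 500) for g in growth]

r_raw = pearson(storks, humans)
r_adjusted = pearson(residuals(storks, growth), residuals(humans, growth))
print(f"raw correlation: {r_raw:.2f}, after adjusting for growth: {r_adjusted:.2f}")
```

Adjusting for a confounder only works when the confounder is known and measured. A controlled experiment with random assignment avoids that requirement, which is why it is the preferred way to establish causality.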
Women have smaller palms and live 6 years longer on average
But... don't try to bandage your hands
causal
If you don't know where you are going, any road will take you there
- Lewis Carroll
before
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction

1. Define the objective and quantify it with a metric – optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem – classification, regression, ranking, clustering, forecasting, outlier detection, etc. – some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
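
Step 4 can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration and does not come from the course material: the raw records, the field names (tenure_months, monthly_charge, churned), and the choice of median imputation for missing values.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"tenure_months": 3, "monthly_charge": 70.0, "churned": "yes"},
    {"tenure_months": None, "monthly_charge": 20.0, "churned": "no"},
    {"tenure_months": 24, "monthly_charge": 55.0, "churned": "no"},
    {"tenure_months": 1, "monthly_charge": None, "churned": "yes"},
]

FEATURES = ["tenure_months", "monthly_charge"]


def build_modeling_dataset(records, feature_names, target_name):
    """Return (X, y): imputed numeric feature rows and a 0/1 target."""
    # Impute each feature with the median of its observed values.
    fill = {
        name: median(r[name] for r in records if r[name] is not None)
        for name in feature_names
    }
    X = [
        [
            float(r[name]) if r[name] is not None else float(fill[name])
            for name in feature_names
        ]
        for r in records
    ]
    # The target is the quantity the step 1 metric is defined on.
    y = [1 if r[target_name] == "yes" else 0 for r in records]
    return X, y


X, y = build_modeling_dataset(raw, FEATURES, "churned")
```

In a real pipeline the fill values would normally be computed on the training split only, so that nothing from the test rows influences the features.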

Feature selection
Model training
Model scoring
Evaluation
Train/Test split

5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
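
One way to report a metric with an interval, as step 5 asks, is k-fold cross-validation. The sketch below is a toy under stated assumptions: the data is synthetic, the "model" is a hypothetical one-feature threshold rule, and the interval uses a rough normal approximation over only five fold scores that are not fully independent, so it should be read as indicative rather than exact.

```python
import random
from statistics import mean, stdev


def make_data(n, seed=0):
    """Synthetic 1-D data: label is 1 when a noisy score exceeds 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        label = 1 if x + rng.gauss(0.0, 0.3) > 0 else 0
        data.append((x, label))
    return data


def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' matches the label."""
    return mean(1 if (x > threshold) == (y == 1) else 0 for x, y in rows)


def fit_threshold(train):
    """Pick the cut point on x that maximizes training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in train:
        acc = accuracy(t, train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def cross_validate(data, k=5, seed=0):
    """Return per-fold test accuracy; each row is tested exactly once."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        # Train only on the other folds, never on the held-out one.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(accuracy(fit_threshold(train), test))
    return scores


scores = cross_validate(make_data(200), k=5)
avg = mean(scores)
# Rough 95% interval from the fold-to-fold spread (normal approximation).
half_width = 1.96 * stdev(scores) / len(scores) ** 0.5
print(f"accuracy = {avg:.3f} +/- {half_width:.3f}")
```

If a single feature pushes this kind of score close to 100% on real data, that is usually a prompt to check whether the feature was derived from the target.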

Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split

6. Iterate steps (2) – (5) until the test metrics are satisfactory.

Access Data
Pre-processing
Feature construction
Model scoring

Book Recommendation

That’s all for our course…

• Controlled Experiments in one slide
• Examples: you’re the decision maker
• Cultural evolution: hubris, insight through measurement, Semmelweis reflex, fundamental understanding

• Real data for the city of Oldenburg, Germany
• X-axis: stork population
• Y-axis: human population
What your mother told you about babies and storks when you were three is still not right, despite the strong correlational “evidence”
Ornithologische Monatsberichte 1936;44(2)
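
The stork effect is easy to reproduce with made-up numbers. The sketch below does not use the Oldenburg data; both series are invented and simply grow with a shared hidden driver (the year index), which is enough to produce a near-perfect correlation with no causal link between them.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative numbers only: each series is a trend in t plus small noise.
years = list(range(7))
stork_noise = [0, 5, -3, 4, -2, 6, 1]
people_noise = [0.0, -0.5, 0.4, 0.2, -0.3, 0.1, 0.5]
storks = [130 + 20 * t + d for t, d in zip(years, stork_noise)]
people = [55 + 3 * t + d for t, d in zip(years, people_noise)]

r = pearson(storks, people)
print(f"correlation(storks, people) = {r:.2f}")
```

Neither series is computed from the other, yet the coefficient is high. Only a controlled experiment, where the treatment is assigned at random, separates this kind of shared trend from a causal effect.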

Women have smaller palms and live 6 years longer on average
But… don’t try to bandage your hands

causal

If you don’t know where you are going, any road will take you there
- Lewis Carroll
Deriving Knowledge from Data at Scale
bull Hippos kill more humans than any other (non-human) mammal (really)
bull OEC
Get the data
bull Prepare to be humbled
The less data the stronger the opinionshellip
Deriving Knowledge from Data at Scale
Out of Class Reading
Eight (8) page conference paper
40 page journal versionhellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Course ProjectDue Oct 25th
Deriving Knowledge from Data at Scale
Open Discussion on Course Projecthellip
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1. Define the objective and quantify it with a metric, optionally with constraints if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a "modeling dataset" with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
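Step 4 can be made concrete with a small sketch. It assumes a churn use case (the deck's sample experiment is a customer churn model) and invents every field name, the median imputation value, and the weighting rule for illustration; none of these come from the course material.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RawCustomer:
    # Hypothetical raw fields; real schemas will differ.
    customer_id: str
    months_active: Optional[int]     # may be missing in the raw feed
    support_calls: int
    monthly_spend: float
    cancelled_within_90_days: bool   # the outcome the step 1 metric is defined on

@dataclass
class ModelingRow:
    features: dict
    target: int
    weight: float

def to_modeling_row(raw: RawCustomer, median_months: int = 12) -> ModelingRow:
    # Handle missing data explicitly and keep a flag so the model can use it.
    missing = raw.months_active is None
    months = median_months if missing else raw.months_active
    # The customer_id is deliberately left out: it identifies, it does not predict.
    features = {
        "months_active": months,
        "months_active_missing": int(missing),
        # Domain-knowledge feature: support calls per month of tenure.
        "calls_per_month": raw.support_calls / max(months, 1),
        "monthly_spend": raw.monthly_spend,
    }
    # The target mirrors the step 1 metric (churn within 90 days) exactly.
    target = int(raw.cancelled_within_90_days)
    # Weight: an assumed business choice, high-spend customers count more.
    weight = max(raw.monthly_spend / 50.0, 1.0)
    return ModelingRow(features, target, weight)

raw_rows = [
    RawCustomer("c1", 24, 6, 80.0, True),
    RawCustomer("c2", None, 0, 20.0, False),
]
dataset = [to_modeling_row(r) for r in raw_rows]
for row in dataset:
    print(row)
```

Keeping the target definition in one place makes it easier to check that it still matches the metric chosen in step 1.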
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test, and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
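Both points in step 5, reporting a metric with an interval and watching for target leaks, can be illustrated with a short sketch. The data is synthetic, the threshold classifier is a deliberately tiny stand-in for a real model, and the interval uses a normal approximation over fold scores; all of these are assumptions made for illustration.

```python
import math
import random
import statistics

def make_rows(n=300, seed=1):
    """Synthetic binary data: one noisy signal feature plus a leaked copy of the target."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        target = rng.randint(0, 1)
        signal = rng.gauss(float(target), 1.0)   # weakly informative feature
        leak = float(target)                     # target leak: only known after the outcome
        rows.append({"signal": signal, "leak": leak, "target": target})
    return rows

def fit_threshold(train, feature):
    """Tiny classifier: threshold halfway between the two class means."""
    means = []
    for cls in (0, 1):
        values = [r[feature] for r in train if r["target"] == cls]
        means.append(statistics.mean(values))
    return (means[0] + means[1]) / 2.0

def accuracy(test, feature, threshold):
    hits = sum(1 for r in test if int(r[feature] > threshold) == r["target"])
    return hits / len(test)

def cross_validate(rows, feature, k=10, seed=0):
    """k-fold cross-validation; returns mean accuracy and an approximate 95% interval."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(test, feature, fit_threshold(train, feature)))
    mean = statistics.mean(scores)
    # Normal approximation; fold scores are correlated, so treat this as a rough guide.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(k)
    return mean, (mean - half_width, mean + half_width)

rows = make_rows()
honest_mean, honest_ci = cross_validate(rows, "signal")
leaky_mean, leaky_ci = cross_validate(rows, "leak")
print(f"signal feature: {honest_mean:.3f}, 95% CI ~ ({honest_ci[0]:.3f}, {honest_ci[1]:.3f})")
print(f"leaked feature: {leaky_mean:.3f}  <- suspiciously perfect, inspect the feature")
```

A feature that yields a perfect score is usually a reason to ask when that feature becomes known relative to the outcome, not a reason to celebrate.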
Define Objective
Access and Understand the Data
Pre-processing
Feature and/or Target construction
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
6. Iterate steps (2) to (5) until the test metrics are satisfactory.
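The iteration in step 6 amounts to a loop with a stopping rule. The sketch below is one possible shape for it; the function name, the round budget, and the canned scores are hypothetical, and `fake_run` merely stands in for a full pass through steps 2 to 5.

```python
from typing import Callable, Iterable, Optional, Tuple

def iterate_until_satisfactory(
    candidates: Iterable[dict],
    build_and_evaluate: Callable[[dict], float],
    target_metric: float,
    max_rounds: int = 10,
) -> Tuple[Optional[dict], float, int]:
    """Run steps 2 to 5 for each candidate configuration until the metric is good enough.

    Returns (best configuration, best metric, rounds used).
    """
    best_config, best_metric, rounds = None, float("-inf"), 0
    for config in candidates:
        if rounds >= max_rounds:
            break  # a budget keeps the loop from running indefinitely
        rounds += 1
        metric = build_and_evaluate(config)   # stands in for steps 2 to 5
        if metric > best_metric:
            best_config, best_metric = config, metric
        if best_metric >= target_metric:
            break  # satisfactory: stop iterating
    return best_config, best_metric, rounds

# Hypothetical stand-in for a full data-prep / train / evaluate run.
fake_scores = {"baseline": 0.71, "plus_tenure": 0.78, "plus_calls": 0.86, "plus_all": 0.87}

def fake_run(config: dict) -> float:
    return fake_scores[config["features"]]

configs = [{"features": name} for name in fake_scores]
best, score, used = iterate_until_satisfactory(configs, fake_run, target_metric=0.85)
print(best, score, used)
```

One caveat, in line with the test set blindness principle listed earlier in the deck: tuning repeatedly against the same test set gradually erodes its value as an unbiased check, so it may be safer to iterate on a separate validation split.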
Access Data
Pre-processing
Feature construction
Model scoring
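The production path keeps only four of the stages: access data, pre-processing, feature construction, and model scoring. A minimal sketch of that path follows; the record fields, the imputation default, and the logistic weights are all assumed values standing in for whatever was actually fixed during training.

```python
import math

def access_data():
    """Access data: new, unlabeled records (hypothetical fields)."""
    return [
        {"customer_id": "c7", "months_active": 3, "support_calls": 4},
        {"customer_id": "c8", "months_active": None, "support_calls": 0},
    ]

def preprocess(record, default_months=12):
    """Pre-processing: same missing-value rule that was applied at training time."""
    cleaned = dict(record)
    if cleaned["months_active"] is None:
        cleaned["months_active"] = default_months
    return cleaned

def construct_features(record):
    """Feature construction: no target is built, because the outcome is not yet known."""
    months = max(record["months_active"], 1)
    return [1.0, record["support_calls"] / months]   # bias term + calls per month

def score(features, weights):
    """Model scoring: logistic model with previously trained weights."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Assumed weights, standing in for a model trained and evaluated in steps 1 to 6.
TRAINED_WEIGHTS = [-2.0, 3.0]

def score_batch(records, weights=TRAINED_WEIGHTS):
    """Apply the same transformations used in training, then score each record."""
    return {
        r["customer_id"]: score(construct_features(preprocess(r)), weights)
        for r in records
    }

scores = score_batch(access_data())
for customer_id, probability in scores.items():
    print(f"{customer_id}: churn score {probability:.3f}")
```

The same `score_batch` function could serve a single request (a batch of one) or a scheduled batch job, as long as the pre-processing and feature code is shared with training.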
Book Recommendation

That's all for our course…
Gallery of Experiments
Contributed by the community
Deriving Knowledge from Data at Scale
Azure Machine Learning Studio
Deriving Knowledge from Data at Scale
Sample
Experiments
To help you get started
Deriving Knowledge from Data at Scale
Experiment
Tools that you can use in your
experiment For feature
selection large set of machine
learning algorithms
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Using
classificatio
n
algorithms
Evaluating
the model
Splitting to
Training
and Testing
Datasets
Getting
Data
For the
Experiment
Deriving Knowledge from Data at Scalehttpgalleryazuremlnetbrowsetags=[22Azure20ML20Book22
Deriving Knowledge from Data at ScaleCustomer Churn Model
Deriving Knowledge from Data at Scale
Deployed web service endpoints
that can be consumed by applications
and for batch processing
Deriving Knowledge from Data at Scale
Deriving Knowledge from Data at Scale
Define
Objective
Access and
Understand the
Data
Pre-processing
Feature andor
Target
construction
1. Define the objective and quantify it with a metric, optionally with constraints, if any. This typically requires domain knowledge.
2. Collect and understand the data; deal with the vagaries and biases in the data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy of) the quantitative metric identified in step 1.
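Steps 2 and 4 above can be sketched in a few lines. The example below is a minimal illustration, assuming a hypothetical churn problem: the record fields, the outlier cap of 500 and the weight of 5 for positive examples are invented for the sketch and would come from domain knowledge in practice.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value and 9999.0 is an
# obvious data collection error.
raw_records = [
    {"customer": "a", "tenure_months": 3, "monthly_charges": 80.0, "cancelled": True},
    {"customer": "b", "tenure_months": 40, "monthly_charges": None, "cancelled": False},
    {"customer": "c", "tenure_months": 25, "monthly_charges": 55.0, "cancelled": False},
    {"customer": "d", "tenure_months": 1, "monthly_charges": 9999.0, "cancelled": True},
]


def build_modeling_dataset(records, charge_cap=500.0, positive_weight=5.0):
    """Turn raw records into (features, target, weight) rows."""
    # Step 2: values that are missing or above the cap are treated as
    # unreliable and replaced by the median of the plausible values.
    plausible = [
        r["monthly_charges"]
        for r in records
        if r["monthly_charges"] is not None and r["monthly_charges"] <= charge_cap
    ]
    fill_value = median(plausible)

    rows = []
    for r in records:
        charges = r["monthly_charges"]
        if charges is None or charges > charge_cap:
            charges = fill_value
        # Step 4: features, a target that matches the metric from step 1,
        # and a weight that makes the costlier class count more.
        features = [float(r["tenure_months"]), charges]
        target = 1 if r["cancelled"] else 0
        weight = positive_weight if target == 1 else 1.0
        rows.append((features, target, weight))
    return rows


dataset = build_modeling_dataset(raw_records)
```

Median imputation and a hard cap are only one possible choice; the point is that every cleaning decision is explicit and can be revisited when iterating.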
Feature selection
Model training
Model scoring
Evaluation
Train/Test split
5. Train, test and evaluate, taking care to control bias/variance and to ensure the metrics are reported with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks (which typically lead to unbelievably good test metrics). This is the ML-heavy step.
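To make the point about confidence intervals concrete, here is a minimal sketch of k-fold cross-validation that reports mean accuracy with a rough normal-approximation interval across folds. The nearest-centroid classifier and the synthetic two-class data are stand-ins chosen to keep the example self-contained; they are not the course's model. With only a handful of folds the interval is approximate (fold scores are not independent), so treat it as a sanity check rather than an exact bound.

```python
import random
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_centroids(rows):
    """Return the per-class mean feature vector for (features, label) rows."""
    sums, counts = {}, {}
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def cross_validate(rows, k=5, seed=0):
    """Return mean accuracy and a rough 95% interval across the k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [rows[i] for i in range(len(rows)) if i not in held]
        test = [rows[i] for i in held_out]
        model = train_centroids(train)
        correct = sum(predict(model, x) == y for x, y in test)
        scores.append(correct / len(test))
    center = mean(scores)
    half_width = 1.96 * stdev(scores) / len(scores) ** 0.5
    return center, (center - half_width, center + half_width)


# Synthetic, well-separated two-class data (an assumption for the demo).
rng = random.Random(42)
rows = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(50)]
rows += [([rng.gauss(5, 1), rng.gauss(5, 1)], 1) for _ in range(50)]

mean_acc, (low, high) = cross_validate(rows, k=5)
print(f"accuracy = {mean_acc:.3f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

If a real dataset produced near-perfect scores like this toy one does, that would be a cue to look for a target leak before celebrating.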
Full pipeline diagram: Define Objective, Access and Understand the data, Pre-processing, Feature and/or Target construction, Feature selection, Model training, Model scoring, Evaluation, Train/Test split
6. Iterate steps (2) through (5) until the test metrics are satisfactory.
Access Data
Pre-processing
Feature construction
Model scoring
Book Recommendation
That’s all for our course…